uBlock Origin: exporting the blocked hosts

I’ve been working on and off on the Commerce Filtered Search Engine (CFS), essentially a survey of the web to find sites that are not commercially driven, in order to index them for searching. The idea is that if we can filter out all the click-bait and commercial stuff, what’s left might actually be interesting, novel, and informative (and a fair bit of rubbish, I expect, but perhaps it’ll at least be honest and sincere rubbish).

Up until now, I’ve been using Puppeteer with uBlock Origin. I was able to handle request failures and check for the error net::ERR_BLOCKED_BY_CLIENT, which indicates that a request was deemed to be ad or tracking related. The more hits per page I survey, the more spammy I rank the site.

However, uBlock Origin will likely soon become obsolete on Chrome, because Google are planning to deprecate the blocking WebRequest API, on which uBlock Origin depends, as part of Manifest V3. Therefore, for the sake of longevity, I’m trying to move to Firefox via Selenium instead.

The approach I took with Puppeteer was to read the browser’s logs to check for blocked network requests. However, the logging facilities between the Selenium API, geckodriver and Firefox appear to be unimplemented, so I have given up on getting block logs back that way.

Turns out this is a good thing! My new approach is to run a logging server on localhost, and modify uBlock Origin to send its blocked requests to the logging server, along with the URL that they originated from, which can then be used to refer back to the page that was originally crawled.
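A minimal sketch of such a logging server, assuming Node.js. The /log path, the port in the usage comment, and the record fields pageUrl and blockedUrl are all hypothetical names chosen for illustration; none of them come from uBlock Origin.

```javascript
const http = require('http');

// Validates one POSTed log record. The { pageUrl, blockedUrl } shape is
// an assumption for this sketch, not something uBlock Origin defines.
function parseLogEntry(body) {
    let entry;
    try {
        entry = JSON.parse(body);
    } catch (e) {
        return null;
    }
    if (entry === null || typeof entry !== 'object') return null;
    if (typeof entry.pageUrl !== 'string') return null;
    if (typeof entry.blockedUrl !== 'string') return null;
    return { pageUrl: entry.pageUrl, blockedUrl: entry.blockedUrl };
}

// Returns an http.Server that passes each valid record to onEntry.
function createLogServer(onEntry) {
    return http.createServer((req, res) => {
        if (req.method !== 'POST' || req.url !== '/log') {
            res.writeHead(404);
            res.end();
            return;
        }
        let body = '';
        req.setEncoding('utf8');
        req.on('data', chunk => { body += chunk; });
        req.on('end', () => {
            const entry = parseLogEntry(body);
            if (entry === null) {
                res.writeHead(400);
                res.end();
                return;
            }
            onEntry(entry);
            res.writeHead(204);
            res.end();
        });
    });
}

// Usage (hypothetical port), bound to localhost only:
// createLogServer(e => console.log(JSON.stringify(e))).listen(8765, '127.0.0.1');
```

Keeping the validation in its own function means the crawler side can reuse it when matching log records back to the pages it visited.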

The biggest win with this approach is that I get all ad-block hits, not just those resulting from network requests. I also get details on which ad or tracking servers are being used by which sites, which yields more detailed information on which pages we should keep (e.g. drop pages served with the scummiest of trackers).

On µBlock Origin

The following code (in js/start.js) shows that I can run fetch() from the extension code. This is good, because it means that I can GET/POST logs, even to an external server.

const initializeTabs = async function() {
    console.error("This hit the console correctly");
    // fetch() resolves to a Response object; read the body before logging it.
    fetch('http://www.susa.net/')
      .then( res => res.text() )
      .then( data => { console.log(data) } )
      .catch( err => { console.error(err) } );

    const manifest = browser.runtime.getManifest();
    // ... the remaining code
}
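Since fetch() works from the extension, sending a log record is a small step further. Here is a rough sketch of a POST helper, assuming a logging server at a hypothetical http://127.0.0.1:8765/log endpoint. The names buildLogRequest and postLogEntry are my own for illustration, and the fetchImpl parameter exists only so the function can be exercised without a network.

```javascript
// Hypothetical endpoint; it must match wherever the logging server listens.
const LOG_ENDPOINT = 'http://127.0.0.1:8765/log';

// Builds the fetch() arguments separately so they can be inspected in isolation.
function buildLogRequest(entry) {
    return {
        url: LOG_ENDPOINT,
        options: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(entry),
        },
    };
}

// Resolves to true if the server accepted the record, false otherwise.
// Errors are swallowed: a logging failure should never break filtering.
function postLogEntry(entry, fetchImpl = fetch) {
    const { url, options } = buildLogRequest(entry);
    return fetchImpl(url, options)
        .then(response => response.ok)
        .catch(() => false);
}
```

Swallowing the error is a deliberate choice here: if the logging server is down, the crawl loses data but the browser keeps blocking as normal.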

It seems that js/traffic.js contains the main intercept point for web requests. Let’s see what it gets passed in the ‘details’ parameter.

const onBeforeRequest = function(details) {
    console.error(details);

    const fctxt = µBlock.filteringContext.fromWebrequestDetails(details);
    // ... some code, then this line does the decision making
    const result = pageStore.filterRequest(fctxt);
    // We now know if the request (from fctxt) was blocked.
}

So, onBeforeRequest is the main entry point for normal filtering (there seem to be some special cases where additional or different execution paths occur). This handler is defined in js/traffic.js and, at a glance, appears to be installed by uBlock.webRequest.start(), though I still need to confirm this.

The returned result will be 1 if the request was blocked. The object pageStore comes from pageStores, a Map defined in js/background.js that is keyed on tabId – so each tab has a pageStore object.

The filterRequest() method of pageStore, defined in js/pagestore.js, applies the filter tests sequentially to determine the result. The comments in this file state “A PageRequestStore object is used to store net requests in two ways, to record distinct net requests, and to create a log of net requests”. This looks like a promising source of information!

µBlock.pageStores.get(<tabId>).perLoadBlockedRequestCount   // This looks like the bottom-line number I want

The value of details passed to onBeforeRequest is shown below for one of the many requests triggered by a visit to the Daily Mail. Bear in mind that this is the start of the process, so the filtering decision has not yet been made; this request will probably have been blocked subsequently (it’s apparently trying to post an event-stream!).

{…}
 url: "https://k.p-n.io/event-stream"
 documentUrl: "https://www.dailymail.co.uk/home/index.html"
 originUrl: "https://www.dailymail.co.uk/home/index.html"
 frameAncestors: Array []
 frameId: 0
 incognito: false
 ip: null
 method: "POST"
 parentFrameId: -1
 proxyInfo: null
 requestId: "134"
 tabId: 1
 timeStamp: 1574632113928
 type: "xmlhttprequest"
 <prototype>: Object { … }
 traffic.js:62:13
     onBeforeRequest moz-extension://89c108a7-97b8-44da-9e15-1e4f45602610/js/traffic.js:62
     <anonymous> moz-extension://89c108a7-97b8-44da-9e15-1e4f45602610/js/vapi-background.js:1169
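Most of the fields in that dump are not needed for ranking a site. A minimal sketch of reducing a details object to a log record follows; the function names and the output field names are hypothetical, and only url, documentUrl, originUrl, type and timeStamp are read, as they appear in the dump above.

```javascript
// Returns the hostname of a URL, or '' if the URL cannot be parsed.
function hostnameOf(url) {
    try {
        return new URL(url).hostname;
    } catch (e) {
        return '';
    }
}

// Reduces a webRequest details object to the fields worth logging.
// Output field names are illustrative, not part of uBlock Origin.
function toLogEntry(details) {
    // Prefer documentUrl and fall back to originUrl if it is missing.
    const pageUrl = details.documentUrl || details.originUrl || '';
    return {
        pageUrl: pageUrl,
        pageHost: hostnameOf(pageUrl),
        blockedUrl: details.url,
        blockedHost: hostnameOf(details.url),
        type: details.type,
        timeStamp: details.timeStamp,
    };
}

const sample = toLogEntry({
    url: 'https://k.p-n.io/event-stream',
    documentUrl: 'https://www.dailymail.co.uk/home/index.html',
    originUrl: 'https://www.dailymail.co.uk/home/index.html',
    type: 'xmlhttprequest',
    timeStamp: 1574632113928,
});
// sample.blockedHost is 'k.p-n.io', sample.pageHost is 'www.dailymail.co.uk'
```

Recording the host rather than only the full URL makes it easier to count how many distinct tracking servers a page talks to.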

A pageStore item looks as follows: –

To be added...

We can enable uBlock console logging programmatically by defining consoleLogLevel as 'info' rather than 'unset' in the file js/background.js.

To implement logging of blocked requests, we can hook into js/pagestore.js, after the lines shown below. At this point, the block-count matches the count shown in the shield icon.

    this.perLoadBlockedRequestCount += aggregateCounts & 0xFFFF;
    this.perLoadAllowedRequestCount += aggregateCounts >>> 16 & 0xFFFF;
    this.journalLastCommitted = undefined;
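The two masked lines read a pair of 16-bit counters out of a single integer: blocked requests in the low 16 bits, allowed requests in the high 16 bits. A small sketch of that packing scheme, for illustration only; packCounts and unpackCounts are hypothetical helpers, not functions from uBlock Origin.

```javascript
// Packs two 16-bit counters into one unsigned 32-bit integer.
function packCounts(blocked, allowed) {
    return (((allowed & 0xFFFF) << 16) | (blocked & 0xFFFF)) >>> 0;
}

// Reverses packCounts using the same expressions as the lines above.
// Note that >>> binds tighter than &, so the shift happens first.
function unpackCounts(aggregateCounts) {
    return {
        blocked: aggregateCounts & 0xFFFF,
        allowed: aggregateCounts >>> 16 & 0xFFFF,
    };
}

const counts = unpackCounts(packCounts(3, 5));
// counts.blocked is 3, counts.allowed is 5
```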

The code that should be added is shown below. It logs the block-list state of the tab once there is at least one blocked request, and it tries to avoid duplicating log entries by logging again only when the count has grown since the last entry. The function addItemFromPagestore() is defined in the file js/extlogger.js, which I have created to separate my changes from the main uBlock codebase as much as possible.

    if (this.perLoadBlockedRequestCount > 0 &&
            (typeof this.previousBlockCount == "undefined" ||
             this.perLoadBlockedRequestCount > this.previousBlockCount)) {
        addItemFromPagestore(this);
        this.previousBlockCount = this.perLoadBlockedRequestCount;
    }
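To see how this guard behaves over several commits, here is the same condition pulled out into a standalone function and run against a plain object standing in for a pageStore. The name logIfBlockCountGrew and the logFn parameter are hypothetical; logFn stands in for addItemFromPagestore.

```javascript
// Same condition as the guard above, applied to any object that carries
// perLoadBlockedRequestCount. Returns true if an entry was logged.
function logIfBlockCountGrew(store, logFn) {
    if (store.perLoadBlockedRequestCount > 0 &&
            (typeof store.previousBlockCount === 'undefined' ||
             store.perLoadBlockedRequestCount > store.previousBlockCount)) {
        logFn(store);
        store.previousBlockCount = store.perLoadBlockedRequestCount;
        return true;
    }
    return false;
}

// Simulate four commits on one tab, with blocked counts 0, 2, 2 and 5.
const store = { perLoadBlockedRequestCount: 0 };
const logged = [];
for (const count of [0, 2, 2, 5]) {
    store.perLoadBlockedRequestCount = count;
    logIfBlockCountGrew(store, s => logged.push(s.perLoadBlockedRequestCount));
}
// logged is now [2, 5]: nothing at zero, and no repeat for the unchanged 2
```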

Note that background.html should be modified to include js/extlogger.js in the list of scripts loaded, ideally as near to the top as possible (I load after js/console.js).

    <script src="js/extlogger.js"></script>
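For orientation, a hypothetical sketch of what js/extlogger.js could contain. This is not my actual file: the queue, the snapshot shape and drainExtLoggerQueue are illustrative, and the tabId and rawURL property names on pageStore are assumptions that should be checked against js/pagestore.js. Only perLoadBlockedRequestCount is taken from the code shown earlier.

```javascript
// Pending log entries, held in memory until something sends them on.
const extLoggerQueue = [];

// Copies the interesting values out of a pageStore at this moment in time.
function snapshotFromPagestore(pageStore) {
    return {
        // tabId and rawURL are assumed property names (see note above).
        tabId: pageStore.tabId !== undefined ? pageStore.tabId : null,
        pageUrl: typeof pageStore.rawURL === 'string' ? pageStore.rawURL : null,
        blockedCount: pageStore.perLoadBlockedRequestCount,
        loggedAt: Date.now(),
    };
}

// Called from the hook in js/pagestore.js.
function addItemFromPagestore(pageStore) {
    extLoggerQueue.push(snapshotFromPagestore(pageStore));
}

// Removes and returns all queued entries, e.g. to POST them in one batch.
function drainExtLoggerQueue() {
    return extLoggerQueue.splice(0, extLoggerQueue.length);
}
```

Taking a snapshot rather than keeping a reference to the pageStore matters because the counters keep changing after the hook returns.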
