uBlock Origin: exporting the blocked hosts

I’ve been working on and off on the Commerce Filtered Search Engine (CFS), essentially a survey of the web to find sites that are not commercially driven, in order to index them for searching. The idea is that if we can filter out all the click-bait and commercial stuff, what’s left might actually be interesting, novel, and informative (and a fair bit of rubbish, I expect, but perhaps it’ll at least be honest and sincere rubbish).

Up until now, I’ve been using Puppeteer with uBlock Origin. I was able to handle request failures and check for the error net::ERR_BLOCKED_BY_CLIENT, which indicates that a request was deemed to be ad or tracking related. The more hits per page I survey, the more spammy I rank the site.
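
For reference, the Puppeteer side of this can be sketched roughly as below. The 'requestfailed' event, request.failure() and request.url() are Puppeteer's own API; the helper names isBlockedByClient and attachBlockCounter are made up for this illustration, and the count is only the raw input to any ranking, not the ranking itself.

```javascript
// Sketch only: count the requests that were blocked on a page.
// `page` is anything with an .on(event, handler) method, e.g. a Puppeteer Page.
const BLOCKED = 'net::ERR_BLOCKED_BY_CLIENT';

// True if a request failure object reports a client-side (extension) block.
function isBlockedByClient(failure) {
  return failure !== null && failure !== undefined &&
         typeof failure.errorText === 'string' &&
         failure.errorText.includes(BLOCKED);
}

// Attach a listener and return a counter object that is updated in place.
function attachBlockCounter(page) {
  const counter = { blocked: 0, urls: [] };
  page.on('requestfailed', (request) => {
    if (isBlockedByClient(request.failure())) {
      counter.blocked += 1;
      counter.urls.push(request.url());
    }
  });
  return counter;
}
```

In use, attachBlockCounter(page) would be called before page.goto(), and counter.blocked read once the page has settled.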

However, uBlock Origin will likely soon become obsolete on Chrome, because Google are planning to deprecate the WebRequest API, on which uBlock Origin depends, as part of Manifest V3. Therefore, for the sake of longevity, I’m trying to move to Firefox via Selenium instead.

So far, I have not been able to get suitable logging data back to my client. Even the Browser Log no longer seems to report extension logs, which now appear to be compartmentalised into their own console. Regardless, the logging facilities between the Selenium API, geckodriver and Firefox appear to be unimplemented, so I’m giving up on getting block logs back that way.

Therefore, my next approach will be to run a logging server on localhost, and modify uBlock Origin to build a list of blocked requests which is then pushed to the logging server, along with the URL that they originated from. This can then be used to refer back to the page that was originally crawled.
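
A minimal sketch of what such a logging server could look like in Node.js, using only the standard library. The payload shape (pageUrl plus a list of blocked URLs), the function names and the port in the usage note are all assumptions for illustration; nothing here is a uBlock Origin format.

```javascript
// Sketch of a localhost logging server (Node.js, standard library only).
// Assumed payload: { "pageUrl": "...", "blocked": ["https://...", ...] }
const http = require('http');

// Merge one report into the store, keyed on the URL of the crawled page.
// Returns the number of distinct blocked URLs now recorded for that page.
function recordReport(store, jsonText) {
  const report = JSON.parse(jsonText);
  if (typeof report.pageUrl !== 'string' || !Array.isArray(report.blocked)) {
    throw new Error('malformed report');
  }
  const seen = store.get(report.pageUrl) || new Set();
  for (const url of report.blocked) seen.add(url);
  store.set(report.pageUrl, seen);
  return seen.size;
}

// Build (but do not start) a server that accepts POSTed reports.
function createLogServer(store) {
  return http.createServer((req, res) => {
    if (req.method !== 'POST') {
      res.writeHead(405);
      res.end();
      return;
    }
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      try {
        const count = recordReport(store, body);
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end(String(count));
      } catch (err) {
        // Bad JSON or a malformed report.
        res.writeHead(400);
        res.end();
      }
    });
  });
}
```

It would be started with something like createLogServer(new Map()).listen(8765, '127.0.0.1'), where the port is arbitrary. Binding to 127.0.0.1 keeps the server off the network.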

This page will document my efforts, and as always will be mostly notes-to-self. However, if anyone reading this has experience with Firefox extensions, or uBlock Origin specifically, then I would gratefully accept any suggestions you may have regarding what I am trying to achieve.

On µBlock Origin

The following code (in js/start.js) shows that I can run fetch() from the extension code. This is good, because it means that I can GET/POST logs, even to an external server.

const initializeTabs = async function() {
    console.error("This hit the console correctly");
    // The fetch() call itself was lost from this listing; the URL is a placeholder.
    fetch("http://localhost/")
      .then( res => res.text() )
      .then( data => { console.log(data) } );

    const manifest = browser.runtime.getManifest();
    // ... the remaining code

It seems that js/traffic.js contains the main intercept point for web requests. Let’s see what it gets passed in the ‘details’ parameter.

const onBeforeRequest = function(details) {

    const fctxt = µBlock.filteringContext.fromWebrequestDetails(details);
    // ... some code, then this line does the decision making
    const result = pageStore.filterRequest(fctxt);
    // We now know if the request (from fctxt) was blocked.

So, onBeforeRequest is the main entry point for normal filtering (there seem to be some special cases where additional or different execution paths occur). This handler is defined in js/traffic.js, and appears to be installed by uBlock.webRequest.start() (judging at a glance; I still need to check this).

The returned result will be 1 if the request was blocked. The object pageStore comes from pageStores, a Map defined in js/background.js that is keyed on tabId – so each tab has a pageStore object.

The filterRequest() method of pageStore, defined in js/pagestore.js, applies the filter tests sequentially to determine the result. The comments in this file state “A PageRequestStore object is used to store net requests in two ways, to record distinct net requests, and to create a log of net requests”. This looks like a promising source of information!

µBlock.pageStores.get(<tabId>).perLoadBlockedRequestCount   // This looks like the bottom-line number I want

The value of details, as passed to onBeforeRequest, is shown below for one of the many requests triggered by a visit to the Daily Mail. Bear in mind that this is the start of the process, and this request will probably have been subsequently blocked (it’s apparently trying to post an event-stream!).

 url: "https://k.p-n.io/event-stream"
 documentUrl: "https://www.dailymail.co.uk/home/index.html"
 originUrl: "https://www.dailymail.co.uk/home/index.html"
 frameAncestors: Array []
 frameId: 0
 incognito: false
 ip: null
 method: "POST"
 parentFrameId: -1
 proxyInfo: null
 requestId: "134"
 tabId: 1
 timeStamp: 1574632113928
 type: "xmlhttprequest"
 <prototype>: Object { … }
     onBeforeRequest moz-extension://89c108a7-97b8-44da-9e15-1e4f45602610/js/traffic.js:62
     <anonymous> moz-extension://89c108a7-97b8-44da-9e15-1e4f45602610/js/vapi-background.js:1169

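Only a handful of those fields matter for my purposes. As a rough sketch, they could be reduced to a log record as follows. The input field names are the ones in the dump above; the function name, the output shape and the fallback from documentUrl to originUrl (in case the former is missing) are my assumptions.

```javascript
// Sketch: reduce a webRequest `details` object to the fields worth logging.
// toLogRecord and its output shape are illustrative, not part of uBlock Origin.
function toLogRecord(details) {
  return {
    // The page that triggered the request; fall back if documentUrl is absent.
    pageUrl: details.documentUrl || details.originUrl || null,
    requestUrl: details.url,
    method: details.method,
    type: details.type,
    tabId: details.tabId,
    timeStamp: details.timeStamp,
  };
}
```
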
A pageStore item looks as follows:

To be added...

We can enable uBlock console logging programmatically by defining consoleLogLevel as 'info' rather than 'unset' in the file js/background.js.

To implement logging of blocked requests, we can hook into js/pagestore.js, after the lines shown below. At this point, the block-count matches the count shown in the shield icon.

    this.perLoadBlockedRequestCount += aggregateCounts & 0xFFFF;
    this.perLoadAllowedRequestCount += aggregateCounts >>> 16 & 0xFFFF;
    this.journalLastCommitted = undefined;
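
The masking in those lines implies that aggregateCounts packs two 16-bit counters into one integer: blocked requests in the low half, allowed requests in the high half. A small worked example of that arithmetic is below. The unpack() expressions are copied from the lines above; pack() is a hypothetical helper written only to produce test values, and is not how uBlock Origin builds the number.

```javascript
// Two 16-bit counters in one 32-bit integer:
// low 16 bits = blocked count, high 16 bits = allowed count.

// Hypothetical helper, for illustration only.
function pack(blocked, allowed) {
  return ((allowed & 0xFFFF) << 16 | (blocked & 0xFFFF)) >>> 0;
}

// Same expressions as in js/pagestore.js.
// >>> binds tighter than &, so the second line is (x >>> 16) & 0xFFFF.
function unpack(aggregateCounts) {
  return {
    blocked: aggregateCounts & 0xFFFF,
    allowed: aggregateCounts >>> 16 & 0xFFFF,
  };
}

const counts = unpack(pack(3, 40));
console.log(counts.blocked, counts.allowed); // prints "3 40"
```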

The code that should be added is shown below. It logs the block-list state of the tab if there is at least one blocked request, and it tries to avoid duplicating log entries by only logging when the blocked count has increased since the previous log. The function addItemFromPagestore() is defined in the file js/extlogger.js, which I have created to separate my changes from the main uBlock codebase as much as possible.

    if (this.perLoadBlockedRequestCount > 0 &&
            (typeof this.previousBlockCount == "undefined" ||
                this.perLoadBlockedRequestCount > this.previousBlockCount)) {
        this.previousBlockCount = this.perLoadBlockedRequestCount;
        // ... call addItemFromPagestore() here (arguments not shown in this excerpt)
    }
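
The contents of js/extlogger.js are not reproduced here. As a simplified sketch of the sort of thing it needs to do, the following builds a JSON report and POSTs it with fetch(). Everything in it is an assumption for illustration: the endpoint address, the function names buildReport and sendReport, and the payload shape. The page URL and blocked URLs are passed in explicitly because the layout of a pageStore object is not documented above.

```javascript
// Hypothetical sketch, not the real js/extlogger.js.
const LOG_ENDPOINT = 'http://127.0.0.1:8765/'; // assumed address of the local logging server

// Build the JSON payload for one crawled page, dropping duplicate URLs.
function buildReport(pageUrl, blockedUrls, blockedCount) {
  return JSON.stringify({
    pageUrl: pageUrl,
    blockedCount: blockedCount,
    blocked: Array.from(new Set(blockedUrls)),
  });
}

// POST a report. fetchFn defaults to the global fetch() and can be swapped for testing.
function sendReport(body, fetchFn = fetch) {
  return fetchFn(LOG_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: body,
  });
}
```

Keeping the payload construction separate from the network call means the former can be checked without a server running.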

Note that background.html should be modified to include js/extlogger.js in the list of scripts loaded, ideally as near to the top as possible (I load after js/console.js).

    <script src="js/extlogger.js"></script>

The source code hasn’t been released for this yet, simply because it’s in a constant state of flux, and because there are a number of components that have to be tied tenuously together (Puppeteer, JavaScript, Firebase, Java, Lucene, Apache Tomcat). If you really want it, send me an email and I’ll make a tarball (all my code will be GPLv3).
