Personal Search Engine

I’ve been playing with the idea of Commerce Filtered Search, which is essentially a search engine for content that doesn’t exist simply to get you to click on it. Sort of like how the Internet used to be: a place for those with boundless curiosity, made by those who just like to share what interests them.

My original plan was to ‘seed’ a web crawler with high-quality links, and fan out from there. The seed links were extracted from Hacker News, and Reddit was on my radar as a further source. I also wrote an add-on for Firefox that lets me record links that I happen to find for subsequent crawling. The resulting search engine (with an outdated index) can be found here. I find it hugely promising, but each iteration seems to yield more complexity – search is hard.

So, I’m paring back. My goal right now is a personal search engine that I can throw URLs at from various good sources and have them fed into the index. I’m keeping a record of useful tools to help with this, and this page is currently a repository of information that should evolve into a more coherent post.

Interesting stuff to consider:

Sonic describes itself as “Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.” https://github.com/valeriansaliou/sonic

This seems promising because it claims to work well with limited resources, which is something I’d trade some of Lucene’s flexibility for.

Linkalot is a basic (at the time of writing) self-hosted link manager, and can be used to accumulate interesting links. It works well with Send-Tab-URL extension. https://gitlab.com/dmpop/linkalot

jusText is used for boilerplate removal. It has many implementations and forks of the original unmaintained algorithm. You will likely find things like cookie consent and image captions among your text. https://pypi.org/project/jusText/

Preliminary report suggests 30% Alcohol effective against SARS-CoV-2 (Coronavirus)

Warning: I am neither a biologist, nor a journalist. My reading of this is naive, and is posted simply to draw attention to a study by seemingly reputable scientists. Seek qualified opinions, this may be nonsense.

Please include a link to this post if you pass this information on. To those people who misreport things for the purpose of click-bait, please don’t!

There is a preliminary report, posted on 17th March, which seems to find that lower concentrations of alcohol are effective against SARS-CoV-2.
I’m not qualified to have a meaningful opinion on this, but I’ll post it here to see if anyone can shed more light on it.

https://www.biorxiv.org/content/10.1101/2020.03.10.986711v1.full

Notably, both tested alcohols, ethanol and 2-propanol were efficient in inactivating the virus in 30s at a minimal final concentration of at least 30%

https://www.biorxiv.org/content/10.1101/2020.03.10.986711v1.full

Points to bear in mind:

  • This is only one study.
  • I may be misinterpreting the results.
  • The study was conducted in laboratory conditions, rather than the real world.
  • The study has not yet been peer-reviewed.

The implication, if true, is that alcohol such as Vodka or Rum could be used as a sanitiser for hands and surfaces, particularly relevant given the current shortages of pharmaceutical and industrial products.

All advice is that hand-washing with soap is the most effective way to destroy SARS-CoV-2. However, where running water is not readily available, an effective hand-sanitiser would be the next best thing.

My own approach is currently to act on the findings in this study, but as a last resort.

Hand-washing with soap is, by far, my first choice. A hand-sanitiser that meets WHO standards is my second choice.

If neither of these is available to me, then I’m using a small spray bottle filled with some 40% alcohol (by volume).

The nghttpx Reverse Proxy

I want to expose different containers on specific URL paths, possibly on different hosts, and nghttpx, from the nghttp2 library by Tatsuhiro Tsujikawa, does this in an intuitive way, and does a lot more besides.

    sudo apt-get install nghttp2-proxy

An example configuration is installed in /etc/nghttpx/nghttpx.conf, which configures nghttpx to listen for cleartext HTTP on localhost:3000 and proxy (i.e. forward) requests to localhost:80.

    frontend=127.0.0.1,3000;no-tls
    backend=127.0.0.1,80
Continue reading

uBlock Origin: exporting the blocked hosts

I’ve been working on and off on the Commerce Filtered Search Engine (CFS), essentially a survey of the web to find sites that are not commercially driven, in order to index them for searching. The idea is that if we can filter out all the click-bait and commercial stuff, what’s left might actually be interesting, novel, and informative (and a fair bit of rubbish, I expect, but perhaps it’ll at least be honest and sincere rubbish).

Up until now, I’ve been using Puppeteer with uBlock Origin. I was able to handle request failures and check for the error net::ERR_BLOCKED_BY_CLIENT, which indicates that a request was deemed to be ad or tracking related. The more hits per page I survey, the more spammy I rank the site.
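To make the ranking idea concrete, here is a minimal sketch in Python rather than the Puppeteer code itself. It assumes the failure text of each failed request has already been collected per page; the function names and the choice of a simple mean are my own illustration, not what CFS actually does.

```python
BLOCKED = "net::ERR_BLOCKED_BY_CLIENT"

def blocked_hits(failure_texts):
    """Count the failed requests on one page that the blocker rejected."""
    return sum(1 for text in failure_texts if BLOCKED in text)

def spam_score(pages):
    """Mean number of blocked requests per surveyed page (0.0 if no pages).

    `pages` is a list with one entry per surveyed page, each entry being
    the list of failure strings recorded for that page (an assumed format).
    """
    if not pages:
        return 0.0
    return sum(blocked_hits(page) for page in pages) / len(pages)
```

A higher score means more blocked requests per page, and so a spammier ranking for the site.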

Continue reading

Longan Nano (GD32VF103)

This was an impulse purchase, because for some unfathomable reason I really wanted a RISC-V CPU to play with. Sipeed’s Longan Nano, a small board based on the GigaDevice GD32VF103 SoC is just that; a RISC-V CPU with a bundle of decent peripherals. There are links below if you want details on this board.

The GD32V implements an RV32IMAC CPU, where ‘RV32I’ refers to a 32-bit CPU with the Base Integer Instruction Set, the ‘M’ denotes the Standard Extension for Integer Multiplication and Division, the ‘A’ denotes the Extension for Atomic Instructions, and the ‘C’ refers to the Extension for Compressed Instructions (i.e. 16-bit opcodes for commonly used instructions, useful for this memory constrained device).
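The naming scheme is regular enough to pick apart mechanically. The sketch below is just an illustration in Python: it splits an ISA string into its register width and extension letters, and only knows the four letters described above. The function name and the fallback text are my own.

```python
# Only the letters discussed in this post; anything else is reported as unknown.
EXTENSIONS = {
    "I": "Base Integer Instruction Set",
    "M": "Standard Extension for Integer Multiplication and Division",
    "A": "Extension for Atomic Instructions",
    "C": "Extension for Compressed Instructions",
}

def describe_isa(isa):
    """Split e.g. 'RV32IMAC' into (32, [(letter, description), ...])."""
    name = isa.upper()
    if not name.startswith("RV"):
        raise ValueError("expected an ISA string starting with 'RV'")
    rest = name[2:]
    digits = ""
    for ch in rest:
        if not ch.isdigit():
            break
        digits += ch
    if not digits:
        raise ValueError("expected a register width such as 32 or 64")
    letters = rest[len(digits):]
    described = [(ch, EXTENSIONS.get(ch, "not covered by this sketch"))
                 for ch in letters]
    return int(digits), described
```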

Continue reading

Seeed, 4PX, and Yodel – Really Quite Good

So Seeed finally dispatched my Sipeed Longan Nano, a little RISC-V SoC that was too interesting to resist. I say finally not because they were tardy, but because it was on back order. I had to write a few words about the delivery, because while I’ve always been impressed by Chinese suppliers, the ability to track with the level of detail shown below, from Shenzhen to just south of Edinburgh, is particularly impressive.

Here’s the tracking from Shenzhen, China to Livingston, Scotland

Continue reading

LXD eMail, SMTP/IMAP/WebMail with OpenSMTPD, Dovecot, and Roundcube.

Email is one of those conceptually simple things that are a lot more complex in practice – get it wrong and you miss incoming mail, or your mail gets lost or junked, or spammers exploit your server.

This post is intended for technical people who want to run their own personal mail server, and describes the steps required to get a basic server set up that can be run safely and reliably.

Continue reading

VirtualBox VMDK for Raw Disk Access on a Windows Host

Do not act on this article unless you are prepared to trash your disks, or you are absolutely sure you understand what you are doing. Messing with raw disk sectors is risky!

VirtualBox allows us to use a disk device directly, rather than using a file as a virtual volume. For me, since I have two SSDs in my laptop, it meant I could tinker with virtual machines without risking my Windows 7 partition, while also being able to boot the VMs on real hardware if I wanted.

Continue reading

Arduino Yun Reading WH1080 using AUREL RX-4MM5

Here’s the sketch. It just reads the data and dumps it to the console; the bridge can be used to send the data to the GNU/Linux side of the Yun.

See the other post on doing this with a Raspberry Pi for some code to turn the data into something useful.

I’m using the MCU of the Yun to do the RF stuff, with the AUREL RX-4MM5 (a proper OOK receiver) as the radio. It seems a lot more dependable than the Raspberry Pi + RFM01 (or RFM12B).

Continue reading

Raspberry Pi reading WH1081 weather sensors using an RFM01 and RFM12b

This article describes using an RFM01 or RFM12b FSK RF transceiver with a Raspberry Pi to receive sensor data from a Fine Offset WH1080 or WH1081 (specifically a Maplin N96GY) weather station’s RF transmitter.

I originally used the RFM12b, simply because I had one to hand, but later found that the RFM01 appears to work far better – the noise immunity and the range of the RFM01 in OOK mode are noticeably better. They’re pin compatible, but the SPI registers differ between the modules, in terms of both register address and function.

This project is changing to be microcontroller based, using an AM receiver module (Aurel RX-4MM5), which is a much more effective approach; see the post ‘Arduino Yun Reading WH1080 using AUREL RX-4MM5’. Currently testing on Arduino Yun, but will probably move to a more platform agnostic design to support Dragino and Carambola etc.

Continue reading

LXD now runs my WordPress

Here are some notes on how I used LXD to run a container for WordPress. This is (a lot) more convenient than using Docker, which was my original approach to getting my WordPress site into a container. The main advantage for me is that a single container runs all the components together – no need for the ‘wiring’ between containers for each process.

There is a bash script that automates this at https://github.com/Kevin-Sangeelee/lxd-wordpress. The script is a more complete description of the process than these notes, since it also automatically configures SSL/TLS and Exim.

Continue reading

PIC/MOSFET PWM Model Train Controller

Having been unable to resist buying some old Hornby OO Gauge bits from the second hand cabinet in a model shop, justification came from the educational value it would offer my son if I could make a speed controller, perhaps adding a sensor or two – the essence of industrial control and feedback mechanisms. Being three and a half, he just wanted to make the train fly off the track, but at least he enjoyed it.

This is a project to create a model train speed controller using the Pulse Width Modulation (PWM) output of a PIC16F690 microcontroller, to drive a MOSFET that ultimately controls the voltage on the tracks. The train will automatically switch into reverse when the control is turned anti-clockwise through the zero point. Continue reading
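The control mapping can be sketched in a few lines. This is Python for illustration only, not the PIC firmware. It assumes a 10-bit ADC reading from the control knob, a zero point at mid-scale, a small dead band around it, and that lower readings mean anti-clockwise; all of these values and names are my assumptions.

```python
def throttle_to_drive(adc_value, adc_max=1023, deadband=16):
    """Map a control-knob reading to (direction, duty), duty in 0.0..1.0.

    Readings within `deadband` of the centre are treated as the zero point.
    Below it the train reverses; above it the train runs forward.
    """
    centre = adc_max // 2
    offset = adc_value - centre
    if abs(offset) <= deadband:
        return ("stop", 0.0)
    span = centre - deadband
    # Clamp, because the upper half of the range is one count wider.
    duty = min(1.0, (abs(offset) - deadband) / span)
    direction = "forward" if offset > 0 else "reverse"
    return (direction, duty)
```

The dead band is there so that the direction does not flip back and forth when the knob rests near the zero point.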

Braun ThermoScan Fix – Low Battery Warning Switch Off

We have a Braun Thermoscan infra-red (IR) thermometer that has been working perfectly for about five years. It started complaining about low batteries and shutting off, even after I replaced them with new batteries that I had checked had plenty of charge.

When I opened it, I discovered that the batteries connect to the circuit board via simple metal clip contacts, and that the contacts had some corrosion on them, which was preventing power from getting to the board, hence why it was complaining of low batteries.

So a very simple fix is to just clean the corrosion from the battery terminals inside the thermometer. You’ll need a Torx T9 screwdriver (Maplin, eBay, Amazon, maybe pound shops).
Continue reading

Raspberry Pi Power Controller

This article is a work in progress to create a power-controller for the Raspberry Pi based on a PIC microcontroller and MOSFET. The PIC implements an I2C slave to allow power control, and also to approximate the registers of a PCF8563 Real Time Clock (RTC) chip, to allow timed wake-up of the Pi.

  • Power the Raspberry Pi off and on with a push-button.
  • Fully shut down the Raspberry Pi on ‘shutdown -h’.
  • Wake-up at a specified time (one-off or periodic).
  • Monitor the supply voltage.
  • Log glitches in the power-supply (e.g. caused by USB device activity).
  • Maintain the time using a CR2032 button cell.

During power-down, the circuit currently draws around 5μA of current, useful where a battery is being used to power the Pi (remote solar-power applications, or in-car systems, for example).

The Pi is able to instruct the PIC to power it down using a short I2C command sequence. Wake up events include a push-button, or other voltage-sense on an input pin. Continue reading

Raspberry Pi – Driving a Relay using GPIO

There’s something exciting about crossing the boundary between the abstract world of software and the physical ‘real world’, and a relay driven from a GPIO pin seemed like a good example of this. Although a simple project, I still learned some new things about the Raspberry Pi while doing it.

There are only four components required, and the cost for these is around 70p, so it would be a good candidate for a classroom exercise. Even a cheap relay like the Omron G5LA-1 5DC can switch loads of 10A at 240V. Continue reading

Fighting Click-Bait

The Internet seems awash with ‘click-bait’ and sponsored content – articles created primarily to generate money, sometimes plagiarised, misleading, exaggerated, or provocative just to get views. The good stuff – articles often written simply because it’s good to share knowledge and ideas – is getting harder to find.

My proposal is to create a search engine that, rather than systematically crawling the web, starts with a seed corpus of high quality links, and fans out from there, stopping when the quality drops. The result will hopefully be a searchable index of pages that were created to impart information rather than to earn cash from eyeballs.
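As a rough sketch of the fan-out idea, the Python below walks outward from the seeds breadth-first and refuses to index or follow any page whose quality falls below a threshold. The `links_from` and `quality` callables, the threshold and the page limit are all placeholders of mine; how to fetch links and how to measure quality are the hard parts and are not shown.

```python
from collections import deque

def fan_out(seeds, links_from, quality, threshold=0.5, max_pages=1000):
    """Breadth-first walk from the seeds, stopping where quality drops.

    links_from(url) -> iterable of URLs linked from that page (assumed).
    quality(url)    -> score between 0.0 and 1.0 (assumed).
    Returns the accepted URLs in the order they were visited.
    """
    queue = deque(seeds)
    seen = set(seeds)
    accepted = []
    while queue and len(accepted) < max_pages:
        url = queue.popleft()
        if quality(url) < threshold:
            continue  # quality dropped: neither index it nor follow its links
        accepted.append(url)
        for nxt in links_from(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return accepted
```

Because a low-quality page is never expanded, anything reachable only through it is left out of the index, which is the ‘stopping when the quality drops’ behaviour.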

Continue reading

Docker WordPress in a subdirectory

Moving a standard WordPress installation to a different host is a minor pain – I only do this occasionally, so every time I need to consider the configuration of the original environment and how this translates to the new server. Nothing too challenging, but tedious and prone to error.

So I figured Docker containers are the way to go and, sure enough, Docker Hub has more than enough images for my needs. The only issue is that I don’t dedicate my server to WordPress – it’s in a ./wordpress subdirectory of the web root. Docker’s official WordPress image keeps reinstating the WordPress files if they’re not found in the web root. Continue reading

Atech Postal – notes on the Fast Server

Atech’s Postal is an SMTP server and web management interface that’s geared towards transactional and bulk mailing (e.g. for application to user communication, and for marketing respectively). It’s quite well documented, but more importantly it’s open source (MIT license), and also seems well written – elegant, self-documenting code that’s easy to follow, useful comments, well structured. A bit of a joy really.

The Fast Server is a web server process, separate from the management interface server, that handles requests from click and open tracking links in order to log email Open and Click events. However, the documentation on the Fast Server seems to be at least partially out of date, so I thought I’d dig into the code to understand and document the bits that I was unsure of. Continue reading

Raspberry Pi GPFSEL, GPIO, and PADS Status Viewer

The gpfsel_list (I maybe should have called it lsgpio) utility displays a list of the currently configured function selections across all available GPIO pins and, for pins configured as GPIO, the current state of the pins. For pins configured with ALTn functions, the selected function is listed according to the datasheet information.

It also shows the state of the PADS registers to display the configured drive current, hysteresis, and slew setting for the three groups of pins (GPIO 0-27, 28-45, and 46-53).

It’s been written to produce output that’s easy to grep and cut, and performs only read operations on the registers – it can’t be used to modify settings, though I suppose this could change in future.
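For anyone curious how a function selection is read, here is a small Python sketch of the decoding step only, not the utility itself. It assumes the six GPFSEL register values have already been read into a list, and uses the layout as I understand it from the BCM2835 datasheet: ten pins per register, three bits per pin.

```python
# 3-bit function select codes, per the BCM2835 datasheet as I read it.
FSEL_NAMES = {
    0b000: "INPUT", 0b001: "OUTPUT",
    0b100: "ALT0", 0b101: "ALT1", 0b110: "ALT2", 0b111: "ALT3",
    0b011: "ALT4", 0b010: "ALT5",
}

def pin_function(gpfsel_regs, pin):
    """Return the function name for GPIO `pin` (0..53).

    gpfsel_regs is a list of the six 32-bit values GPFSEL0..GPFSEL5.
    """
    if not 0 <= pin <= 53:
        raise ValueError("pin must be in the range 0..53")
    reg = gpfsel_regs[pin // 10]   # ten pins per register
    shift = (pin % 10) * 3         # three bits per pin
    return FSEL_NAMES[(reg >> shift) & 0b111]
```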

Continue reading

I Closed my LinkedIn Account

Admittedly, it had been unused for quite a long time but, regardless, my LinkedIn profile had a few historical recommendations from people I actually knew and respected, so I hesitated before closing it.

The main reason I had for closing my LinkedIn account was to protest in some small way against the lawsuit that LinkedIn are pursuing against hiQ for scraping (automatically fetching and processing) public profiles of members.

I don’t know or care anything much about hiQ or their scraping antics, but LinkedIn pushing to criminalise accessing of public profiles, via a web server bound to a public TCP port, on a publicly visible computer is a dangerous step in the wrong direction. Continue reading

Raspberry Pi PCF8563 Real Time Clock (RTC)

Having recently received my Raspberry Pi, one of the first things I wanted to do was hook up a real-time clock chip I had lying around (a NXP PCF8563) and learn how to drive I2C from the BCM2835 hardware registers. Turns out it’s quite easy to do, and I think makes a useful project to learn with.

So, here are some notes I made getting it to work, initially with Chris Boot’s forked kernel that incorporates some I2C handling code created by Frank Buss into the kernel’s I2C bus driver framework.

After getting it to work with the kernel drivers, I created some C code to drive the RTC chip directly using the BCM2835 I2C registers. It uses mmap() to expose Peripheral IO space in the user’s (virtual) memory map, a technique I learned from Gert’s Gertboard demo software, though my code’s simpler (hopefully without limiting functionality!).

Note: Revision 2 boards require the code to access BSC1 (I2C1) rather than BSC0 (I2C0), so changes to the peripheral base address may be required, or in the case of the Linux I2C driver, a reference to i2c-1 rather than i2c-0. It should be simple enough, but I don’t want to write about things I haven’t done or tested, so a bit of extra work by the reader may be required.
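Whichever way the bytes are read, the chip returns the time as BCD with some flag bits mixed in. The Python sketch below shows only the decoding, not the I2C access. It assumes the seven bytes of registers 0x02 to 0x08 have already been read, and the masks follow the PCF8563 datasheet as I understand it. The function and key names are my own.

```python
def bcd_to_int(value):
    """Convert one packed BCD byte (e.g. 0x59) to an integer (59)."""
    return (value >> 4) * 10 + (value & 0x0F)

def decode_pcf8563(regs):
    """Decode the seven time/date bytes read from registers 0x02..0x08."""
    if len(regs) != 7:
        raise ValueError("expected exactly seven register bytes")
    return {
        # Bit 7 of the seconds register is the voltage-low (VL) flag.
        "clock_valid": not (regs[0] & 0x80),
        "seconds": bcd_to_int(regs[0] & 0x7F),
        "minutes": bcd_to_int(regs[1] & 0x7F),
        "hours": bcd_to_int(regs[2] & 0x3F),
        "day": bcd_to_int(regs[3] & 0x3F),
        "weekday": regs[4] & 0x07,
        # Bit 7 of the month register is the century flag, ignored here.
        "month": bcd_to_int(regs[5] & 0x1F),
        "year": bcd_to_int(regs[6]),  # two digits, 0..99
    }
```

If clock_valid is false, the chip has seen its supply drop too low at some point and the time should not be trusted until it is set again.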

Continue reading