I’ve been playing with the idea of Commerce Filtered Search, which is essentially a search engine for content that doesn’t exist simply to get you to click on it. Sort of like how the Internet used to be: a place made for those with boundless curiosity by those who just like to share what interests them.
My original plan was to ‘seed’ a web crawler with high-quality links and fan out from there. The seed links were extracted from Hacker News, and Reddit was on my radar. I also wrote an add-on for Firefox that lets me record links that I happen to find for subsequent crawling. The resulting search engine (with an outdated index) can be found here. I find it hugely promising, but each iteration seems to yield more complexity – search is hard.
So, I’m paring back. My goal right now is a personal search engine that I can throw URLs at from various good sources and have them fed into the index. I’m keeping a record of useful tools to help with this, and this page is currently a repository of information that should evolve into a more coherent post.
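To make the "throw URLs at it" idea concrete, here is a minimal sketch of the shape I have in mind, assuming an in-memory inverted index and a caller-supplied fetch function. The names TinyIndex, feed and fetch_text are made up for illustration; none of the tools below work exactly like this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """In-memory inverted index: word -> set of URLs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, url, text):
        for word in tokenize(text):
            self.postings[word].add(url)

    def search(self, query):
        """Return the URLs that contain every word of the query, sorted."""
        words = tokenize(query)
        if not words:
            return []
        # .get() avoids creating empty entries for unknown words
        hits = set.intersection(*(self.postings.get(w, set()) for w in words))
        return sorted(hits)


def feed(index, urls, fetch_text):
    """Fetch each URL with the supplied callable and add its text to the index."""
    for url in urls:
        index.add(url, fetch_text(url))
```

Passing fetch_text in as a parameter keeps the crawler, the text extractor and the index separate, so any one of them can be swapped for one of the tools listed below.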
Interesting stuff to consider:
Sonic describes itself as “Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.” https://github.com/valeriansaliou/sonic
This seems promising because it claims to work well with limited resources, which is something I’d trade some of Lucene’s flexibility for.
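As a rough sketch of what feeding a page into Sonic might look like: Sonic speaks a line-based text protocol (Sonic Channel) over TCP, and the snippet below opens an ingest session and sends one PUSH. The port 1491 and the password SecretPassword are the defaults from Sonic's sample config as far as I know, the collection and bucket names are placeholders I picked, and the escaping rules in quote() are my reading of the protocol notes rather than something I have verified against every edge case.

```python
import socket


def quote(text):
    """Quote text for a Sonic Channel command (assumed escaping rules)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    # Collapse newlines and runs of whitespace so the command stays on one line
    return '"' + " ".join(escaped.split()) + '"'


def push_command(collection, bucket, object_id, text):
    """Build a PUSH line: PUSH <collection> <bucket> <object> "<text>"."""
    return f"PUSH {collection} {bucket} {object_id} {quote(text)}"


def push(url, text, host="localhost", port=1491, password="SecretPassword"):
    """Open an ingest session, push one document, and return the server's reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        stream.readline()  # greeting line: CONNECTED <sonic-server ...>
        stream.write(f"START ingest {password}\n")
        stream.flush()
        stream.readline()  # STARTED ingest ...
        # "pages" and "default" are placeholder collection/bucket names
        stream.write(push_command("pages", "default", url, text) + "\n")
        stream.flush()
        return stream.readline().strip()  # "OK" on success
```

Using the URL as the object identifier means a later QUERY hands back URLs directly. Long pages would need to be split into chunks that fit the buffer size the server announces when the session starts, which this sketch does not do.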
Linkalot is a basic (at the time of writing) self-hosted link manager, and can be used to accumulate interesting links. It works well with Send-Tab-URL extension. https://gitlab.com/dmpop/linkalot
jusText is used for boilerplate removal. The original algorithm is unmaintained, but it has many implementations and forks. Even after it runs, you will likely still find things like cookie consent notices and image captions among your text. https://pypi.org/project/jusText/
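A minimal sketch of how this could look with the Python package, followed by a crude second pass for the leftovers mentioned above. The justext.justext and justext.get_stoplist calls are the ones from the package README; the LEFTOVER pattern and the max_words threshold are guesses of mine, not part of jusText, and would need tuning against real pages.

```python
import re

# Hypothetical patterns for consent banners and captions; tune for your pages.
LEFTOVER = re.compile(
    r"\bcookies?\b|\bconsent\b|^(photo|image|credit)s?\s*:", re.IGNORECASE
)


def extract_paragraphs(html, language="English"):
    """Run jusText and keep the paragraphs it does not classify as boilerplate."""
    import justext  # third-party: pip install jusText

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]


def drop_leftovers(paragraphs, max_words=25):
    """Drop short paragraphs that look like consent notices or image captions."""
    kept = []
    for text in paragraphs:
        is_short = len(text.split()) < max_words
        if is_short and LEFTOVER.search(text):
            continue  # short and suspicious: most likely residue
        kept.append(text)
    return kept
```

The length check is there so that a long article that happens to mention cookies is not thrown away along with the banners.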
spaCy could possibly be used to add a bit of semantic reasoning or keyword extraction to give the index more meaningful information to store. https://spacy.io/usage
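One simple way to try the keyword idea, as a sketch only: count the noun and proper-noun lemmas in a parsed document and keep the most frequent ones. The token attributes used here (lemma_, pos_, is_alpha, is_stop) are standard spaCy ones; the en_core_web_sm model name and the choice of "frequent nouns" as a stand-in for keywords are my assumptions.

```python
from collections import Counter


def keywords(doc, limit=5):
    """Return the most frequent noun and proper-noun lemmas in a spaCy Doc."""
    counts = Counter(
        token.lemma_.lower()
        for token in doc
        if token.pos_ in {"NOUN", "PROPN"} and token.is_alpha and not token.is_stop
    )
    return [word for word, _ in counts.most_common(limit)]


def load_pipeline(name="en_core_web_sm"):
    """Load a spaCy pipeline; the model must be downloaded separately."""
    import spacy  # third-party: pip install spacy

    return spacy.load(name)
```

Typical use would be nlp = load_pipeline() once, then keywords(nlp(text)) per page, storing the result alongside the page text in the index.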
readability-cli is a command-line tool based on the Firefox Reader Mode code that fetches pages and simplifies them to just the article text. As you’d expect, it works as well as Firefox Reader Mode, which is to say it works really well.