I’ve been playing with the idea of Commerce Filtered Search, which is essentially a search engine for content that doesn’t exist simply to get you to click on it. Sort of like how the Internet used to be: a place for those with boundless curiosity, built by those who just like to share what interests them.
My original plan was to ‘seed’ a web crawler with high-quality links and fan out from there. The seed links were extracted from Hacker News; Reddit was also on my radar. I also wrote an add-on for Firefox that lets me record links that I happen to find for subsequent crawling. The resulting search engine (with an outdated index) can be found here. I find it hugely promising, but each iteration seems to yield more complexity – search is hard.
So, I’m paring back. My goal right now is a personal search engine that I can throw URLs at from various good sources and have them fed into the index. I’m keeping a record of useful tools to help with this, and this page is currently a repository of information that should evolve into a more coherent post.
Interesting stuff to consider:
Sonic describes itself as “Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.” https://github.com/valeriansaliou/sonic
This seems promising because it claims to work well with limited resources, which is something I’d trade some of Lucene’s flexibility for.
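A minimal sketch of what pushing a page into Sonic and querying it back might look like over its raw TCP channel protocol. The collection and bucket names, object ID and password are placeholders, and the expected responses are based on my reading of the protocol docs rather than tested output:

```python
import socket

HOST, PORT, PASSWORD = "localhost", 1491, "SecretPassword"  # assumed defaults

def send(sock, line):
    # Send one protocol line and return whatever the server replies with.
    sock.sendall((line + "\r\n").encode())
    return sock.recv(4096).decode().strip()

# -- Ingest channel: push a page's extracted text, keyed by its URL --
ingest = socket.create_connection((HOST, PORT))
print(ingest.recv(4096).decode().strip())            # CONNECTED banner
print(send(ingest, f"START ingest {PASSWORD}"))      # STARTED ingest ...
print(send(ingest, 'PUSH pages default "https://example.com/post" "text of the page"'))
ingest.close()

# -- Search channel: query the same collection/bucket --
search = socket.create_connection((HOST, PORT))
print(search.recv(4096).decode().strip())            # CONNECTED banner
print(send(search, f"START search {PASSWORD}"))      # STARTED search ...
print(send(search, 'QUERY pages default "page" LIMIT(10)'))  # PENDING, then EVENT QUERY ...
search.close()
```

In practice one of the existing Sonic client libraries would do this more robustly, but the point is how little ceremony the backend needs.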
Linkalot is a basic (at the time of writing) self-hosted link manager, and can be used to accumulate interesting links. It works well with the Send-Tab-URL extension. https://gitlab.com/dmpop/linkalot
jusText is used for boilerplate removal. There are many implementations and forks of the original, now-unmaintained algorithm. You will still likely find things like cookie-consent text and image captions in the extracted text. https://pypi.org/project/jusText/
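A minimal sketch with the Python jusText package, keeping only the paragraphs it does not classify as boilerplate (the URL is a placeholder):

```python
import requests
import justext

response = requests.get("https://example.com/some-article", timeout=30)
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

# Join the paragraphs jusText considers "good" content into one block of text.
article_text = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
print(article_text)
```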
spaCy could possibly be used to add a bit of semantic reasoning or keyword extraction to give the index more meaningful information to store. https://spacy.io/usage
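A rough sketch of what that might look like: named entities and noun chunks as extra index terms, assuming the small English model is installed (python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# In the real pipeline this would be the boilerplate-stripped article text.
article_text = "Mozilla's Readability library powers the reader mode in Firefox."
doc = nlp(article_text)

entities = {ent.text for ent in doc.ents}                      # organisations, products, people, ...
keywords = {chunk.lemma_.lower() for chunk in doc.noun_chunks}  # candidate keyword phrases

# These sets could be stored alongside the page text to enrich the index.
print(entities, keywords)
```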
readability-cli is a library based on the Firefox Reader Mode code that provides a command to fetch and simplify pages to just the article text. As you’d expect, it works as well as Firefox Reader Mode, which is to say it works really well.
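For wiring readability-cli into a pipeline, something like the following ought to work; it simply shells out to the readable command and captures stdout, and the exact flags are worth checking against the installed version:

```python
import subprocess

def readable_page(url: str) -> str:
    """Return the simplified article produced by readability-cli for a URL."""
    result = subprocess.run(
        ["readable", url],          # readability-cli's command-line entry point
        capture_output=True, text=True, check=True, timeout=120,
    )
    return result.stdout

print(readable_page("https://example.com/some-article")[:500])
```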
What about YaCy? They have several specialised use cases described here: https://wiki.yacy.net/index.php/En:Use_cases
You are probably interested in: https://wiki.yacy.net/index.php/En:Use_cases#Personal_Web_Assistance
“You have a huge number of bookmarks on your private and on your company computer. You want to use all of them on both computers without publishing them. While you are on a journey, you also want to be able to access your personal bookmarks from any computer.”
Thank you, yes, I looked at YaCy and played with it a little a while back. The biggest difference from what I envisage is the approach to crawling. In particular, I don’t want to systematically crawl anything (the web is now too full of rubbish). Instead, I want to crawl judiciously, using indicators such as ad and tracking networks to eliminate junk from the index.
There seems to be a convergence of thought on the value of our bookmarks, and on ‘web directories’. I think the future of free high-quality search is rooted in this, because we have to start crawling from islands of quality in the sea of rubbish. There are also good writers who happen to publish on platforms filled with junk (e.g. Medium) – it’s not worth crawling much beyond the specific article or author.
Essentially, if there’s no garbage in the index, then there’s no garbage in the search results. Looking again at YaCy, it does so much of what’s required that I’m wondering whether simply enhancing its crawling strategies for my purpose might be the way to go.
They have a special mode called “Robinson mode”: https://wiki.yacy.net/index.php/En:Performance#Switch_to_Robinson_Mode
IMO it’s exactly what you want: a custom, non-shared index (your personal one) created from pages crawled from a list you’ve given it.
See this use case: https://wiki.yacy.net/index.php/En:Use_cases#Topic_Search and the web interface for the “Expert Crawl” here: https://yacy.searchlab.eu/CrawlStartExpert.html
There are a lot of options you could use to restrict the crawling depth and what should be crawled, based on simple regular expressions. For example, “Load Filter on URLs” gives the following options (sketched in plain regex form below):
must-match
[ ] Restrict to start domain(s)
[ ] Restrict to sub-path(s)
The first one restricts crawling to the domains in the seed list that you’ve given it (I haven’t played with it, but that’s my understanding; I could be wrong).
The second one should cover the case “Author is on a junk platform, crawl only his writings”.
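As an illustration (not YaCy’s actual implementation), those two must-match restrictions amount to something like the following regular expressions for a crawl seeded with a single, hypothetical Medium author page:

```python
import re

seed = "https://medium.com/@some-author/"   # hypothetical seed URL

# "Restrict to start domain(s)": any URL on the seed's host.
start_domain = re.compile(r"^https?://medium\.com/.*")

# "Restrict to sub-path(s)": only URLs under the seed's path, i.e. the
# "author is on a junk platform, crawl only their writings" case.
sub_path = re.compile(r"^https?://medium\.com/@some-author/.*")

for url in [
    "https://medium.com/@some-author/a-good-post",
    "https://medium.com/some-clickbait-publication/junk",
    "https://example.com/unrelated",
]:
    print(url, bool(start_domain.match(url)), bool(sub_path.match(url)))
```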
I think YaCy does seem to offer a lot of what is needed. In particular, there’s the possibility of aggregating only commerce-filtered indexes (a private collection of peers). Although I haven’t looked at the YaCy code yet, I can envisage the point at which a URL is fetched, and hooking in (ultimately) a call to Selenium driving Firefox with uBlock Origin to get both the page content and the number of ‘blocked requests’ as a measure of the page’s spamminess. I have reasonably complete code to do this at the moment. Obviously it’s orders of magnitude slower than simple fetching, but it can be parallelised.
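As a rough sketch (not the code mentioned above), one way to approximate the ‘blocked requests’ count without reading it out of uBlock Origin itself is to collect the resource URLs a page references in headless Firefox and match them against an EasyList-style filter file with the adblockparser package. The filter-list path, the resource selectors and the idea of using the raw count as the score are all assumptions:

```python
from adblockparser import AdblockRules
from selenium import webdriver
from selenium.webdriver.common.by import By

# EasyList-style filter rules; easylist.txt is assumed to be downloaded locally.
with open("easylist.txt") as f:
    rules = AdblockRules(f.read().splitlines())

options = webdriver.FirefoxOptions()
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)

def spamminess(url: str) -> tuple[str, int]:
    """Return (page text, count of referenced resources an ad blocker would block)."""
    driver.get(url)
    resources = [
        el.get_attribute("src")
        for el in driver.find_elements(By.CSS_SELECTOR, "script[src], img[src], iframe[src]")
    ]
    blocked = sum(
        1 for src in resources
        if src and rules.should_block(src, {"third-party": True})
    )
    text = driver.find_element(By.TAG_NAME, "body").text
    return text, blocked

text, blocked = spamminess("https://example.com/some-article")
print(f"{blocked} blockable resources")   # a high count suggests a junk-heavy page
driver.quit()
```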
A second-order crawl would take the URLs harvested from the primary crawl and use them for the next level of the survey/index. What I have found, observationally at least, is that second-order links frequently point to rubbish. Instinctively this feels like something best left to a trained neural network, but I haven’t got around to trying that yet (it’s one of a number of side projects!).
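As a placeholder for that experiment, here is roughly how the second-order harvest plus a learned junk filter could slot together. A scikit-learn logistic regression over character n-grams of the URL stands in for the eventual neural network, and the good_urls.txt / junk_urls.txt training files are assumptions:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labelled example URLs: one per line, assumed to exist.
good = open("good_urls.txt").read().splitlines()
junk = open("junk_urls.txt").read().splitlines()

# Character n-grams of the URL string fed into a simple linear classifier.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(good + junk, [1] * len(good) + [0] * len(junk))

def second_order_candidates(page_url: str, threshold: float = 0.5):
    """Yield outbound links from a first-order page that the model scores as non-junk."""
    soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")
    for a in soup.find_all("a", href=True):
        url = urljoin(page_url, a["href"])
        if model.predict_proba([url])[0][1] >= threshold:
            yield url
```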