Hackers' Pub

Okay, back-of-the-napkin math:
- There are probably 100 million sites and 1.5 billion pages worth indexing in a #search engine.
- It takes about 1TB to #index 30 million pages.
- We only care about text on a page.

I define a page as worth indexing if:
- It is not a FAANG site
- It has at least one referrer (no deep or dark web)
- It's active

So, this means we need 40TB of fast data to make a good index for the internet. That's not "runs locally" sized, but it is nonprofit sized.
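A quick sketch to sanity-check the arithmetic, using only the two figures from the list above. The function names are made up for illustration, and 1 TB is taken as 10**12 bytes. With these inputs the total comes out nearer 50TB than 40TB, which is the same ballpark for napkin math.

```python
# Back-of-the-napkin storage estimate using the figures from the post.
PAGES_WORTH_INDEXING = 1_500_000_000  # 1.5 billion pages
PAGES_PER_TB = 30_000_000             # about 30 million pages per TB


def index_size_tb(pages: int, pages_per_tb: int = PAGES_PER_TB) -> float:
    """Return the index size in terabytes for a given page count."""
    return pages / pages_per_tb


def bytes_per_page(pages_per_tb: int = PAGES_PER_TB) -> float:
    """Return the per-page budget in bytes, taking 1 TB as 10**12 bytes."""
    return 1e12 / pages_per_tb


total_tb = index_size_tb(PAGES_WORTH_INDEXING)
per_page_kb = bytes_per_page() / 1000
print(f"{total_tb:.0f} TB total, about {per_page_kb:.0f} KB per page")
```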

My size assumptions are basically as follows:
- #URL
- #TFIDF information
- Text #Embeddings
- Snippet

We can store the index entry for one page in about 30KB. So, with 40TB we can store a full internet index. That's about $500 in storage.

Access time becomes a problem. TFIDF for the whole internet can easily fit in RAM. Even with #quantized embeddings, you can only fit 2 million per GB in RAM.

Assuming you had enough RAM, it could be fast: TF-IDF to get 100 million candidates, #FAISS to sort those, load snippets dynamically, potentially modify rank by referrers, etc.
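Here is a toy sketch of that pipeline in pure Python. It is not the real thing: a brute-force cosine similarity stands in for FAISS, the three-page corpus and its embeddings are made up, and the query embedding is passed in by hand where a real system would compute it with an embedding model. The `referrer_weight` knob is also an assumption.

```python
import math
from collections import Counter

# Toy corpus: url -> (text, embedding, referrer_count). All values are made up.
DOCS = {
    "https://a.example/pi": ("raspberry pi search index", [1.0, 0.0], 3),
    "https://b.example/cake": ("raspberry cake recipe", [0.0, 1.0], 10),
    "https://c.example/faiss": ("vector search index library", [0.9, 0.1], 1),
}


def tokenize(text):
    return text.lower().split()


def idf_table(docs):
    """Inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(t for text, _, _ in docs.values() for t in set(tokenize(text)))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def tfidf_candidates(query, docs, idf, candidate_limit):
    """Stage 1: cheap lexical filter that keeps pages sharing a query term."""
    scores = {}
    for url, (text, _, _) in docs.items():
        tf = Counter(tokenize(text))
        score = sum(tf[t] * idf.get(t, 0.0) for t in tokenize(query))
        if score > 0:
            scores[url] = score
    return sorted(scores, key=scores.get, reverse=True)[:candidate_limit]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, query_embedding, docs, limit=2, referrer_weight=0.01):
    idf = idf_table(docs)
    candidates = tfidf_candidates(query, docs, idf, candidate_limit=100)

    # Stage 2: re-rank by embedding similarity (brute force here, FAISS in
    # the post), nudged slightly by how many referrers a page has.
    def rank(url):
        _, embedding, referrers = docs[url]
        return cosine(query_embedding, embedding) + referrer_weight * math.log1p(referrers)

    ranked = sorted(candidates, key=rank, reverse=True)
    # Stage 3: fetch snippets only for the final results (the text stands in
    # for a snippet in this toy corpus).
    return [(url, docs[url][0]) for url in ranked[:limit]]


results = search("search index", [1.0, 0.0], DOCS)
for url, snippet in results:
    print(url, "-", snippet)
```

The cake page has the most referrers but never reaches stage 2, because it shares no term with the query.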

Six 128GB #Framework #desktops, each with 5TB HDs (plus one Raspberry Pi to sort the final candidates from the six machines), is enough to replace #Google. That's about $15k.
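The machine count can be checked against the earlier figures. This sketch assumes each machine has 128 GB of RAM, that there is exactly one embedding per page, and that 1 GB is 10**9 bytes. It ignores the RAM the OS and the TF-IDF tables need, so treat it as a lower bound.

```python
import math

PAGES = 1_500_000_000          # pages worth indexing (from the post)
EMBEDDINGS_PER_GB = 2_000_000  # quantized embeddings per GB of RAM (from the post)
RAM_PER_MACHINE_GB = 128       # assumption: 128 GB of RAM per machine


def bytes_per_embedding(per_gb: int = EMBEDDINGS_PER_GB) -> float:
    """Size of one quantized embedding implied by the per-GB figure."""
    return 1e9 / per_gb


def ram_needed_gb(pages: int = PAGES, per_gb: int = EMBEDDINGS_PER_GB) -> float:
    """RAM needed to hold one embedding per page."""
    return pages / per_gb


def machines_needed(ram_gb: float, per_machine_gb: int = RAM_PER_MACHINE_GB) -> int:
    """Whole machines required to hold that much RAM."""
    return math.ceil(ram_gb / per_machine_gb)


ram_gb = ram_needed_gb()
print(f"{bytes_per_embedding():.0f} bytes per embedding")
print(f"{ram_gb:.0f} GB of RAM -> {machines_needed(ram_gb)} machines")
```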

In two to three years this will be doable on a single machine for around $3k.

By the end of the decade it should be able to run as an app on a powerful desktop.

Three years after that it can run on a #laptop.

Three years after that it can run on a #cellphone.

By #2040 it's a background process on your cellphone.

Okay, but we can also #federate this now with the #fediverse. Like, #ActivityPub can handle search queries just fine.

So, just running on microcomputers, everyone can put whatever they want in their own index.

A person can _easily_ index 50,000 pages on a Raspberry Pi.

A #FediSearch can broadcast any query to known peers. Each peer returns top-k results. The originating node can then aggregate and rank.
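A minimal sketch of that broadcast-and-merge step. Peers are plain Python callables here rather than ActivityPub actors, the scoring is a naive term count, and it assumes scores from different peers are comparable, which a real network would have to work for. `make_peer` and `federated_search` are made-up names.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Result = Tuple[float, str]                 # (score, url)
Peer = Callable[[str, int], List[Result]]  # (query, k) -> top-k results


def make_peer(pages: Dict[str, str]) -> Peer:
    """Build a fake peer over a url -> text index, scored by term count."""
    def answer(query: str, k: int) -> List[Result]:
        terms = query.lower().split()
        scored = [
            (float(sum(text.lower().split().count(t) for t in terms)), url)
            for url, text in pages.items()
        ]
        return heapq.nlargest(k, [r for r in scored if r[0] > 0])
    return answer


def federated_search(query: str, peers: List[Peer], k: int = 3) -> List[Result]:
    """Ask every peer for its top-k, dedupe by URL, and keep the global top-k."""
    best: Dict[str, float] = {}
    for peer in peers:
        for score, url in peer(query, k):
            best[url] = max(score, best.get(url, 0.0))
    return heapq.nlargest(k, ((score, url) for url, score in best.items()))


peer_a = make_peer({
    "https://x.example/1": "fediverse search search",
    "https://x.example/2": "cats",
})
peer_b = make_peer({
    "https://y.example/1": "search engines",
    "https://x.example/1": "fediverse search search",
})
merged = federated_search("search", [peer_a, peer_b], k=2)
print(merged)
```

Each peer only ever ships k results, so the originating node merges at most `k * len(peers)` entries no matter how big the peers' indexes are.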

So @alice🅰🅻🅸🅲🅴 (🌈🦄) queries their FediSearch: it searches its own index and queries subscribed peers, and those peers do the same thing. Nodes can choose who they trust, what they cache, etc.
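The "peers do the same thing" part needs some way to stop a query from bouncing around forever. One way to sketch it, assuming a hop limit, a set of already-visited node names, and a per-peer trust weight that scales incoming scores (all three are assumptions, not anything ActivityPub specifies). Caching is left out to keep it short.

```python
import heapq


class Node:
    def __init__(self, name, pages):
        self.name = name
        self.pages = pages   # url -> text
        self.trusted = {}    # peer Node -> trust weight in (0, 1]

    def trust(self, peer, weight=1.0):
        self.trusted[peer] = weight

    def local(self, text):
        """Score this node's own pages by naive term count."""
        terms = text.lower().split()
        scores = {}
        for url, body in self.pages.items():
            score = float(sum(body.lower().split().count(t) for t in terms))
            if score > 0:
                scores[url] = score
        return scores

    def query(self, text, k=3, hops=2, seen=frozenset()):
        """Search locally, then ask trusted peers until the hop budget runs out."""
        seen = seen | {self.name}
        scores = self.local(text)
        if hops > 0:
            for peer, weight in self.trusted.items():
                if peer.name in seen:
                    continue  # skip nodes already on this query path
                for score, url in peer.query(text, k, hops - 1, seen):
                    scores[url] = max(scores.get(url, 0.0), weight * score)
        return heapq.nlargest(k, ((s, u) for u, s in scores.items()))


alice = Node("alice", {"https://alice.example/a": "fedi search"})
bob = Node("bob", {"https://bob.example/b": "search search"})
carol = Node("carol", {"https://carol.example/c": "search search search search"})
alice.trust(bob, 0.5)
bob.trust(carol, 1.0)
bob.trust(alice, 1.0)   # a cycle, handled by the seen set

two_hops = alice.query("search", hops=2)
one_hop = alice.query("search", hops=1)
print(two_hops)
print(one_hop)
```

With two hops alice sees carol's page through bob, discounted by how much alice trusts bob. With one hop carol's page never shows up.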

The number of indexed pages will be something along the lines of `pages_per_node * log(number_nodes)`. So a thousand nodes may only cover a million pages, but if the trust network is good, those are probably the most important million pages.
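Plugging numbers into that formula. The base of the logarithm isn't stated, so base 2 is an assumption here, as is the 50,000 pages per node taken from the Raspberry Pi figure. That gives roughly half a million pages for a thousand nodes, the same order of magnitude as the million quoted.

```python
import math


def estimated_coverage(pages_per_node: int, number_nodes: int, base: float = 2.0) -> float:
    """Rough count of distinct pages covered: pages_per_node * log(number_nodes)."""
    return pages_per_node * math.log(number_nodes, base)


coverage = estimated_coverage(50_000, 1_000)
print(f"about {coverage:,.0f} distinct pages across 1,000 nodes")
```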

Also, I would venture that you'd have some nodes specializing in having a lot of pages (tens of millions), others just for stuff they like, and others specifically for non-commercial interests. Selecting who you federate your search with really affects the ranking.

#FediSearch #ActivityPub #FederatedSearch #Fediverse #RaspberryPi #SearchEngine #DecentralizedWeb #SelfHosting
