I want to dive into how to make search engines


search-engines

2 15 0.0 Markdown

Discontinued Reviewing alternative search engines

I'm in the same boat. I just started collecting search engines and analyzing them. I've listed some of them at https://github.com/Tintedfireglass/search-engines. It is easy to look at search algorithms, queries, and user interfaces, but I still don't understand how to create one.

search-lib

1 1 10.0 Python

A library of classes which can be used to build a search engine.

I wrote my own little search engine years ago.
It's pretty simple: You crawl a website and search for all links, then you crawl all the links from these linked websites and so on.
You can still see some of my code here: https://github.com/Wronnay/search-lib
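The crawl loop described in this comment can be sketched as a breadth-first traversal. This is a minimal illustration, not the code from search-lib: the names `LinkParser` and `crawl` are hypothetical, and `fetch` is an assumed caller-supplied function that returns a page's HTML (or `None` on failure), so the sketch stays independent of any particular HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at `seed`.

    `fetch` is a hypothetical callable: URL -> HTML text, or None on failure.
    Returns a dict mapping each fetched URL to the absolute links found on it.
    """
    seen = {seed}
    queue = deque([seed])
    graph = {}
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        links = []
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if not absolute.startswith(("http://", "https://")):
                continue  # skip mailto:, javascript:, and similar schemes
            links.append(absolute)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        graph[url] = links
    return graph
```

Passing `fetch` in makes the loop easy to try against an in-memory dict of pages (`crawl("http://a.test/", pages.get)`). A crawler pointed at real sites would also need to respect robots.txt, limit its request rate, and handle errors and non-HTML content, none of which is shown here.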

tantivy

48 9,839 9.1 Rust

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

I've never worked on a project that encompasses as many computer science algorithms as a search engine. There are a lot of topics you can look up in "Information Storage and Retrieval":
- Tries (patricia, radix, etc...)
- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc...)
- Consensus (raft, paxos, etc...)
- Block storage (disk block size optimizations, mmap files, delta storage, etc...)
- Probabilistic filters (HyperLogLog, bloom filters, etc...)
- Binary Search (sstables, sorted inverted indexes, roaring bitmaps)
- Ranking (pagerank, tf/idf, bm25, etc...)
- NLP (stemming, POS tagging, subject identification, sentiment analysis, etc...)
- HTML (document parsing/lexing)
- Images (exif extraction, removal, resizing / proxying, etc...)
- Queues (SQS, NATS, Apollo, etc...)
- Clustering (k-means, density, hierarchical, gaussian distributions, etc...)
- Rate limiting (leaky bucket, windowed, etc...)
- Compression
- Applied linear algebra
- Text processing (unicode-normalization, slugify, sanitation, lossless and lossy hashing like metaphone and document fingerprinting)
- etc...
I'm sure there is plenty more I've missed. There are lots of generic structures involved like hashes, linked-lists, skip-lists, heaps and priority queues, and this is just to get 2000s-level basic tech.
- https://github.com/quickwit-oss/tantivy
- https://github.com/valeriansaliou/sonic
- https://github.com/mosuka/phalanx
- https://github.com/meilisearch/MeiliSearch
- https://github.com/blevesearch/bleve
- https://github.com/thomasjungblut/go-sstables
A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. The problem is that search with good rankings often requires custom storage so calculations can be sharded among multiple nodes and you can do layered ranking without passing huge blobs of results between systems.
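To make two of the listed topics concrete (inverted indexes and BM25 ranking), here is a small single-process sketch. It is only an illustration of the general idea and not how any of the linked libraries are implemented. The class name `InvertedIndex`, the tokenizer, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example, and nothing here addresses the sharding or layered-ranking concerns raised above.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy in-memory inverted index with BM25 scoring."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1  # term-frequency saturation (assumed default)
        self.b = b    # document-length normalization (assumed default)
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query, limit=10):
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            docs = self.postings.get(term)
            if not docs:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        # Highest score first; ties broken by document id for stable output.
        return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))[:limit]
```

Usage is `index.add(doc_id, text)` for each document followed by `index.search("some query")`, which returns `(doc_id, score)` pairs. Everything lives in Python dicts, so this is suitable only for experimenting with small collections.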

sonic

48 19,390 7.0 Rust

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.


phalanx

13 341 0.0 Go

Phalanx is a cloud-native distributed search engine that provides endpoints through gRPC and traditional RESTful API.


MeiliSearch

129 43,043 9.8 Rust

A lightning-fast search API that fits effortlessly into your apps, websites, and workflow


bleve

13 9,655 7.4 Go

A modern text/numeric/geo-spatial/vector indexing library for go


go-sstables

4 251 4.0 Go

Go library for protobuf compatible sstables, a skiplist, a recordio format and other database building blocks like a write-ahead log. Ships now with an embedded key-value store.


now

8 588 9.7 Python

Discontinued 🧞 No-code tool for creating a neural search solution in minutes (by jina-ai)

jina

126 19,884 9.2 Python

☁️ Build multimodal AI applications with cloud-native stack

What kinda thing do you want to search? Text I guess? But there are search engines for images, gifs, video, all kinds of stuff.
I work on an open-source project that builds an AI-powered search framework [0], and I've built some examples in very few lines of code (searching fashion products via image or text [1], and PDF text/image/table search [2]). One of our community members also built a protein search engine [3].
A good place to start might be with a no-code solution like (shameless self-plug time) Jina NOW [4], which lets you build a search engine and GUI with just one CLI command.
[0] https://github.com/jina-ai/jina/

protein_search

1 15 1.8 Python

The neural search engine for proteins.

grub-2.0

4 19 0.0 Python

Grub is an AI powered Web crawler.

Not finished, but the Selenium-based crawler works pretty well to combat most blocks: https://github.com/kordless/grub-2.0
For IP blocks, try this: https://github.com/kordless/mitta-screenshot

mitta-screenshot

2 2 1.1 JavaScript

Mitta's Chrome extension for saving the current view of a website.


Milvus

104 26,645 10.0 Go

A cloud-native vector database, storage for next generation AI applications

This depends mostly on what kind of search engine you're trying to build. I unfortunately won't be able to point you towards courses, but there are tons of great resources online to help you get started.
Search engines are a fairly broad topic, and a lot of it depends on the _type of data_ that you want to build a search engine for. If you're looking towards more traditional, Google/Yahoo-like search, Elasticsearch's learning center (https://www.elastic.co/learn) has quite a few good resources that can point you in the right direction. Many enterprise search solutions are built on top of Apache Lucene (including Elasticsearch), and videos/blogs discussing Lucene's architecture (https://www.endava.com/en/blog/Engineering/2021/Elasticsearc...) are a great starting point as well.
At the opposite end from text/web search is _unstructured data_ search, i.e. searching across images, video, audio, etc. based on their semantics (https://www.deepset.ai/blog/semantic-search-with-milvus-know...). Work in this space has been ongoing for decades, but an emerging way of doing this is via a _vector database_ (https://frankzliu.com/blog/a-gentle-introduction-to-vector-d...) such as Zilliz Cloud (https://zilliz.com/cloud) or Milvus (https://milvus.io/). The idea here is to turn the type of data that you're searching for into a high-dimensional vector called an embedding, and to perform nearest-neighbor search on the embeddings themselves.
Disclaimer: I'm a part of the Zilliz team and Milvus community.
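The nearest-neighbor step described in this comment can be illustrated with a brute-force search over a handful of vectors. This sketch does not use the Milvus or Zilliz APIs. The function names are hypothetical, cosine similarity is an assumed choice of metric, and the tiny hand-written vectors in the usage note stand in for embeddings that a real system would get from a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, embeddings, k=3):
    """Return the k (key, similarity) pairs most similar to `query`.

    `embeddings` maps an item key to its vector. Every stored vector is
    compared against the query, so the cost grows linearly with the collection.
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in embeddings.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [(key, score) for score, key in scored[:k]]
```

For example, with `{"cat.jpg": [0.9, 0.1, 0.0], "dog.jpg": [0.8, 0.3, 0.1], "car.jpg": [0.0, 0.2, 0.9]}` and the query `[1.0, 0.0, 0.0]`, `nearest(..., k=2)` ranks `cat.jpg` ahead of `dog.jpg`. Scanning every vector is fine for a demo. Vector databases typically rely on approximate nearest-neighbor indexes so that queries stay fast on large collections.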

markov

2 273 0.0 C

Materials for book: "Markov Chains for programmers"

I would try to grasp the 'random surfer' idea, which is modeled by a Markov chain. A nice free book is Markov Chains for Programmers [1]. A discrete-time Markov chain boils down to a set of conditional probabilities, which boils down to a matrix, and its stationary distribution boils down to an eigenvector of that matrix with eigenvalue 1, which determines PageRank. Then one can jump to 'The $25,000,000,000 Eigenvector: The Linear Algebra behind Google' [2].
[1] https://github.com/czekster/markov
[2] https://doi.org/10.1137/050623280
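The 'random surfer' chain from this comment can be tried out with power iteration, which repeatedly applies the transition step until the rank vector stops changing and so approximates the eigenvalue-1 eigenvector. This is a minimal sketch under stated assumptions: the function name `pagerank`, the damping factor of 0.85, and the choice to spread the rank of pages with no outgoing links evenly over all pages are illustrative, not taken from the comment.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict page -> rank, with ranks summing to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    if n == 0:
        return {}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # With probability (1 - damping) the surfer jumps to a random page.
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Otherwise the surfer follows one outgoing link at random.
                new[t] += damping * rank[p] / len(targets)
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
    return rank
```

Calling `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})` gives `c` the highest rank, since three pages link to it, and `d` the lowest, since nothing links to it.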

Yacy

115 3,244 8.7 Java

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance

Have you seen YaCy Search Engine? https://yacy.net/
This might be something to build on or explore.


NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
search-engine Database Rust HacktoberFest Search Engines
Post date: 25 Aug 2022
