Top 23 information-retrieval Open-Source Projects

EasyOCR

38 21,795 4.6 Python

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.

Project mention: Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide | dev.to | 2023-12-27

PyTesseract Module, EasyOCR Module, PaddlePaddle OCR
gensim

18 15,212 7.5 Python

Topic Modelling for Humans

Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08
haystack

54 13,486 9.9 Python

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Project mention: Release Radar • March 2024 Edition | dev.to | 2024-04-07

Weaviate

76 9,359 10.0 Go

Weaviate is an open-source vector database that stores both objects and vectors, combining vector search and structured filtering with the fault tolerance and scalability of a cloud-native database.

Project mention: pgvecto.rs alternatives - qdrant and Weaviate | libhunt.com/r/pgvecto.rs | 2024-03-13
txtai

354 6,910 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Project mention: Build knowledge graphs with LLM-driven entity extraction | dev.to | 2024-02-21

unstructured

12 5,750 9.8 HTML

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Project mention: LlamaCloud and LlamaParse | news.ycombinator.com | 2024-02-20

Be careful with unstructured:
https://github.com/Unstructured-IO/unstructured/blob/d11c70c...
from: https://github.com/open-webui/open-webui/issues/687
ragflow

6 4,569 9.5 Python

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

Project mention: RAGFlow is an open-source RAG engine based on deep document understanding | news.ycombinator.com | 2024-04-01

Just link them to https://github.com/infiniflow/ragflow/blob/main/rag/llm/chat... :)
Apache Solr

30 4,364 0.0 Java

Apache Lucene and Solr open-source search software

Project mention: YaCy, a distributed Web Search Engine, based on a peer-to-peer network | news.ycombinator.com | 2024-03-05

There are already many projects about search:
- https://www.marginalia.nu/
- https://searchmysite.net/
- https://lucene.apache.org/
- elastic search
- https://presearch.com/
- https://stract.com/
- https://wiby.me/
I think that all of these projects are fun. I would like to see one succeed in reaching a mainstream level of attention.
I have also been gathering link metadata for some time. Maybe I will use it to feed an eventual self-hosted search engine, or a language model, if I decide to experiment with that.
- domains for seed https://github.com/rumca-js/Internet-Places-Database
- bookmarks seed https://github.com/rumca-js/RSS-Link-Database
- links for year https://github.com/rumca-js/RSS-Link-Database-2024
marqo

114 4,086 9.3 Python

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

Project mention: Are we at peak vector database? | news.ycombinator.com | 2024-01-25

We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems and everything that comes with that. I would love to chat about 1 and 2 so feel free to email me (email is in my profile). What we have done so far is here -> https://github.com/marqo-ai/marqo
screenFetch

4 3,725 4.6 Shell

Fetches system/theme information in terminal for Linux desktop screenshots.
catalyst

1 3,221 0.0 Python

Accelerated deep learning R&D (by catalyst-team)

Project mention: Instance segmentation of small objects in grainy drone imagery | /r/computervision | 2023-12-09
llmware

9 3,056 9.8 Python

Providing enterprise-grade LLM-based development framework, tools, and fine-tuned models.

Project mention: More Agents Is All You Need: LLMs performance scales with the number of agents | news.ycombinator.com | 2024-04-06

I couldn't agree more. You should check out LLMWare's SLIM agents (https://github.com/llmware-ai/llmware/tree/main/examples/SLI...). It's focusing on pretty much exactly this and chaining multiple local LLMs together.
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit incorrect) depending on what the model is intended for. The LLMWare team did a good two-part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)
I think dedicated miniature LLMs are the way forward.
Disclaimer - Not affiliated with them in any way, just think it's a really cool project.
ranking

1 2,713 6.3 Python

Learning to Rank in TensorFlow
InvoiceNet

4 2,378 3.9 Python

Deep neural network to extract intelligent information from invoice documents.
lucene

11 2,333 9.8 Java

Apache Lucene open-source search software

Project mention: Building an efficient sparse keyword index in Python | dev.to | 2023-08-17

First, a review of the landscape. As said in the introduction, there aren't a ton of good options. Apache Lucene is by far the best traditional search index from a speed, performance and functionality standpoint. It's the base for Elasticsearch/OpenSearch and many other projects. But it requires Java.
IP-Tracer

2 1,806 0.0 PHP

Track any IP address with IP-Tracer. IP-Tracer is developed for Linux and Termux. You can retrieve information about any IP address using IP-Tracer.
StringZilla

14 1,749 9.8 C++

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖

Project mention: Measuring energy usage: regular code vs. SIMD code | news.ycombinator.com | 2024-02-19

The 3.5x energy-efficiency gap between serial and SIMD code becomes even larger when
A. you do byte-level processing instead of float words;
B. you use embedded, IoT, and other low-energy devices.
A few years ago I compared Nvidia Jetson Xavier (long before the Orin release), an Intel-based MacBook Pro with Core i9, and AVX-512 capable CPUs on substring search benchmarks.
On Xavier one can quite easily disable/enable cores and reconfigure power usage. At peak I got to 4.2 GB/J, which was an 8.3x improvement in efficiency over LibC in substring search operations. The comparison table is still available in the older README: https://github.com/ashvardanian/StringZilla/tree/v2.0.2?tab=...
instructor-embedding

4 1,685 6.1 Python

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

Project mention: My experience on starting with fine tuning LLMs with custom data | /r/LocalLLaMA | 2023-07-10

If you like embeddings and vector DBs, you should look into this: https://github.com/HKUNLP/instructor-embedding
pke

3 1,519 3.1 Python

Python Keyphrase Extraction module
beir

8 1,364 4.2 Python

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

Project mention: On building a semantic search engine | news.ycombinator.com | 2024-01-06

The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
mteb

2 1,314 9.1 Python

MTEB: Massive Text Embedding Benchmark

Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries
- Content vectors can differ vastly from question vectors; for this you have to use hypothetical embeddings (they generate artificial questions and store them)
- Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical embedding questions, metadata)
- RAG will fail miserably with requests like "summarize the whole document"
- To my knowledge, OpenAI embeddings aren't performing well; use an embedding that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
[1] https://github.com/underlines/awesome-marketing-datascience/...
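The "several embeddings per chunk" idea from the comment above can be sketched roughly as follows. This is a minimal illustration, not code from any project on this list: `embed` is a toy bag-of-words stand-in for a real embedding model, and `MultiVectorIndex` is a hypothetical name. Each chunk is indexed under its own text plus any artificial questions, and a query is scored against every stored vector, keeping the best score per chunk.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MultiVectorIndex:
    """Hypothetical index that stores several vectors per text chunk."""

    def __init__(self):
        self.entries = []  # (vector, chunk_id) pairs; many may share a chunk_id
        self.chunks = {}   # chunk_id -> original chunk text

    def add(self, chunk_id, text, questions=()):
        # Index the chunk under its own text and under each artificial question.
        self.chunks[chunk_id] = text
        for variant in (text, *questions):
            self.entries.append((embed(variant), chunk_id))

    def search(self, query, k=1):
        # Score every stored vector, keep the best score per chunk, rank chunks.
        q = embed(query)
        best = {}
        for vec, chunk_id in self.entries:
            score = cosine(q, vec)
            if score > best.get(chunk_id, -1.0):
                best[chunk_id] = score
        ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]
```

With this layout, a question that shares no words with a chunk's text can still retrieve it through one of the stored questions, which is the gap between content vectors and question vectors that the comment describes.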
solr

5 1,003 9.8 Java

Apache Solr open-source search software

Project mention: Swirl: An open-source search engine with LLMs and ChatGPT to provide all the answers you need 🌌 | dev.to | 2023-09-06

Using the Galaxy UI, knowledge workers can systematically review the best results from all configured services including Apache Solr, ChatGPT, Elastic, OpenSearch, PostgreSQL, Google BigQuery, plus generic HTTP/GET/POST with configurations for premium services like Google's Programmable Search Engine, Miro and Northern Light Research.
pisa

1 849 8.2 C++

PISA: Performant Indexes and Search for Academia

Project mention: A Compressed Indexable Bitset | news.ycombinator.com | 2023-07-01

The EF core algorithm implemented in folly [3] may be a bit faster, and implementing partitioning on top of that is relatively easy.
It would definitely compress much better than roaring bitmaps. In terms of performance, it depends on the access patterns. If very sparse (large jumps) PEF would likely be faster, if dense (visit a large fraction of the bitmap) it'd be slower.
It is possible to squeeze a bit more compression out of PEF by introducing a chunk type for Elias-Fano of the chunk complement (for very dense chunks), but you lose the operation of skipping to a given position, which is however not needed in inverted indexes (you only need to skip past a given id, and that can be supported efficiently). That is not mentioned in the paper because at the time I thought the skip-to-position operation was a non-negotiable.
[1] https://github.com/ot/ds2i/
[2] https://github.com/pisa-engine/pisa
[3] https://github.com/facebook/folly/blob/main/folly/experiment...
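For readers unfamiliar with the Elias-Fano (EF) encoding discussed in the comment above, here is a minimal sketch of the plain, non-partitioned scheme. It is an illustration only and is not taken from PISA, ds2i, or folly; the function names `ef_encode` and `ef_decode` are made up for this example, and the high bits are kept in a Python list rather than a packed bit vector. Each value in a sorted list is split into `l` low bits, stored verbatim, and a high part, stored in a unary-style bit vector where element `i` with high part `h` sets bit `h + i`.

```python
def ef_encode(values, universe):
    """Encode a sorted list of integers in [0, universe] with Elias-Fano."""
    n = len(values)
    # Number of low bits per element: floor(log2(universe / n)), at least 0.
    if n:
        l = max(0, (universe // n).bit_length() - 1)
    else:
        l = 0
    mask = (1 << l) - 1
    lows = [v & mask for v in values]
    # Element i with high part h sets bit h + i; positions are strictly
    # increasing because the input is sorted.
    high_bits = [0] * (n + (universe >> l) + 1)
    for i, v in enumerate(values):
        high_bits[(v >> l) + i] = 1
    return l, lows, high_bits


def ef_decode(l, lows, high_bits):
    """Recover the original sorted list from its Elias-Fano encoding."""
    out = []
    i = 0  # number of set bits seen so far, i.e. index of the next element
    for pos, bit in enumerate(high_bits):
        if bit:
            out.append(((pos - i) << l) | lows[i])
            i += 1
    return out
```

A partitioned variant (PEF) would split the list into chunks and encode each chunk separately; that step is not shown here.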

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-07.

information-retrieval related posts

Splade: Sparse Neural Search
1 project | news.ycombinator.com | 11 Mar 2024
Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps
1 project | news.ycombinator.com | 8 Mar 2024
On building a semantic search engine
3 projects | news.ycombinator.com | 6 Jan 2024
Ask HN: Is there any good semantic search GUI for images or documents?
2 projects | news.ycombinator.com | 17 Jan 2024
BEIR: A Heterogeneous Benchmark for Information Retrieval
1 project | news.ycombinator.com | 2 Jan 2024
Choosing vector database: a side-by-side comparison
3 projects | news.ycombinator.com | 4 Oct 2023
Benefits of hybrid search
1 project | dev.to | 18 Aug 2023

Index

What are some of the best open-source information-retrieval projects? This list will help you:

#	Project	Stars
1	EasyOCR	21,795
2	gensim	15,212
3	haystack	13,486
4	Weaviate	9,359
5	txtai	6,910
6	unstructured	5,750
7	ragflow	4,569
8	Apache Solr	4,364
9	marqo	4,086
10	screenFetch	3,725
11	catalyst	3,221
12	llmware	3,056
13	ranking	2,713
14	InvoiceNet	2,378
15	lucene	2,333
16	IP-Tracer	1,806
17	StringZilla	1,749
18	instructor-embedding	1,685
19	pke	1,519
20	beir	1,364
21	mteb	1,314
22	solr	1,003
23	pisa	849