Top 23 Python information-retrieval Projects
-
EasyOCR
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.
-
haystack
AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
Hey people! I'm actively searching for a job right now, but I'm also open to contracting if you need help integrating AI into your products.
I’ve spent the past couple of years working on Haystack (https://github.com/deepset-ai/haystack) and am now building my own agent orchestrator framework. I love tackling interesting challenges, so if you’re working on something exciting and could use an extra hand, let’s chat!
-
Project mention: Show HN: Open-source Deep Research across workplace applications | news.ycombinator.com | 2025-03-03
-
txtai
💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
-
Choosing the right embedding model is equally important for effective semantic matching between queries and chunks. To select an appropriate open-source embedding model, the authors ran another experiment using the evaluation module of FlagEmbedding, with the dataset namespace-Pt/msmarco for queries and the dataset namespace-Pt/msmarco-corpus for the corpus. Metrics such as RR and MRR were used for evaluation.
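For readers unfamiliar with the MRR metric mentioned above, here is a minimal sketch of how mean reciprocal rank can be computed. It is not the FlagEmbedding evaluation code; the function names, the `k` cutoff, and the toy document IDs are assumptions made for illustration only.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels, k=10):
    """Average the reciprocal rank over all judged queries, using only the top-k results."""
    scores = [reciprocal_rank(runs.get(qid, [])[:k], relevant) for qid, relevant in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical IDs, not taken from MS MARCO).
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d5", "d2", "d9"], "q3": ["d4", "d6", "d8"]}
qrels = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d0"}}

# q1 hits at rank 2 (0.5), q2 at rank 1 (1.0), q3 misses (0.0).
mrr = mean_reciprocal_rank(runs, qrels)
print(f"MRR@10 = {mrr:.3f}")
```

Comparing two embedding models then amounts to producing one `runs` dictionary per model over the same queries and comparing the resulting scores.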
-
Project mention: Ask HN: What's your serverless stack for AI/LLM apps in production? | news.ycombinator.com | 2025-01-10
I have a hosted code-first agent builder platform in production, so I answer these questions a lot for our customers.
1. Probably the best is fly.io IMHO. It has a nice balance between running ephemeral containers that can support long running tasks, and quickly booting up to respond to a tool call. [1]
2. If your task is truly long running (I'm thinking several minutes), it's probably wise to put trigger [2] or temporal [3] under it.
3. A mix of prompt caching, context shedding, progressive context enrichment [4].
4. I'm building a platform that can be self-hosted to do a few of the above, so I can't speak to this. But most of my customers do not.
5. To start with, a simple postgres table and pgvector is all you need. But I've recently been delighted with the DX of Upstash vector [5]. They handle the embeddings for you and give you a text-in, text-out experience. If you want more control, and savings at a higher scale, I have heard good things about marqo.ai [6].
Happy to talk more about this at length. (E-mail in the profile)
[1] https://fly.io/docs/reference/architecture/
[2] trigger.dev
[3] temporal.io
[4] https://www.inferable.ai/blog/posts/llm-progressive-context-...
[5] https://upstash.com/docs/vector/overall/getstarted
[6] https://www.marqo.ai/
-
Project mention: Understanding the BM25 full text search algorithm | news.ycombinator.com | 2024-11-19
In the Langroid[1] LLM library we have a clean, extensible RAG implementation in the DocChatAgent[2] -- it uses several retrieval techniques, including lexical (bm25, fuzzy search) and semantic (embeddings), and re-ranking (using cross-encoder, reciprocal-rank-fusion) and also re-ranking for diversity and lost-in-the-middle mitigation:
[1] Langroid - a multi-agent LLM framework from CMU/UW-Madison researchers https://github.com/langroid/langroid
[2] DocChatAgent Implementation -
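As a rough illustration of the reciprocal-rank-fusion step mentioned in the comment above, the sketch below merges a lexical and a semantic ranking into one list. This is not Langroid's DocChatAgent code; the function name and the toy document IDs are hypothetical, and `k=60` is the smoothing constant commonly used with this method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical (e.g. BM25) and a semantic (embedding) retriever.
lexical = ["d1", "d2", "d3"]
semantic = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

Because only ranks are used, the two retrievers do not need comparable score scales, which is the usual reason for preferring this fusion over averaging raw scores.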
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Project mention: Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance! | dev.to | 2024-08-29
The source code for these experiments is open-source and utilizes beir-qdrant, an integration of Qdrant with the BeIR library. While this package is not officially maintained by the Qdrant team, it may prove useful for those interested in experimenting with various Qdrant configurations to see how they impact retrieval quality. All experiments were conducted using Qdrant in exact search mode, ensuring the results are not influenced by approximate search.
-
colpali
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
Project mention: Integrating Vision-Language Models into Agentic RAG Systems with ColPali | dev.to | 2025-03-31
If you want to learn more about ColPali, you can refer to the official documentation. I would also recommend reading the 9-part blog series on RAG on DailyDoseofDS by Avi Chawla and Akshay Pachaar.
-
Project mention: Show HN: BM25opt – 30-40 x faster BM25 search algorithms (FOSS) | news.ycombinator.com | 2024-10-31
This is a good point and was a difficult design decision. The reasons for changing the API are:
- easier to use with untokenized corpus and questions
- to fix issues with tokenizing (e.g. https://github.com/dorianbrown/rank_bm25/issues/38); also, rank_bm25 provides no default tokenizer, and a naive split-on-whitespace is the wrong choice
- to considerably simplify the code (way less SLOC)
- to point out the similarities of the algorithms for educational purposes / further development
In practice, the differences are minimal (see Example 3: comparison with rank_bm25).
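To make the tokenization point concrete, here is a small self-contained BM25 sketch that accepts untokenized text and applies a simple regex tokenizer instead of splitting on whitespace. It does not reproduce the BM25opt or rank_bm25 API; the class name, the tokenizer, the default `k1`/`b` values, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep word characters only, so 'fox.' and 'Fox' both become 'fox'."""
    return re.findall(r"\w+", text.lower())


class BM25:
    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(doc)) for doc in corpus]   # term frequencies per document
        self.doc_len = [sum(tf.values()) for tf in self.docs]
        self.avgdl = sum(self.doc_len) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for tf in self.docs for term in tf)     # document frequencies
        # IDF variant with log(1 + ...), which never goes negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def scores(self, query):
        """Return one BM25 score per document for an untokenized query string."""
        terms = tokenize(query)
        result = []
        for tf, dl in zip(self.docs, self.doc_len):
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)  # length normalization
            result.append(sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                              for t in terms if t in tf))
        return result


corpus = ["The quick brown fox.", "A lazy dog sleeps.", "Quick foxes and quick dogs!"]
bm25 = BM25(corpus)
scores = bm25.scores("quick fox")
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, corpus[best])
```

With a whitespace split, the query term `fox` would not match the token `fox.` in the first document, which is the kind of issue the comment above refers to.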
-
Project mention: BM25 in PostgreSQL – 3x Faster Than Elasticsearch | news.ycombinator.com | 2025-03-02
https://github.com/naver/splade
I'm sure the field has progressed since then, but it sounds like it is still best to not invest in vector search.
The real lesson, it seems, is that we need to know our needs, data, etc., and act accordingly - most apparently do not do that.
-
Project mention: Show HN: Rerank-Ts – TypeScript Library for Re-Ranking Search Results with LLMs | news.ycombinator.com | 2024-06-11
1. LLM-based re-ranking: It uses the algorithm presented in the paper "Is ChatGPT Good at Search?" (https://arxiv.org/abs/2304.09542), which implements a sliding-window algorithm to re-rank search results that could be larger than the context length of an LLM. We added support for Llama3 and GPT-4. For Llama3, we are using Groq, but other model providers can be added easily.
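The sliding-window idea can be sketched roughly as follows: a window moves from the bottom of the candidate list to the top, and each window is re-ordered on its own, so no single call has to see the whole list. This is an illustrative sketch, not the Rerank-Ts or RankGPT code. In the real method the per-window re-ranker is an LLM call; here it is replaced by a hypothetical stand-in that sorts by made-up relevance values, and the parameter names are assumptions.

```python
def sliding_window_rerank(items, rerank_window, window_size=4, step=2):
    """Re-rank a long list by re-ordering overlapping windows from the end toward the start."""
    ranked = list(items)
    end = len(ranked)
    start = max(0, end - window_size)
    while True:
        # Only `window_size` items are handed to the re-ranker at a time.
        ranked[start:end] = rerank_window(ranked[start:end])
        if start == 0:
            break
        end -= step
        start = max(0, end - window_size)
    return ranked


# Hypothetical relevance values standing in for an LLM's judgment.
relevance = {"d1": 0.2, "d2": 0.1, "d3": 0.3, "d4": 0.05, "d5": 0.4, "d6": 0.0, "d7": 0.9}


def fake_llm_rerank(window):
    """Stand-in for the LLM call: order the window by the made-up relevance values."""
    return sorted(window, key=lambda doc_id: relevance[doc_id], reverse=True)


items = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
result = sliding_window_rerank(items, fake_llm_rerank)
print(result)
```

Because consecutive windows overlap, strong candidates near the bottom can be carried upward window by window; the head of the final list is the part this procedure is designed to get right, while the tail may remain only partially ordered.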
-
AnglE
Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard (by SeanLee97)
-
sycamore
🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data. (by aryn-ai)
-
Project mention: Show HN: Ellipsis – Automated PR reviews and bug fixes | news.ycombinator.com | 2024-05-09
-
Rankify
🔥 Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation 🔥. Our toolkit integrates 40 pre-retrieved benchmark datasets and supports 7+ retrieval techniques, 24+ state-of-the-art Reranking models, and multiple RAG methods.
Project mention: Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and RAG | news.ycombinator.com | 2025-04-05
Python information-retrieval discussion
Python information-retrieval related posts
-
Integrating Vision-Language Models into Agentic RAG Systems with ColPali
-
Ask HN: What's your serverless stack for AI/LLM apps in production?
-
Lists of open-source frameworks for building RAG applications
-
Pinecone integrates AI inferencing with vector database
-
Parse Markdown content from PDF documents
-
7 AI Open Source Libraries To Build RAG, Agents & AI Search
-
Show HN: BM25opt – 30-40 x faster BM25 search algorithms (FOSS)
-
A note from our sponsor - InfluxDB
influxdata.com | 18 Apr 2025
Index
What are some of the best open-source information-retrieval projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | EasyOCR | 26,303 |
2 | haystack | 20,338 |
3 | gensim | 15,968 |
4 | onyx | 12,674 |
5 | txtai | 10,706 |
6 | FlagEmbedding | 9,355 |
7 | marqo | 4,825 |
8 | catalyst | 3,336 |
9 | langroid | 3,226 |
10 | ranking | 2,767 |
11 | InvoiceNet | 2,585 |
12 | instructor-embedding | 1,931 |
13 | beir | 1,776 |
14 | colpali | 1,741 |
15 | pke | 1,581 |
16 | rank_bm25 | 1,143 |
17 | splade | 835 |
18 | RankGPT | 578 |
19 | ranx | 537 |
20 | AnglE | 531 |
21 | sycamore | 505 |
22 | continuous-eval | 487 |
23 | Rankify | 394 |