Python information-retrieval

Open-source Python projects categorized as information-retrieval

Top 23 Python information-retrieval Projects

  1. EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.

    Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
  3. haystack

    AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

    Project mention: Ask HN: Who wants to be hired? (March 2025) | news.ycombinator.com | 2025-03-03

    Hey people! I'm actively searching for a job right now, but I'm also open to contracting if you need help integrating AI into your products.

    I’ve spent the past couple of years working on Haystack (https://github.com/deepset-ai/haystack) and am now building my own agent orchestrator framework. I love tackling interesting challenges, so if you’re working on something exciting and could use an extra hand let’s chat!

  4. gensim

    Topic Modelling for Humans

  5. onyx

    Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.

    Project mention: Show HN: Open-source Deep Research across workplace applications | news.ycombinator.com | 2025-03-03
  6. txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

    Project mention: Chunking your data for RAG | dev.to | 2025-02-11
  7. FlagEmbedding

    Retrieval and Retrieval-augmented LLMs

    Project mention: Understanding RAG (Part 5): Recommendations and wrap-up | dev.to | 2024-09-09

    Choosing the right embedding model is equally important for effective semantic matching of queries and chunk blocks. To select an appropriate open-source embedding model, the authors ran another experiment using the evaluation module of FlagEmbedding, with the dataset namespace-Pt/msmarco for queries and the dataset namespace-Pt/msmarco-corpus for the corpus. Metrics such as RR and MRR were used for evaluation.
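
    To make the evaluation metric concrete, here is a minimal sketch of mean reciprocal rank (MRR), reading "RR" as reciprocal rank, which is an assumption. It is not FlagEmbedding's evaluation module; the function names and the toy `runs` and `qrels` data are hypothetical.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average the reciprocal rank over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, qrels.get(query_id, set()))
        for query_id, ranked in runs.items()
    )
    return total / len(runs)


# Toy data (hypothetical): ranked document ids per query, and relevant ids per query.
runs = {
    "q1": ["d3", "d1", "d2"],  # first relevant hit at rank 2
    "q2": ["d5", "d4"],        # first relevant hit at rank 1
    "q3": ["d9", "d8"],        # no relevant document retrieved
}
qrels = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d7"}}

mrr = mean_reciprocal_rank(runs, qrels)
print(f"MRR = {mrr:.3f}")
```

    Comparing embedding models then amounts to producing one `runs` mapping per model over the same queries and corpus and comparing the resulting scores.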

  8. marqo

    Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

    Project mention: Ask HN: What's your serverless stack for AI/LLM apps in production? | news.ycombinator.com | 2025-01-10

    I have a hosted code-first agent builder platform in production, so I answer these questions a lot for our customers.

    1. Probably the best is fly.io IMHO. It has a nice balance between running ephemeral containers that can support long running tasks, and quickly booting up to respond to a tool call. [1]

    2. If your task is truly long running, (I'm thinking several minutes), probably wise to put trigger [2] or temporal [3] under it.

    3. A mix of prompt caching, context shedding, progressive context enrichment [4].

    4. I'm building a platform that can be self-hosted to do a few of the above, so I can't speak to this. But most of my customers do not.

    5. To start with, a simple postgres table and pgvector is all you need. But I've recently been delighted with the DX of Upstash vector [5]. They handle the embeddings for you and give you a text-in, text-out experience. If you want more control, and savings on a higher scale, have heard good things about marqo.ai [6].

    Happy to talk more about this at length. (E-mail in the profile)

    [1] https://fly.io/docs/reference/architecture/

    [2] trigger.dev

    [3] temporal.io

    [4] https://www.inferable.ai/blog/posts/llm-progressive-context-...

    [5] https://upstash.com/docs/vector/overall/getstarted

    [6] https://www.marqo.ai/

  10. catalyst

    Accelerated deep learning R&D (by catalyst-team)

  11. langroid

    Harness LLMs with Multi-Agent Programming

    Project mention: Understanding the BM25 full text search algorithm | news.ycombinator.com | 2024-11-19

    In the Langroid[1] LLM library we have a clean, extensible RAG implementation in the DocChatAgent[2] -- it uses several retrieval techniques, including lexical (bm25, fuzzy search) and semantic (embeddings), and re-ranking (using cross-encoder, reciprocal-rank-fusion) and also re-ranking for diversity and lost-in-the-middle mitigation:

    [1] Langroid - a multi-agent LLM framework from CMU/UW-Madison researchers https://github.com/langroid/langroid

    [2] DocChatAgent Implementation -
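
    Reciprocal-rank fusion, one of the re-ranking techniques named above, can be sketched in a few lines. This is a generic illustration, not Langroid's DocChatAgent code; the function name and the toy rankings are hypothetical, and `k = 60` is a commonly used default rather than a value taken from the source.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings (hypothetical) from a lexical and a semantic retriever.
lexical = ["d1", "d2", "d3"]
semantic = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

    Because only ranks are used, the lexical and semantic scores never need to be put on a common scale, which is the main reason this fusion is convenient for hybrid retrieval.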

  12. ranking

    Learning to Rank in TensorFlow

  13. InvoiceNet

    Deep neural network to extract intelligent information from invoice documents.

  14. instructor-embedding

    [ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

  15. beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

    Project mention: Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance! | dev.to | 2024-08-29

    The source code for these experiments is open-source and utilizes beir-qdrant, an integration of Qdrant with the BeIR library. While this package is not officially maintained by the Qdrant team, it may prove useful for those interested in experimenting with various Qdrant configurations to see how they impact retrieval quality. All experiments were conducted using Qdrant in exact search mode, ensuring the results are not influenced by approximate search.

  16. colpali

    The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.

    Project mention: Integrating Vision-Language Models into Agentic RAG Systems with ColPali | dev.to | 2025-03-31

    If you want to learn more about ColPali, you can refer to the official documentation. I would also recommend reading the 9-part blog series on RAG on DailyDoseofDS by Avi Chawla and Akshay Pachaar.

  17. pke

    Python Keyphrase Extraction module

  18. rank_bm25

    A Collection of BM25 Algorithms in Python

    Project mention: Show HN: BM25opt – 30-40 x faster BM25 search algorithms (FOSS) | news.ycombinator.com | 2024-10-31

    This is a good point and was a difficult design decision. The reasons for changing the API are:

    - easier to use with untokenized corpus and questions

    - to fix issues with the tokenizing ( e.g. https://github.com/dorianbrown/rank_bm25/issues/38 ); also rank_bm25 provides no default tokenizer, a naive split-on-whitespace is a wrong choice

    - considerably simplify the code (way less SLOC)

    - point out the similarities of the algorithms for educational purposes / further development

    In practice, the differences are minimal ( see Example 3: comparison with rank_bm25 ).
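
    The tokenization point above can be illustrated with a small self-contained BM25 scorer. This is a sketch only: it uses neither the rank_bm25 nor the BM25opt API, the names `tokenize` and `TinyBM25` are hypothetical, the IDF uses the always-positive `log(1 + ...)` variant, and `k1 = 1.5`, `b = 0.75` are typical illustrative values.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and keep runs of word characters, so punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


class TinyBM25:
    """Minimal Okapi-style BM25 over an untokenized corpus."""

    def __init__(self, corpus: List[str], k1: float = 1.5, b: float = 0.75):
        if not corpus:
            raise ValueError("corpus must contain at least one document")
        self.k1 = k1
        self.b = b
        self.docs = [Counter(tokenize(doc)) for doc in corpus]
        self.doc_len = [sum(counts.values()) for counts in self.docs]
        # Fall back to 1.0 so an all-empty corpus does not divide by zero.
        self.avgdl = sum(self.doc_len) / len(self.docs) or 1.0
        doc_freq: Counter = Counter()
        for counts in self.docs:
            doc_freq.update(counts.keys())
        n = len(self.docs)
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def scores(self, query: str) -> List[float]:
        """Return one BM25 score per document for the given query string."""
        query_terms = tokenize(query)
        result = []
        for counts, length in zip(self.docs, self.doc_len):
            norm = self.k1 * (1 - self.b + self.b * length / self.avgdl)
            score = 0.0
            for term in query_terms:
                freq = counts.get(term, 0)
                if freq:
                    score += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
            result.append(score)
        return result


corpus = ["Hello, world!", "BM25 ranks documents.", "Goodbye world"]
bm25 = TinyBM25(corpus)
hello_scores = bm25.scores("hello")
print(hello_scores)
```

    A plain `"Hello, world!".split()` would produce the token `Hello,`, which never matches the query term `hello`; the regex tokenizer avoids that mismatch.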

  19. splade

    SPLADE: sparse neural search (SIGIR21, SIGIR22)

    Project mention: BM25 in PostgreSQL – 3x Faster Than Elasticsearch | news.ycombinator.com | 2025-03-02

    https://github.com/naver/splade

    I'm sure the field has progressed since then, but it sounds like it is still best to not invest in vector search.

    The real lesson, it seems, is we need to know our needs, data, etc and act accordingly - most apparently do not do that.

  20. RankGPT

    Is ChatGPT Good at Search? LLMs as Re-Ranking Agent [EMNLP 2023 Outstanding Paper Award]

    Project mention: Show HN: Rerank-Ts – TypeScript Library for Re-Ranking Search Results with LLMs | news.ycombinator.com | 2024-06-11

    1. LLM-based re-ranking: It uses the algorithm presented in the paper "Is ChatGPT Good at Search?" (https://arxiv.org/abs/2304.09542), which implements a sliding-window algorithm to re-rank search results that could potentially be larger than the context length of an LLM. We added support for Llama3 and GPT-4. For Llama3, we are using Groq, but other model providers can be added easily.
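
    The sliding-window idea can be sketched as follows. This is not the RankGPT or Rerank-Ts code: `rank_window` is a hypothetical stand-in for the LLM call that reorders one window, the toy relevance table replaces a real model, and the window size and step are illustrative values.

```python
from typing import Callable, List, Sequence


def sliding_window_rerank(
    items: Sequence[str],
    rank_window: Callable[[List[str]], List[str]],
    window_size: int = 4,
    step: int = 2,
) -> List[str]:
    """Re-rank a long list by reordering overlapping windows from back to front."""
    if not 0 < step <= window_size:
        raise ValueError("require 0 < step <= window_size")
    ranked = list(items)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window_size)
        # Only this slice has to fit into the model's context at once.
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        end -= step
    return ranked


# Toy stand-in for the LLM (hypothetical): sort a window by a fixed relevance table.
relevance = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}


def toy_rank_window(window: List[str]) -> List[str]:
    return sorted(window, key=lambda doc: -relevance[doc])


candidates = ["a", "b", "c", "d", "e", "f"]  # the best documents start at the back
reranked = sliding_window_rerank(candidates, toy_rank_window)
print(reranked)
```

    In this toy run a single back-to-front pass moves the two most relevant items to the front, while the tail is only partially ordered; the overlap between consecutive windows is what carries strong candidates forward.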

  21. ranx

    ⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

  22. AnglE

    Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard (by SeanLee97)

  23. sycamore

    🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data. (by aryn-ai)

  24. continuous-eval

    Data-Driven Evaluation for LLM-Powered Applications

    Project mention: Show HN: Ellipsis – Automated PR reviews and bug fixes | news.ycombinator.com | 2024-05-09
  25. Rankify

    🔥 Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation 🔥. Our toolkit integrates 40 pre-retrieved benchmark datasets and supports 7+ retrieval techniques, 24+ state-of-the-art Reranking models, and multiple RAG methods.

    Project mention: Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and RAG | news.ycombinator.com | 2025-04-05
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python information-retrieval related posts

  • Integrating Vision-Language Models into Agentic RAG Systems with ColPali

    2 projects | dev.to | 31 Mar 2025
  • Ask HN: What's your serverless stack for AI/LLM apps in production?

    1 project | news.ycombinator.com | 10 Jan 2025
  • Lists of open-source frameworks for building RAG applications

    7 projects | dev.to | 2 Jan 2025
  • Pinecone integrates AI inferencing with vector database

    2 projects | news.ycombinator.com | 4 Dec 2024
  • Parse Markdown content from PDF documents

    1 project | news.ycombinator.com | 3 Dec 2024
  • 7 AI Open Source Libraries To Build RAG, Agents & AI Search

    5 projects | dev.to | 14 Nov 2024
  • Show HN: BM25opt – 30-40 x faster BM25 search algorithms (FOSS)

    3 projects | news.ycombinator.com | 31 Oct 2024

Index

What are some of the best open-source information-retrieval projects in Python? This list will help you:

#   Project               Stars
1   EasyOCR               26,303
2   haystack              20,338
3   gensim                15,968
4   onyx                  12,674
5   txtai                 10,706
6   FlagEmbedding         9,355
7   marqo                 4,825
8   catalyst              3,336
9   langroid              3,226
10  ranking               2,767
11  InvoiceNet            2,585
12  instructor-embedding  1,931
13  beir                  1,776
14  colpali               1,741
15  pke                   1,581
16  rank_bm25             1,143
17  splade                835
18  RankGPT               578
19  ranx                  537
20  AnglE                 531
21  sycamore              505
22  continuous-eval       487
23  Rankify               394
