Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality. Learn more →
Top 23 Python information-retrieval Projects
-
EasyOCR
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
haystack
:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
-
txtai
💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
-
ragflow
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
-
megabots
🤖 State-of-the-art, production ready LLM apps made mega-easy, so you don't have to build them from scratch 🤯 Create a bot, now 🫵
-
gpl
Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577 (by UKPLab)
-
forte
Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Project mention: Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide | dev.to | 2023-12-27PyTesseract Module [ Github ] EasyOCR Module [ Github ] PaddlePaddle OCR [ Github ]
View on GitHub
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.
Project mention: RAGFlow is an open-source RAG engine based on deep document understanding | news.ycombinator.com | 2024-04-01Just link them to https://github.com/infiniflow/ragflow/blob/main/rag/llm/chat... :)
We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems and everything that comes with that. I would love to chat about 1 and 2 so feel free to email me (email is in my profile). What we have done so far is here -> https://github.com/marqo-ai/marqo
Project mention: Instance segmentation of small objects in grainy drone imagery | /r/computervision | 2023-12-09
Project mention: More Agents Is All You Need: LLMs performance scales with the number of agents | news.ycombinator.com | 2024-04-06I couldn't agree more. You should check out LLMWare's SLIM agents (https://github.com/llmware-ai/llmware/tree/main/examples/SLI...). It's focusing on pretty much exactly this and chaining multiple local LLMs together.
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit incorrect) depending on what the model is indended for. The LLMWare team did a good 2 part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)
I think dedicated miniture LLMs are the way forward.
Disclaimer - Not affiliated with them in any way, just think it's a really cool project.
Project mention: My experience on starting with fine tuning LLMs with custom data | /r/LocalLLaMA | 2023-07-10If you li embeddings and vector DB, you should look into this: https://github.com/HKUNLP/instructor-embedding
The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfer with context boundaries
- Content vectors can differ vastly from question vectors, for this you have to use hypothetical embeddings (they generate artificial questions and store them)
- Instead of saving just one embedding per text-chuck you should store various (text chunk, hypothetical embedding questions, meta data)
- RAG will miserably fail with requests like "summarize the whole document"
- to my knowledge, openAI embeddings aren't performing well, use a embedding that is optimized for question answering or information retrieval and supports multi language. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
1 https://github.com/underlines/awesome-marketing-datascience/...
Rank-BM25 project, the top result when searching for python bm25.
Finally at the end of the pipeline, I added a more expensive filtering and ranking step inspired by RankGPT to do a final ordering over the remaining video clips, picking only the top 10-15 to recommend to the user.
Ranx is a great library for mixing results from different sources.
Project mention: 🤖 Release 0.0.11 in Megabots | Memory and Vectorstores are live! | /r/LLMDevs | 2023-04-26
If you are interested, you can check out the documentation here: https://github.com/raphaelsty/cherche
Project mention: Best pathway for Domain Adaptation with Sentence Transformers? | /r/LanguageTechnology | 2023-04-263) Domain-adapted my bi-encoder using GPL (https://github.com/UKPLab/gpl) and my original corpus from step 1.
Project mention: Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps | news.ycombinator.com | 2024-03-08Hi HN, we are the founders of Relari, the company behind continuous-eval (https://github.com/relari-ai/continuous-eval), an evaluation framework that lets you test your GenAI systems at the component level, pinpointing issues where they originate.
We experienced the need for this when we were building a copilot for bankers. Our RAG pipeline blew up in complexity as we added components: a query classifier (to triage user intent), multiple retrievers (to grab information from different sources), filtering LLM (to rerank / compress context), a calculator agent (to call financial functions) and finally the synthesizer LLM that gives the answer. Ensuring reliability became more difficult with each of these we added.
When a bad response was detected by our answer evaluator, we had to backtrack multiple steps to understand which component(s) made a mistake. But this quickly became unscalable beyond a few samples.
I did my Ph.D. in fault detection for autonomous vehicles, and I see a strong parallel between the complexity of autonomous driving software and today's LLM pipelines. In self-driving systems, sensors, perception, prediction, planning, and control modules are all chained together. To ensure system-level safety, we use granular metrics to measure the performance of each module individually. When the vehicle makes an unexpected decision, we use these metrics to pinpoint the problem to a specific component. Only then we can make targeted improvements, systematically.
Based on this thinking, we developed the first version of continuous-eval for ourselves. Since then we’ve made it more flexible to fit various types of GenAI pipelines. Continuous-eval allows you to describe (programmatically) your pipeline and modules, and select metrics for each module. We developed 30+ metrics to cover retrieval, text generation, code generation, classification, agent tool use, etc. We now have a number of companies using us to test complex pipelines like finance copilots, enterprise search, coding agents, etc.
As an example, one customer was trying to understand why their RAG system did poorly on trend analysis queries. Through continuous-eval, they realized that the “retriever” component was retrieving 80%+ of all relevant chunks, but the “reranker” component, that filters out “irrelevant” context, was dropping that to below 50%. This enabled them to fix the problem, in their case by skipping the reranker for certain queries.
We’ve also built ensemble metrics that do a surprisingly good job of predicting user feedback. Users often rate LLM-generated answers by giving a thumbs up/down about how good the answer was. We train our custom metrics on this user data, and then use those metrics to generate thumbs up/down ratings on future LLM answers. The results turn out to be 90% aligned with what the users say. This gives developers a feedback loop from production data to offline testing and development. Some customers have found this to be our most unique advantage.
Lastly, to make the most out of evaluation, you should use a diverse dataset—ideally with ground truth labels for comprehensive and consistent assessment. Because ground truth labels are costly and time-consuming to curate manually, we also have a synthetic data generation pipeline that allows you to get started quickly. Try it here (https://www.relari.ai/#synthetic_data_demo)
What’s been your experience testing and iterating LLM apps? Please let us know your thoughts and feedback on our approaches (modular framework, leveraging user feedback, testing with synthetic data).
Python information-retrieval related posts
- Splade: Sparse Neural Search
- Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps
- On building a semantic search engine
- Ask HN: Is there any good semantic search GUI for images or documents?
- BEIR: A Heterogeneous Benchmark for Information Retrieval
- Benefits of hybrid search
- Show HN: Marqo – Vectorless Vector Search
-
A note from our sponsor - InfluxDB
www.influxdata.com | 26 Apr 2024
Index
What are some of the best open-source information-retrieval projects in Python? This list will help you:
Project | Stars | |
---|---|---|
1 | EasyOCR | 21,882 |
2 | gensim | 15,236 |
3 | haystack | 13,633 |
4 | txtai | 6,953 |
5 | ragflow | 5,516 |
6 | marqo | 4,111 |
7 | catalyst | 3,223 |
8 | llmware | 3,086 |
9 | ranking | 2,714 |
10 | InvoiceNet | 2,382 |
11 | instructor-embedding | 1,695 |
12 | pke | 1,522 |
13 | beir | 1,372 |
14 | mteb | 1,372 |
15 | rank_bm25 | 845 |
16 | splade | 629 |
17 | RankGPT | 408 |
18 | ranx | 336 |
19 | megabots | 335 |
20 | cherche | 311 |
21 | gpl | 308 |
22 | continuous-eval | 302 |
23 | forte | 236 |
Sponsored