Top 23 Python llmops Projects

jina

126 20,009 9.2 Python

☁️ Build multimodal AI applications with cloud-native stack

Project mention: Jina.ai: Self-host Multimodal models | news.ycombinator.com | 2024-01-26

vllm

30 18,041 9.9 Python

A high-throughput and memory-efficient inference and serving engine for LLMs

Project mention: Mistral AI Launches New 8x22B Moe Model | news.ycombinator.com | 2024-04-09

The easiest way is to use vllm (https://github.com/vllm-project/vllm) to run it on a couple of A100s, and you can benchmark it using this library (https://github.com/EleutherAI/lm-evaluation-harness).

OpenLLM

25 8,733 9.9 Python

Run any open-source LLMs, such as Llama 2, Mistral, as OpenAI compatible API endpoint, locally and in the cloud.

Project mention: First 15 Open Source Advent projects | dev.to | 2023-12-15

13. OpenLLM by BentoML | Github | tutorial

BentoML

16 6,537 9.8 Python

The most flexible way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Inference Graph/Pipelines, Compound AI systems, Multi-Modal, RAG as a Service, and more!

Project mention: Who's hiring developer advocates? (December 2023) | dev.to | 2023-12-04

ragflow

6 5,516 9.5 Python

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

Project mention: RAGFlow is an open-source RAG engine based on deep document understanding | news.ycombinator.com | 2024-04-01

Just link them to https://github.com/infiniflow/ragflow/blob/main/rag/llm/chat... :)

ragas

10 4,549 9.6 Python

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines

Project mention: Show HN: Ragas – the de facto open-source standard for evaluating RAG pipelines | news.ycombinator.com | 2024-03-21

congrats on launching! i think my continuing struggle with looking at Ragas as a company rather than an oss library is that the core of it is like 8 metrics (https://github.com/explodinggradients/ragas/tree/main/src/ra...) that are each 1-200 LOC. i can inline that easily in my app and retain full control, or model that in langchain or haystack or whatever.
why is Ragas a library and a company, rather than an overall "standard" or philosophy (eg like Heroku's 12 Factor Apps) that could maybe be more robust?
(just giving an opp to pitch some underappreciated benefits of using this library)

zenml

33 3,657 9.8 Python

ZenML 🙏: Build portable, production-ready MLOps pipelines. https://zenml.io.

Project mention: FLaNK AI - 01 April 2024 | dev.to | 2024-04-01

phidata

14 3,622 9.9 Python

Build AI Assistants with memory, knowledge and tools.

Project mention: Show HN: Use function calling to build AI Assistants | news.ycombinator.com | 2024-02-27

giskard

7 3,111 10.0 Python

🐢 Open-Source Evaluation & Testing framework for LLMs and ML models

Project mention: Show HN: Evaluate LLM-based RAG Applications with automated test set generation | news.ycombinator.com | 2024-04-11

llm-app

12 2,479 8.9 Python

LLM App templates for RAG, knowledge mining, and stream analytics. Ready to run with Docker,⚡in sync with your data sources.

Project mention: How to use LLMs for real-time alerting | dev.to | 2023-11-20

Answering queries and defining alerts: Our application running on Pathway LLM-App exposes the HTTP REST API endpoint to send queries and receive real-time responses. It is used by the Streamlit UI app. Queries are answered by looking up relevant documents in the index, as in the Retrieval-augmented generation (RAG) implementation. Next, queries are categorized for intent: an LLM probes them for natural language commands synonymous with notify or send an alert.
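The flow described above (look up relevant documents, then check the query for alerting intent) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Pathway LLM-App API: `retrieve`, `wants_alert`, `handle`, the sample documents, and the keyword rule standing in for the LLM intent check are all hypothetical.

```python
# Toy sketch of the query flow: retrieval followed by intent detection.
# Not the Pathway LLM-App API; all names here are hypothetical.

DOCS = [
    "The deploy pipeline failed on Monday.",
    "Disk usage on the database server is at 91 percent.",
    "The weekly report was published.",
]

# Stand-in for the LLM intent check: a real app would ask a model
# whether the query contains a command synonymous with "notify".
ALERT_PHRASES = ("notify me", "alert me", "send an alert", "let me know")


def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into a set of bare words."""
    return {word.strip(".,?!") for word in text.lower().split()}


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (a crude index lookup)."""
    query_words = tokens(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokens(d)), reverse=True)
    return ranked[:k]


def wants_alert(query: str) -> bool:
    """Return True if the query asks to be notified about changes."""
    lowered = query.lower()
    return any(phrase in lowered for phrase in ALERT_PHRASES)


def handle(query: str) -> dict:
    """Answer a query from the index and flag whether to register an alert."""
    return {"context": retrieve(query, DOCS), "register_alert": wants_alert(query)}
```

Calling `handle("What is the disk usage? Notify me when it changes.")` returns the disk-usage document as context and sets `register_alert` to `True`; a query without a notify-style phrase leaves the flag `False`.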

uptrain

34 1,976 9.7 Python

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

Project mention: Evaluation of OpenAI Assistants | dev.to | 2024-04-09

Currently seeking feedback for the developed tool. Would love it if you can check it out on: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb

openllmetry

1 1,224 9.8 Python

Open-source observability for your LLM application, based on OpenTelemetry

Project mention: Show HN: You don't need to adopt new tools for LLM observability | news.ycombinator.com | 2024-02-14

So why should it be different when the app you're building happened to be using LLMs?
So today we're open-sourcing OpenLLMetry-JS. It's an open protocol and SDK, based on OpenTelemetry, that provides traces and metrics for LLM JS/TS applications and can be connected to any of the 15+ tools that already support OpenTelemetry. Here's the repo: https://github.com/traceloop/openllmetry-js
A few months ago we launched the python flavor here (https://news.ycombinator.com/item?id=37843907) and we've now built a compatible one for Node.js.
Would love to hear your thoughts and opinions!
Check it out:
Docs: https://www.traceloop.com/docs/openllmetry/getting-started-t...

LLMStack

20 1,089 9.9 Python

No-code platform to build LLM Agents, workflows and applications with your data

Project mention: Vanna.ai: Chat with your SQL database | news.ycombinator.com | 2024-01-14

We have recently added support to query data from SingleStore to our agent framework, LLMStack (https://github.com/trypromptly/LLMStack). Out-of-the-box performance when prompting with just the table schemas is pretty good with GPT-4.
The more domain-specific knowledge a query needs, the harder it gets in general. We've had good success "teaching" the model different concepts in relation to the dataset, and giving it example questions and queries greatly improved performance.
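The approach mentioned in this comment (prompting with table schemas, then adding example questions and queries) might look roughly like the prompt builder below. It is a sketch only: the prompt layout, the `build_sql_prompt` name, and the sample `orders` schema are illustrative assumptions, not LLMStack code.

```python
# Sketch of a text-to-SQL prompt builder. The prompt layout, function
# name, and sample schema are illustrative assumptions, not LLMStack code.

def build_sql_prompt(schemas, examples, question):
    """Assemble a prompt from table schemas, few-shot examples, and a question."""
    parts = ["You translate questions into SQL.", "", "Table schemas:"]
    parts.extend(schemas)
    if examples:
        # Few-shot pairs: the "example questions and queries" from the comment.
        parts.append("")
        parts.append("Example questions and queries:")
        for example_question, example_sql in examples:
            parts.append(f"Q: {example_question}")
            parts.append(f"SQL: {example_sql}")
    parts.extend(["", f"Q: {question}", "SQL:"])
    return "\n".join(parts)


prompt = build_sql_prompt(
    schemas=["CREATE TABLE orders (id INT, customer_id INT, total DECIMAL, created_at DATE);"],
    examples=[("How many orders are there?", "SELECT COUNT(*) FROM orders;")],
    question="What is the total revenue in 2023?",
)
print(prompt)
```

Passing an empty `examples` list gives the schema-only prompt, which makes it easy to compare both variants on the same questions.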

lanarky

1 939 8.6 Python

The web framework for building LLM microservices

Project mention: Lanarky: Deploy LLM applications in production, built on FastAPI | news.ycombinator.com | 2023-06-10

agenta

8 823 10.0 Python

The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.

Project mention: Ask HN: How are you testing your LLM applications? | news.ycombinator.com | 2024-02-06

I am biased, but I would use a platform and not roll your own solution. You will tend to underestimate the depth of capabilities needed for an eval framework.
Now for solutions, shameless plug here, we are building an open-source platform for experimenting and evaluating complex LLM apps (https://github.com/agenta-ai/agenta). We offer automatic evaluators as well as human annotation capabilities. Currently, we only provide testing before deployment, but we have plans to include post-production evaluations as well.
Other tools I would look at in the space are promptfoo (also open-source, more dev oriented), humanloop (one of the most feature-complete tools in the space, but more enterprise oriented and costly) and vellum (YC company, more focused on product teams).

llm-guard

2 821 9.6 Python

The Security Toolkit for LLM Interactions

Project mention: llm-guard: The Security Toolkit for LLM Interactions | /r/blueteamsec | 2023-09-19

langcorn

3 812 7.4 Python

⛓️ Serving LangChain LLM apps and agents automagically with FastAPI. LLMops

NeumAI

2 774 8.7 Python

Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21

Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...

DataDreamer

5 632 8.1 Python

DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤

Project mention: FLaNK AI - 01 April 2024 | dev.to | 2024-04-01

llmflows

1 615 8.6 Python

LLMFlows - Simple, Explicit and Transparent LLM Apps

Project mention: Show HN: LLMFlows – LangChain alternative for explicit and transparent apps | news.ycombinator.com | 2023-07-29

burr

3 410 9.6 Python

Build applications that make decisions (chatbots, agents, simulations, etc...). Monitor, persist, and execute on your own infrastructure.

Project mention: Building an Email Assistant Application with Burr | dev.to | 2024-04-26

Burr is a lightweight python library you use to build applications as state machines. You construct your application out of a series of actions (these can be either decorated functions or objects), which declare inputs from state, as well as inputs from the user. These specify custom logic (delegating to any framework), as well as instructions on how to update state. State is immutable, which allows you to inspect it at any given point. Burr handles orchestration, monitoring and persistence.
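The idea in this description (actions that read from state and return an updated, immutable state, with transitions choosing the next action) can be sketched with the standard library alone. This is not Burr's API; `State`, `increment`, `reset`, `next_action`, and `run` are hypothetical names chosen for illustration.

```python
# Minimal state-machine sketch in plain Python. This is NOT Burr's API;
# every name below is hypothetical and chosen for illustration.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    """Immutable application state: updates return a new object."""
    counter: int = 0
    last_input: str = ""


def increment(state: State, user_input: str) -> State:
    """Action: reads `counter` from state and records the user's input."""
    return replace(state, counter=state.counter + 1, last_input=user_input)


def reset(state: State, user_input: str) -> State:
    """Action: sets the counter back to zero."""
    return replace(state, counter=0, last_input=user_input)


ACTIONS = {"increment": increment, "reset": reset}


def next_action(state: State) -> str:
    """Transition rule: reset once the counter reaches 3."""
    return "reset" if state.counter >= 3 else "increment"


def run(inputs: list[str]) -> list[State]:
    """Step through the machine, keeping every intermediate state."""
    state = State()
    history = [state]
    for user_input in inputs:
        state = ACTIONS[next_action(state)](state, user_input)
        history.append(state)
    return history
```

Because each `State` is frozen, the list returned by `run` is a full record of the run that can be inspected at any step without worrying that a later action changed it.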

cognita

1 349 7.9 Python

Cognita by TrueFoundry - Framework for building modular, open source RAG applications for production.

Project mention: Dream – A Distributed RAG Experimentation Framework | news.ycombinator.com | 2024-04-21

Hi, I've come across an open-source, API-driven RAG framework that launched recently. It differs from other frameworks in a lot of ways. Give it a try and let me know your thoughts: https://github.com/truefoundry/cognita

continuous-eval

3 302 8.4 Python

Open-Source Evaluation for GenAI Application Pipelines

Project mention: Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps | news.ycombinator.com | 2024-03-08

Hi HN, we are the founders of Relari, the company behind continuous-eval (https://github.com/relari-ai/continuous-eval), an evaluation framework that lets you test your GenAI systems at the component level, pinpointing issues where they originate.
We experienced the need for this when we were building a copilot for bankers. Our RAG pipeline blew up in complexity as we added components: a query classifier (to triage user intent), multiple retrievers (to grab information from different sources), a filtering LLM (to rerank / compress context), a calculator agent (to call financial functions) and finally the synthesizer LLM that gives the answer. Ensuring reliability became more difficult with each component we added.
When a bad response was detected by our answer evaluator, we had to backtrack multiple steps to understand which component(s) made a mistake. But this quickly became unscalable beyond a few samples.
I did my Ph.D. in fault detection for autonomous vehicles, and I see a strong parallel between the complexity of autonomous driving software and today's LLM pipelines. In self-driving systems, sensors, perception, prediction, planning, and control modules are all chained together. To ensure system-level safety, we use granular metrics to measure the performance of each module individually. When the vehicle makes an unexpected decision, we use these metrics to pinpoint the problem to a specific component. Only then can we make targeted improvements, systematically.
Based on this thinking, we developed the first version of continuous-eval for ourselves. Since then we’ve made it more flexible to fit various types of GenAI pipelines. Continuous-eval allows you to describe (programmatically) your pipeline and modules, and select metrics for each module. We developed 30+ metrics to cover retrieval, text generation, code generation, classification, agent tool use, etc. We now have a number of companies using us to test complex pipelines like finance copilots, enterprise search, coding agents, etc.
As an example, one customer was trying to understand why their RAG system did poorly on trend analysis queries. Through continuous-eval, they realized that the “retriever” component was retrieving 80%+ of all relevant chunks, but the “reranker” component, which filters out “irrelevant” context, was dropping that to below 50%. This enabled them to fix the problem, in their case by skipping the reranker for certain queries.
We’ve also built ensemble metrics that do a surprisingly good job of predicting user feedback. Users often rate LLM-generated answers by giving a thumbs up/down about how good the answer was. We train our custom metrics on this user data, and then use those metrics to generate thumbs up/down ratings on future LLM answers. The results turn out to be 90% aligned with what the users say. This gives developers a feedback loop from production data to offline testing and development. Some customers have found this to be our most unique advantage.
Lastly, to make the most out of evaluation, you should use a diverse dataset, ideally with ground truth labels for comprehensive and consistent assessment. Because ground truth labels are costly and time-consuming to curate manually, we also have a synthetic data generation pipeline that allows you to get started quickly. Try it here: https://www.relari.ai/#synthetic_data_demo
What’s been your experience testing and iterating LLM apps? Please let us know your thoughts and feedback on our approaches (modular framework, leveraging user feedback, testing with synthetic data).
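The component-level diagnosis described in the post above (comparing how many relevant chunks survive the retriever versus the reranker) can be illustrated by applying one recall metric to each module's output. This is a sketch, not the continuous-eval API: the `recall` helper, the module names, and the chunk IDs are made-up illustration data.

```python
# Sketch of component-level evaluation: the same recall metric measured
# after each pipeline module. Not the continuous-eval API; the module
# names and chunk IDs are made-up illustration data.

def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that are present in `retrieved`."""
    if not relevant:
        return 1.0
    return len(retrieved & relevant) / len(relevant)


relevant = {"c1", "c2", "c3", "c4", "c5"}

# Output of each module for one query, in pipeline order.
module_outputs = {
    "retriever": {"c1", "c2", "c3", "c4", "x1", "x2"},
    "reranker": {"c1", "c2", "x1"},
}

per_module = {name: recall(chunks, relevant) for name, chunks in module_outputs.items()}

for name, score in per_module.items():
    print(f"{name}: recall={score:.2f}")

# The module with the largest recall drop relative to its predecessor
# is the first place to look.
names = list(per_module)
drops = {
    names[i]: per_module[names[i - 1]] - per_module[names[i]]
    for i in range(1, len(names))
}
worst = max(drops, key=drops.get)
print(f"largest drop at: {worst}")
```

With this sample data the retriever keeps four of the five relevant chunks while the reranker keeps only two, so the drop is attributed to the reranker, mirroring the customer example in the post.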

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python llmops related posts

Building an Email Assistant Application with Burr
6 projects | dev.to | 26 Apr 2024
Evaluation of OpenAI Assistants
1 project | dev.to | 9 Apr 2024
Show HN: Burr: An OS Framework for Building and Debugging GenAI Apps Faster
2 projects | news.ycombinator.com | 3 Apr 2024
Show HN: Ragas – the de facto open-source standard for evaluating RAG pipelines
4 projects | news.ycombinator.com | 21 Mar 2024
Show HN: Dealing with Claude 3 XML function calling so you don't have to
2 projects | news.ycombinator.com | 20 Mar 2024
Geniusrise – Wannabe Competitor to Vertex AI, Azure AI Studio and Bedrock
1 project | news.ycombinator.com | 15 Mar 2024
Show HN: Prompts as (WASM) Programs
9 projects | news.ycombinator.com | 11 Mar 2024

Index

What are some of the best open-source llmops projects in Python? This list will help you:

Rank	Project	Stars
1	jina	20,009
2	vllm	18,041
3	OpenLLM	8,733
4	BentoML	6,537
5	ragflow	5,516
6	ragas	4,549
7	zenml	3,657
8	phidata	3,622
9	giskard	3,111
10	llm-app	2,479
11	uptrain	1,976
12	openllmetry	1,224
13	LLMStack	1,089
14	lanarky	939
15	agenta	823
16	llm-guard	821
17	langcorn	812
18	NeumAI	774
19	DataDreamer	632
20	llmflows	615
21	burr	410
22	cognita	349
23	continuous-eval	302