Projects mentioned in this thread:
- promptfoo: Test your prompts, models, and RAG pipelines. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local and private models, with CI/CD integration.
- langkit: An open-source toolkit for monitoring Large Language Models (LLMs). Extracts signals from prompts and responses for safety and security; features include text quality, relevance metrics, and sentiment analysis. A comprehensive tool for LLM observability.
- evals: A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
I'm building and using promptfoo to iterate on prompts in production. I am responsible for multiple LLM apps with hundreds of thousands of DAU total: https://github.com/promptfoo/promptfoo
It boils down to defining a set of representative test cases and using them to guide prompting. I tend to prefer programmatic test cases over LLM-based evals, though LLM graders seem popular these days. Then I form a hypothesis, run an eval, and if the results show improvement, I share it with the team. In some of my projects, this is integrated with CI.
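To make "programmatic test cases" concrete, here's a minimal sketch of the idea (this is not promptfoo itself; the prompt template, test data, and call_model hook are hypothetical stand-ins for your real provider and cases):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    name: str
    prompt_vars: dict                 # variables interpolated into the prompt template
    check: Callable[[str], bool]      # programmatic assertion on the model output

PROMPT_TEMPLATE = "Summarize this support ticket in one sentence: {ticket}"

# Hypothetical cases; in practice these come from representative real traffic.
CASES = [
    TestCase(
        name="refund_request",
        prompt_vars={"ticket": "My order #123 arrived broken, I want my money back."},
        check=lambda out: "refund" in out.lower() and len(out) < 200,
    ),
    TestCase(
        name="no_leaked_order_id",
        prompt_vars={"ticket": "The app crashes when I open settings."},
        check=lambda out: "#123" not in out,  # must not leak unrelated details
    ),
]

def run_eval(call_model: Callable[[str], str]) -> float:
    """Run every case through the model and return the pass rate."""
    passed = 0
    for case in CASES:
        output = call_model(PROMPT_TEMPLATE.format(**case.prompt_vars))
        ok = case.check(output)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case.name}")
    return passed / len(CASES)

if __name__ == "__main__":
    # Plug in your real provider (OpenAI, a local model, etc.) here.
    fake_model = lambda prompt: "We're sorry; a refund has been issued."
    print(f"pass rate: {run_eval(fake_model):.0%}")
```

Because the assertions are plain functions, the same eval runs identically on a laptop and in CI, and a prompt change becomes a diff plus a pass-rate delta.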
The next step is closing the feedback loop and gathering real-world examples for your evals. This can be difficult to do while respecting your users' privacy, which is why I prefer a local, open-source CLI. You'll still have to set up the appropriate opt-ins and consent flows to gather this data, if you gather it at all.
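As one illustration of that feedback loop (the file name and function here are hypothetical, and real consent handling is on you), a low-tech approach is to append opted-in interactions to a local JSONL file that later seeds new eval cases:

```python
import json
from datetime import datetime, timezone

FEEDBACK_LOG = "eval_candidates.jsonl"

def record_interaction(prompt: str, response: str, user_opted_in: bool) -> None:
    """Append a production interaction to a local dataset for future evals.

    Records nothing unless the user has explicitly opted in, and the data
    stays on disk you control rather than going to a third-party service.
    """
    if not user_opted_in:
        return
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
    }
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```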
Would love to hear feedback and thoughts on how people approach monitoring LLMs in production in real-world applications! It's an area that I think not enough people talk about when operating LLMs.
We spent a lot of time working with various companies on GenAI use cases before LLMs were a thing, and captured what we learned in our library, LangKit. It's designed to be generic and pluggable into many different systems, including langchain: https://github.com/whylabs/langkit/. It goes beyond prompt engineering and aims to provide automated ways to monitor LLMs once they're deployed. Happy to answer any questions here!
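For a sense of the basic flow, usage looks roughly like this; a sketch based on LangKit's documented whylogs integration, so exact module names and output columns may vary by version:

```python
# pip install langkit whylogs pandas
import whylogs as why
from langkit import llm_metrics  # bundles text quality, relevance, sentiment, etc.

# Build a whylogs schema with LangKit's LLM metrics attached.
schema = llm_metrics.init()

# Profile a prompt/response pair. The profile stores metric distributions
# rather than raw text, which helps with privacy when monitoring production.
results = why.log(
    {
        "prompt": "How do I reset my password?",
        "response": "Go to Settings > Account > Reset password.",
    },
    schema=schema,
)

# Inspect the aggregated metrics locally, or ship the profile to a backend.
print(results.view().to_pandas().head())
```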
OpenAI open-sourced their evals framework. You can use it to evaluate not only different models but also your entire prompt-chain setup: https://github.com/openai/evals
It also ships with a built-in registry of evals.
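To give a sense of the shape: a basic eval in that framework is a JSONL file of samples (chat-style input plus an ideal answer) along with a small registry YAML entry, then you run it with the oaieval CLI. A sketch of generating the samples file; the file name and eval id below are hypothetical:

```python
import json

# Each sample pairs a chat-style input with the ideal answer, the format
# used by the framework's basic match-style evals.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
]

with open("my_eval_samples.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# After registering the eval in the registry YAML, run it with, e.g.:
#   oaieval gpt-3.5-turbo my-eval
```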