Projects mentioned in this thread:
- promptfoo: Test your prompts, models, and RAG pipelines. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local and private models, with CI/CD integration.
- langkit: An open-source toolkit for monitoring Large Language Models (LLMs). Extracts signals from prompts and responses for safety and security; features include text quality, relevance metrics, and sentiment analysis. A comprehensive tool for LLM observability.
- evals: A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
I'm building and using promptfoo to iterate on prompts in production. I am responsible for multiple LLM apps with hundreds of thousands of DAU total: https://github.com/promptfoo/promptfoo
It boils down to defining a set of representative test cases and using them to guide prompting. I tend to prefer programmatic test cases over LLM-based evals, though LLM graders seem popular these days. Then I form a hypothesis, run an eval, and if the results show improvement, I share it with the team. In some of my projects, this is integrated with CI.
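To make "programmatic test cases" concrete, here's a minimal sketch of the idea (this is not promptfoo itself; the prompt template, test data, and call_model hook are hypothetical stand-ins for your real provider and cases):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    name: str
    prompt_vars: dict                 # variables interpolated into the prompt template
    check: Callable[[str], bool]      # programmatic assertion on the model output

PROMPT_TEMPLATE = "Summarize this support ticket in one sentence: {ticket}"

# Hypothetical cases; in practice these come from representative real traffic.
CASES = [
    TestCase(
        name="refund_request",
        prompt_vars={"ticket": "My order #123 arrived broken, I want my money back."},
        check=lambda out: "refund" in out.lower() and len(out) < 200,
    ),
    TestCase(
        name="no_leaked_order_id",
        prompt_vars={"ticket": "The app crashes when I open settings."},
        check=lambda out: "#123" not in out,  # must not leak unrelated details
    ),
]

def run_eval(call_model: Callable[[str], str]) -> float:
    """Run every case through the model and return the pass rate."""
    passed = 0
    for case in CASES:
        output = call_model(PROMPT_TEMPLATE.format(**case.prompt_vars))
        ok = case.check(output)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case.name}")
    return passed / len(CASES)

if __name__ == "__main__":
    # Plug in your real provider (OpenAI, a local model, etc.) here.
    fake_model = lambda prompt: "We're sorry; a refund has been issued."
    print(f"pass rate: {run_eval(fake_model):.0%}")
```

Because the assertions are plain functions, the same eval runs identically on a laptop and in CI, and a prompt change becomes a diff plus a pass-rate delta.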
The next step is closing the feedback loop and gathering real-world examples for your evals. This can be difficult to do while respecting your users' privacy, which is why I prefer a local, open-source CLI. You'll still have to set up the appropriate opt-ins and consent flows to gather this data, if you gather it at all.
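As one illustration of that feedback loop (the file name and function here are hypothetical, and real consent handling is on you), a low-tech approach is to append opted-in interactions to a local JSONL file that later seeds new eval cases:

```python
import json
from datetime import datetime, timezone

FEEDBACK_LOG = "eval_candidates.jsonl"

def record_interaction(prompt: str, response: str, user_opted_in: bool) -> None:
    """Append a production interaction to a local dataset for future evals.

    Records nothing unless the user has explicitly opted in, and the data
    stays on disk you control rather than going to a third-party service.
    """
    if not user_opted_in:
        return
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
    }
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```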
Would love to hear feedback and thoughts on how people approach monitoring LLMs in production in real-world applications! It's an area that I think not enough people talk about when operating LLMs.
We spent a lot of time working with various companies on GenAI use cases before LLMs were a thing, and captured what we learned in our library, LangKit. It's designed to be generic and pluggable into many different systems, including langchain: https://github.com/whylabs/langkit/. It goes beyond prompt engineering and aims to provide automated ways to monitor LLMs once they're deployed. Happy to answer any questions here!
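For a sense of the basic flow, usage looks roughly like this; a sketch based on LangKit's documented whylogs integration, so exact module names and output columns may vary by version:

```python
# pip install langkit whylogs pandas
import whylogs as why
from langkit import llm_metrics  # bundles text quality, relevance, sentiment, etc.

# Build a whylogs schema with LangKit's LLM metrics attached.
schema = llm_metrics.init()

# Profile a prompt/response pair. The profile stores metric distributions
# rather than raw text, which helps with privacy when monitoring production.
results = why.log(
    {
        "prompt": "How do I reset my password?",
        "response": "Go to Settings > Account > Reset password.",
    },
    schema=schema,
)

# Inspect the aggregated metrics locally, or ship the profile to a backend.
print(results.view().to_pandas().head())
```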
OpenAI open-sourced their evals framework. You can use it to evaluate not only different models but also your entire prompt-chain setup: https://github.com/openai/evals
It also ships with a built-in registry of evals.
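To give a sense of the shape: a basic eval in that framework is a JSONL file of samples (chat-style input plus an ideal answer) along with a small registry YAML entry, then you run it with the oaieval CLI. A sketch of generating the samples file; the file name and eval id below are hypothetical:

```python
import json

# Each sample pairs a chat-style input with the ideal answer, the format
# used by the framework's basic match-style evals.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
]

with open("my_eval_samples.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# After registering the eval in the registry YAML, run it with, e.g.:
#   oaieval gpt-3.5-turbo my-eval
```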