Ask HN: How are you improving your use of LLMs in production?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • promptfoo

    Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.

  • I'm building and using promptfoo to iterate on prompts in production. I am responsible for multiple LLM apps with hundreds of thousands of DAU total: https://github.com/promptfoo/promptfoo

    It boils down to defining a set of representative test cases and using them to guide prompting. I tend to prefer programmatic test cases over LLM-based evals, but LLM evals seem popular these days. Then I create a hypothesis, run an eval, and if the results show improvement, I share it with the team. In some of my projects, this is integrated with CI.
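
    A minimal sketch of what such a programmatic test case can look like, independent of promptfoo itself; the prompt template, model name, and checks are placeholder assumptions, and the OpenAI Python client stands in for whatever provider you actually use:

    ```python
    # Minimal sketch of programmatic test cases for a single prompt.
    # Prompt template, model name, and checks are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT_TEMPLATE = "Summarize the following support ticket in one sentence:\n\n{ticket}"

    # Representative test cases: an input plus a programmatic check on the output.
    TEST_CASES = [
        {"ticket": "My invoice from March was charged twice.",
         "check": lambda out: "invoice" in out.lower() or "charge" in out.lower()},
        {"ticket": "The mobile app crashes when I open settings.",
         "check": lambda out: "crash" in out.lower()},
    ]

    def run_eval(model: str = "gpt-3.5-turbo") -> float:
        """Return the fraction of test cases whose check passes."""
        passed = 0
        for case in TEST_CASES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user",
                           "content": PROMPT_TEMPLATE.format(ticket=case["ticket"])}],
            )
            output = resp.choices[0].message.content or ""
            passed += case["check"](output)
        return passed / len(TEST_CASES)

    if __name__ == "__main__":
        score = run_eval()
        print(f"pass rate: {score:.0%}")
        # In CI, fail the build if a prompt change regresses below a threshold.
        assert score >= 0.8, "prompt regression detected"
    ```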

    The next step is closing the feedback loop and gathering real-world examples for your evals. This can be difficult to do if you respect the privacy of your users, which is why I prefer a local, open-source CLI. You'll have to set up the appropriate opt-ins, etc., to gather this data, if you gather it at all.
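
    A rough illustration of that opt-in feedback loop, assuming a hypothetical helper your request handler calls; the opt-in flag, file path, and record fields are made up for the example and are not part of any specific tool:

    ```python
    # Sketch of closing the feedback loop: persist opted-in prompt/response
    # pairs locally so they can later be turned into eval cases.
    # The opt-in flag, file path, and record fields are assumptions.
    import json
    from pathlib import Path

    FEEDBACK_FILE = Path("eval_cases.jsonl")

    def record_example(prompt: str, response: str, user_opted_in: bool) -> None:
        """Append a production example to a local JSONL eval dataset, opt-in only."""
        if not user_opted_in:
            return  # respect users who have not consented to data collection
        record = {"prompt": prompt, "response": response}
        with FEEDBACK_FILE.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
    ```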

  • langkit

    🔍 LangKit: An open-source toolkit for monitoring Large Language Models (LLMs). 📚 Extracts signals from prompts & responses, ensuring safety & security. 🛡️ Features include text quality, relevance metrics, & sentiment analysis. 📊 A comprehensive tool for LLM observability. 👀

  • Would love to hear feedback and thoughts on how people approach monitoring LLMs in real-world production applications! It's an area that I think not enough people talk about when operating LLMs.

    We spent a lot of time working with various companies on GenAI use cases before LLMs were a thing, and we captured those lessons in our library, LangKit. It's designed to be generic and pluggable into many different systems, including LangChain: https://github.com/whylabs/langkit/. It goes beyond prompt engineering and aims to provide automated ways to monitor LLMs once they're deployed. Happy to answer any questions here!
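
    For context, a generic sketch of the kind of per-request signals such a monitoring layer extracts from prompts and responses; this is not LangKit's actual API, and the metrics are arbitrary stand-ins for its text-quality, relevance, and sentiment checks:

    ```python
    # Generic illustration of extracting monitoring signals from an LLM
    # request/response pair. Not LangKit's API; metric choices are arbitrary.
    import logging
    from dataclasses import dataclass, asdict

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("llm_monitoring")

    REFUSAL_MARKERS = ("i can't", "i cannot", "as an ai")

    @dataclass
    class RequestSignals:
        prompt_len: int                  # crude text-quality / cost proxy
        response_len: int
        looks_like_refusal: bool         # safety / UX signal
        prompt_response_overlap: float   # very rough relevance proxy

    def extract_signals(prompt: str, response: str) -> RequestSignals:
        p_tokens, r_tokens = set(prompt.lower().split()), set(response.lower().split())
        overlap = len(p_tokens & r_tokens) / max(len(p_tokens), 1)
        return RequestSignals(
            prompt_len=len(prompt),
            response_len=len(response),
            looks_like_refusal=any(m in response.lower() for m in REFUSAL_MARKERS),
            prompt_response_overlap=round(overlap, 3),
        )

    # In production these records would go to a metrics backend; here they are logged.
    signals = extract_signals("What is our refund policy?",
                              "Refunds are issued within 14 days of purchase.")
    log.info("llm request signals: %s", asdict(signals))
    ```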

  • evals

    Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

  • OpenAI open-sourced their evals framework. You can use it to evaluate different models as well as your entire prompt-chain setup: https://github.com/openai/evals

    They also have a registry of evals built in.
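
    As a rough example of feeding your own data into that framework, the snippet below writes a tiny JSONL dataset in the chat-formatted "input" / "ideal" shape used by the repo's basic match-style evals; the field names are taken from the repo's examples and may change, so verify them against the current docs. The eval itself is then registered via a YAML entry and run with the `oaieval` CLI.

    ```python
    # Sketch: build a small JSONL dataset for a basic exact-match eval in the
    # style of openai/evals. The "input"/"ideal" field names follow the repo's
    # examples; double-check them against the current documentation.
    import json

    samples = [
        {"input": [{"role": "system", "content": "Answer with a single word."},
                   {"role": "user", "content": "What is the capital of France?"}],
         "ideal": "Paris"},
        {"input": [{"role": "system", "content": "Answer with a single word."},
                   {"role": "user", "content": "What is the capital of Japan?"}],
         "ideal": "Tokyo"},
    ]

    with open("capitals.jsonl", "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")
    ```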

NOTE: The number of mentions on this list reflects mentions in common posts plus user-suggested alternatives; a higher number indicates a more popular project.
