- litellm: Call all LLM APIs using the OpenAI format. Use Bedrock, Azure, OpenAI, Cohere, Anthropic, Ollama, Sagemaker, HuggingFace, Replicate (100+ LLMs).
- WorkOS: The modern identity platform for B2B SaaS. The APIs are flexible and easy to use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
- promptfoo: Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models, with CI/CD integration.
- LocalAI: :robot: The free, open-source OpenAI alternative. Self-hosted, community-driven, and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware; no GPU required. Runs gguf, transformers, diffusers, and many more model architectures. Generates text, audio, video, and images, with voice cloning capabilities.
- evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- GodMode: AI Chat Browser: fast, full webapp access to ChatGPT / Claude / Bard / Bing / Llama2! I use this 20 times a day.
This is very cool. Sorry if I missed it (poked around the site and your GitHub repo), but is the script available anywhere?
Would love to try running this against a series of open-source models with different quantization levels using Ollama and a 192GB M2 Ultra Mac studio: https://github.com/jmorganca/ollama#model-library
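For anyone who wants to script that comparison themselves, here is a minimal sketch that sends the same prompt to several Ollama model tags over its local REST API and prints a rough tokens/sec figure. The quantization tags are illustrative assumptions; check the Ollama model library for the exact names.

```python
# Sketch: run the same prompt against several Ollama models at different
# quantization levels and compare output plus rough throughput.
# Assumes Ollama is running locally on its default port; the tags below
# are illustrative and should be checked against the model library.
import requests

MODELS = ["llama2:7b-q4_0", "llama2:7b-q8_0", "llama2:13b-q4_0"]  # assumed tags
PROMPT = "Explain the difference between a list and a tuple in Python."

for model in MODELS:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds
    tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"=== {model} ({tokens_per_sec:.1f} tok/s) ===")
    print(data["response"][:500])
```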
Here's the template I'm using - https://github.com/BerriAI/litellm/blob/5ca8b23e22139a4f49bd...
Anything I'm doing incorrectly?
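For context, litellm drives every provider through the same OpenAI-style completion() call, so a local Ollama model can be exercised in a couple of lines. A minimal sketch; the "ollama/llama2" model string and the local api_base are assumptions about the setup:

```python
# Sketch: calling a local Ollama model through litellm's OpenAI-style interface.
# The model string and api_base are assumptions about the local setup.
from litellm import completion

response = completion(
    model="ollama/llama2",
    messages=[{"role": "user", "content": "Write a haiku about benchmarks."}],
    api_base="http://localhost:11434",
)
print(response.choices[0].message.content)
```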
In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.
I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.
This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, and any model on Ollama out of the box. As an example, I wrote up a benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
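To make the "own data" point concrete: the snippet below is not promptfoo itself, just a plain-Python sketch of running a handful of your own examples against several models (via litellm, as above) with a crude substring check. The model identifiers and test cases are placeholders.

```python
# Plain-Python illustration of evaluating several models on your *own*
# examples instead of relying on a general benchmark. Model names and
# the pass/fail check are placeholders.
from litellm import completion

MODELS = ["gpt-3.5-turbo", "ollama/llama2", "claude-2"]  # assumed identifiers
TEST_CASES = [
    # (prompt, substring the answer is expected to contain) -- hypothetical
    ("What is the capital of France?", "Paris"),
    ("Return the JSON object {\"ok\": true} and nothing else.", "\"ok\""),
]

for model in MODELS:
    passed = 0
    for prompt, expected in TEST_CASES:
        out = completion(model=model, messages=[{"role": "user", "content": prompt}])
        text = out.choices[0].message.content
        passed += int(expected in text)
    print(f"{model}: {passed}/{len(TEST_CASES)} passed")
```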
ChainForge has similar functionality for comparing models: https://github.com/ianarawjo/ChainForge
LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI
Is it necessary to have an HTTP API for each model in a comparative study?
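For reference on the LocalAI route: because it speaks the OpenAI wire format, the standard OpenAI Python client (or any harness built on it) can be pointed at a single local endpoint instead of a per-model adapter. A minimal sketch; the port and model name depend on how LocalAI was started:

```python
# Sketch of pointing the standard OpenAI client at a LocalAI server.
# The base_url/port and the model name are assumptions about the local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="ggml-gpt4all-j",  # whichever model LocalAI has loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```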
Additional benchmarks:
- "TheoremQA: A Theorem-driven Question Answering dataset" (2023) https://github.com/wenhuchen/TheoremQA#leaderboard
- legalbench
And I made https://github.com/smol-ai/GodMode, which also includes the closed-source LLMs.
Thanks for sharing, looks interesting!
I've actually been using a similar LLM evaluation tool called Arthur Bench: https://github.com/arthur-ai/bench
It has some great scoring methods built in and a nice UI on top of it as well.
This is really cool!
I've been using this auditor tool that some friends at Fiddler created: https://github.com/fiddler-labs/fiddler-auditor
They went with a LangChain interface for custom evals, which I really like. I'm curious to hear if anyone has tried both of these. What's been your key takeaway?