- litellm: Call all LLM APIs using the OpenAI format. Use Bedrock, Azure, OpenAI, Cohere, Anthropic, Ollama, Sagemaker, HuggingFace, Replicate (100+ LLMs).
- WorkOS: The modern identity platform for B2B SaaS. The APIs are flexible and easy to use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
- promptfoo: Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models, with CI/CD integration.
- LocalAI: :robot: The free, open-source OpenAI alternative. Self-hosted, community-driven, and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware; no GPU required. Runs gguf, transformers, diffusers, and many more model architectures. Generates text, audio, video, and images, with voice cloning capabilities.
- evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- GodMode: AI Chat Browser: fast, full webapp access to ChatGPT / Claude / Bard / Bing / Llama2! I use this 20 times a day.
This is very cool. Sorry if I missed it (poked around the site and your GitHub repo), but is the script available anywhere?
Would love to try running this against a series of open-source models with different quantization levels using Ollama and a 192GB M2 Ultra Mac studio: https://github.com/jmorganca/ollama#model-library
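For anyone who wants to script that comparison themselves, here is a minimal sketch that sends the same prompt to several Ollama model tags over its local REST API and prints a rough tokens/sec figure. The quantization tags are illustrative assumptions; check the Ollama model library for the exact names.

```python
# Sketch: run the same prompt against several Ollama models at different
# quantization levels and compare output plus rough throughput.
# Assumes Ollama is running locally on its default port; the tags below
# are illustrative and should be checked against the model library.
import requests

MODELS = ["llama2:7b-q4_0", "llama2:7b-q8_0", "llama2:13b-q4_0"]  # assumed tags
PROMPT = "Explain the difference between a list and a tuple in Python."

for model in MODELS:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds
    tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"=== {model} ({tokens_per_sec:.1f} tok/s) ===")
    print(data["response"][:500])
```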
Here's the template I'm using - https://github.com/BerriAI/litellm/blob/5ca8b23e22139a4f49bd...
Anything I'm doing incorrectly?
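For context, litellm drives every provider through the same OpenAI-style completion() call, so a local Ollama model can be exercised in a couple of lines. A minimal sketch; the "ollama/llama2" model string and the local api_base are assumptions about the setup:

```python
# Sketch: calling a local Ollama model through litellm's OpenAI-style interface.
# The model string and api_base are assumptions about the local setup.
from litellm import completion

response = completion(
    model="ollama/llama2",
    messages=[{"role": "user", "content": "Write a haiku about benchmarks."}],
    api_base="http://localhost:11434",
)
print(response.choices[0].message.content)
```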
In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.
I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.
This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, and any model on Ollama out of the box. As an example, I wrote up a benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
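To make the "own data" point concrete: the snippet below is not promptfoo itself, just a plain-Python sketch of running a handful of your own examples against several models (via litellm, as above) with a crude substring check. The model identifiers and test cases are placeholders.

```python
# Plain-Python illustration of evaluating several models on your *own*
# examples instead of relying on a general benchmark. Model names and
# the pass/fail check are placeholders.
from litellm import completion

MODELS = ["gpt-3.5-turbo", "ollama/llama2", "claude-2"]  # assumed identifiers
TEST_CASES = [
    # (prompt, substring the answer is expected to contain) -- hypothetical
    ("What is the capital of France?", "Paris"),
    ("Return the JSON object {\"ok\": true} and nothing else.", "\"ok\""),
]

for model in MODELS:
    passed = 0
    for prompt, expected in TEST_CASES:
        out = completion(model=model, messages=[{"role": "user", "content": prompt}])
        text = out.choices[0].message.content
        passed += int(expected in text)
    print(f"{model}: {passed}/{len(TEST_CASES)} passed")
```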
ChainForge has similar functionality for comparing models: https://github.com/ianarawjo/ChainForge
LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI
Is it necessary to have an HTTP API for each model in a comparative study?
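For reference on the LocalAI route: because it speaks the OpenAI wire format, the standard OpenAI Python client (or any harness built on it) can be pointed at a single local endpoint instead of a per-model adapter. A minimal sketch; the port and model name depend on how LocalAI was started:

```python
# Sketch of pointing the standard OpenAI client at a LocalAI server.
# The base_url/port and the model name are assumptions about the local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="ggml-gpt4all-j",  # whichever model LocalAI has loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```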
Additional benchmarks:
- "TheoremQA: A Theorem-driven Question Answering dataset" (2023) https://github.com/wenhuchen/TheoremQA#leaderboard
- legalbench
And I made https://github.com/smol-ai/GodMode, which also includes the closed-source LLMs.
Thanks for sharing, looks interesting!
I've actually been using a similar LLM evaluation tool called Arthur Bench: https://github.com/arthur-ai/bench
It has some great scoring methods built in and a nice UI on top of it as well.
This is really cool!
I've been using this auditor tool that some friends at Fiddler created: https://github.com/fiddler-labs/fiddler-auditor
They went with a LangChain interface for custom evals, which I really like. I'm curious to hear if anyone has tried both of these. What's been your key takeaway?