I asked 60 LLMs a set of 20 questions

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

WorkOS - The modern identity platform for B2B SaaS
The APIs are flexible and easy to use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
workos.com
featured
InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
  • ollama

    Get up and running with Llama 3, Mistral, Gemma, and other large language models.

  • This is very cool. Sorry if I missed it (poked around the site and your GitHub repo), but is the script available anywhere?

    Would love to try running this against a series of open-source models with different quantization levels using Ollama and a 192GB M2 Ultra Mac studio: https://github.com/jmorganca/ollama#model-library

  • litellm

    Call all LLM APIs using the OpenAI format. Use Bedrock, Azure, OpenAI, Cohere, Anthropic, Ollama, Sagemaker, HuggingFace, Replicate (100+ LLMs)
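To make "the OpenAI format" concrete: every provider is addressed with the same role-tagged messages list, so switching backends is a one-string change to the model name. The sketch below illustrates only the request shape, not litellm's internals; the model names are examples.

```python
# Illustrative sketch of the unified "OpenAI format": the same
# role-tagged message list is sent to every provider, and only the
# model identifier changes.

def build_request(model: str, messages: list[dict]) -> dict:
    """Assemble a provider-agnostic chat request in OpenAI's shape."""
    return {"model": model, "messages": messages}

messages = [{"role": "user", "content": "Name three prime numbers."}]

# The identical payload shape works across providers (example names):
requests = [
    build_request(model, messages)
    for model in ("gpt-3.5-turbo", "claude-2", "ollama/llama2")
]
```

With a unified shape like this, a benchmarking script can loop over dozens of models without per-provider branching.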

  • Here's the template I'm using - https://github.com/BerriAI/litellm/blob/5ca8b23e22139a4f49bd...

    Anything I'm doing incorrectly?

  • promptfoo

    Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.

  • In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

    I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.

This library supports OpenAI, Anthropic, Google, Llama and CodeLlama, any model on Replicate, any model on Ollama, and more out of the box. As an example, I wrote up a benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
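For concreteness, a promptfoo run is driven by a YAML config along these lines. The provider names, prompt, and assertion below are illustrative placeholders, not taken from the original post:

```yaml
# promptfooconfig.yaml (illustrative)
prompts:
  - "Answer concisely: {{question}}"

providers:
  - openai:gpt-3.5-turbo
  - ollama:llama2

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
```

This mirrors the advice above: the `tests` section holds your own data and examples, so the comparison reflects your application rather than a generic benchmark.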

  • ChainForge

    An open-source visual programming environment for battle-testing prompts to LLMs.

ChainForge has similar functionality for comparing: https://github.com/ianarawjo/ChainForge

    LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI

    Is it necessary to have an HTTP API for each model in a comparative study?

  • LocalAI

    The free, open-source OpenAI alternative. Self-hosted, community-driven, and local-first. Drop-in replacement for OpenAI that runs on consumer-grade hardware; no GPU required. Runs gguf, transformers, diffusers, and many more model architectures, and can generate text, audio, video, and images, with voice-cloning capabilities.
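A minimal sketch of what "drop-in replacement" means in practice: the same OpenAI-style payload is POSTed to a local endpoint. This assumes a LocalAI server is already listening on localhost:8080, and the model name here is just a placeholder for whatever you have loaded locally.

```python
# Sketch of querying a LocalAI server via its OpenAI-compatible API.
# Assumptions: server running on localhost:8080; model name is an example.
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """OpenAI-style chat payload, which LocalAI accepts as a drop-in."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_local_model(prompt: str, model: str = "my-local-model",
                    base_url: str = "http://localhost:8080") -> str:
    """POST to the /v1/chat/completions endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the wire format matches OpenAI's, existing OpenAI client code can usually be pointed at the local server just by changing the base URL.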

  • TheoremQA

    The dataset and code for the paper "TheoremQA: A Theorem-driven Question Answering dataset".

  • Additional benchmarks:

    - "TheoremQA: A Theorem-driven Question Answering dataset" (2023) https://github.com/wenhuchen/TheoremQA#leaderboard

    - legalbench

  • evals

    Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

  • GodMode

    AI Chat Browser: Fast, Full webapp access to ChatGPT / Claude / Bard / Bing / Llama2! I use this 20 times a day.

  • And I made https://github.com/smol-ai/GodMode, which also includes the closed-source LLMs.

  • bench

    A tool for evaluating LLMs (by arthur-ai)

  • Thanks for sharing, looks interesting!

    I've actually been using a similar LLM evaluation tool called Arthur Bench: https://github.com/arthur-ai/bench

    Some great scoring methods are built in, and there's a nice UI on top of it as well.

  • fiddler-auditor

    Fiddler Auditor is a tool to evaluate language models.

  • This is really cool!

    I've been using this auditor tool that some friends at Fiddler created: https://github.com/fiddler-labs/fiddler-auditor

    They went with a LangChain interface for custom evals, which I really like. Has anyone tried both of these? What's been your key takeaway?

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives, so a higher number means a more popular project.

Related posts

  • GPT-4 Turbo with Vision is a step backwards for coding

    5 projects | news.ycombinator.com | 10 Apr 2024
  • News DataStax just bought our startup Langflow

    1 project | news.ycombinator.com | 4 Apr 2024
  • Ask HN: How are you testing your LLM applications?

    3 projects | news.ycombinator.com | 6 Feb 2024
  • Show HN: Use Custom GPTs on your website with this open-source project

    1 project | news.ycombinator.com | 27 Jan 2024
  • Speculate HN: How does OpenAI's GPT Builder work?

    1 project | news.ycombinator.com | 28 Dec 2023