Benchmarks for Recent LLMs

This page summarizes the projects mentioned and recommended in the original post on /r/LocalLLaMA.

  • lm-evaluation-harness

    A framework for few-shot evaluation of language models.

  • Does anyone know of any up-to-date benchmarks for LLMs? I only know of one, and it hasn't been updated: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=741531996. I think this spreadsheet was probably made with this tool, https://github.com/EleutherAI/lm-evaluation-harness, and the language task datasets available there. It would be nice to have benchmarks for recently released LLMs, but the spreadsheet is view-only and does not allow community edits. Would such benchmarks be helpful for you? What is your favorite open-source LLM so far, and for which task?

  • LLMZoo

    ⚡LLM Zoo is a project that provides data, models, and evaluation benchmark for large language models.⚡

  • Missing: Vicuna, Dolly, BELLE, Phoenix, MOSS, and the models used by Open Assistant.

  • MOSS

    An open-source tool-augmented conversational language model from Fudan University

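The lm-evaluation-harness entry above is described as a framework for few-shot evaluation. As a rough illustration of what that term means, and not the harness's own API, here is a minimal Python sketch under stated assumptions: k solved examples are prepended to each test question, the model completes the prompt, and accuracy is the fraction of exact-match answers. The `Example` class, `build_prompt`, `few_shot_accuracy`, and the arithmetic `toy_model` are all hypothetical stand-ins for a real model and dataset.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    # Hypothetical question/answer pair; real benchmark tasks have richer formats.
    question: str
    answer: str


def build_prompt(shots: Sequence[Example], query: Example) -> str:
    """Concatenate k solved examples ("shots"), then the unsolved query."""
    parts = [f"Q: {ex.question}\nA: {ex.answer}" for ex in shots]
    parts.append(f"Q: {query.question}\nA:")
    return "\n\n".join(parts)


def few_shot_accuracy(
    model: Callable[[str], str],
    shots: Sequence[Example],
    test_set: Sequence[Example],
) -> float:
    """Fraction of test examples whose completion exactly matches the reference answer."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(
        model(build_prompt(shots, ex)).strip() == ex.answer for ex in test_set
    )
    return correct / len(test_set)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: answers the last 'a + b' question in the prompt."""
    last_question = prompt.rsplit("Q: ", 1)[1].split("\n", 1)[0]
    a, b = (int(term) for term in last_question.split("+"))
    return str(a + b)


shots = [Example("1 + 1", "2"), Example("2 + 3", "5")]  # the k = 2 in-context examples
test_set = [
    Example("4 + 4", "8"),
    Example("10 + 1", "12"),  # deliberately mislabeled so the toy model misses one
]
score = few_shot_accuracy(toy_model, shots, test_set)
print(f"2-shot accuracy: {score:.2f}")
```

Real evaluation frameworks typically offer other scoring schemes as well, for example ranking multiple-choice answers by the model's log-likelihood; exact string match is used here only to keep the sketch short.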

Related posts

  • Has anyone tried fine tuning on a dataset of complex tasks that require tool use?

    1 project | /r/LocalLLaMA | 5 May 2023
  • [D] Open-Source LLMs vs APIs

    2 projects | /r/MachineLearning | 25 Apr 2023
  • Run 70B LLM Inference on a Single 4GB GPU with This New Technique

    3 projects | news.ycombinator.com | 3 Dec 2023
  • UltraChat's License is now MIT

    1 project | news.ycombinator.com | 11 Oct 2023
  • Looks like there is a new model UltraLM that topped the AlpacaEval Leaderboard

    1 project | /r/LocalLLaMA | 29 Jun 2023