Benchmarks for Recent LLMs

lm-evaluation-harness

34 5,070 9.9 Python

A framework for few-shot evaluation of language models.

Does anyone know of any up-to-date benchmarks for LLMs? The only one I know of is no longer updated: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=741531996. I think this spreadsheet may have been produced with https://github.com/EleutherAI/lm-evaluation-harness and the language task datasets available there. It would be nice to have benchmarks for recently released LLMs, but the spreadsheet is view-only and does not allow community edits. Would such benchmarks be helpful to you? What is your favorite open-source LLM so far, and for which task?
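For readers unfamiliar with the term, here is a minimal sketch of what "few-shot evaluation" means in general. It is not the lm-evaluation-harness API: the function names, the prompt format, and the toy model are all hypothetical and only illustrate the idea of prepending solved examples to each test question and scoring exact-match accuracy.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (question, expected answer)


def build_few_shot_prompt(shots: Sequence[Example], question: str) -> str:
    """Concatenate the solved examples, then append the unsolved question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def few_shot_accuracy(
    model: Callable[[str], str],
    shots: Sequence[Example],
    test_set: Sequence[Example],
) -> float:
    """Return the fraction of test questions answered with an exact match."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = 0
    for question, expected in test_set:
        prediction = model(build_few_shot_prompt(shots, question))
        correct += prediction.strip() == expected
    return correct / len(test_set)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: it answers "4" to every prompt.
    return "4"


shots = [("1 + 1 = ?", "2")]
test_set = [("2 + 2 = ?", "4"), ("3 + 3 = ?", "6")]
score = few_shot_accuracy(toy_model, shots, test_set)
print(score)
```

A real harness would swap `toy_model` for an actual model call and typically uses task-specific metrics rather than plain exact match.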

LLMZoo

6 2,866 8.5 Python

⚡LLM Zoo is a project that provides data, models, and evaluation benchmark for large language models.⚡

Missing Vicuna, Dolly, BELLE, phoenix, MOSS, the ones used by open assistant.

MOSS

4 11,819 8.5 Python

An open-source tool-augmented conversational language model from Fudan University

Missing Vicuna, Dolly, BELLE, phoenix, MOSS, the ones used by open assistant.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Has anyone tried fine tuning on a dataset of complex tasks that require tool use?

1 project | /r/LocalLLaMA | 5 May 2023

[D] Open-Source LLMs vs APIs

2 projects | /r/MachineLearning | 25 Apr 2023

Run 70B LLM Inference on a Single 4GB GPU with This New Technique

3 projects | news.ycombinator.com | 3 Dec 2023

UltraChat's License is now MIT

1 project | news.ycombinator.com | 11 Oct 2023

Looks like there is a new model UltraLM that topped the AlpacaEval Leaderboard

1 project | /r/LocalLLaMA | 29 Jun 2023

This page summarizes the projects mentioned and recommended in the original post on /r/LocalLLaMA.
Tags: chatgpt, Deep Learning, dialogue-systems, large-language-models, Natural Language Processing
Post date: 29 Apr 2023
