Dataset of MMLU results broken down by task


lm-evaluation-harness

Mentions: 34 | Stars: 5,070 | Activity: 9.9 | Language: Python

A framework for few-shot evaluation of language models.

I am primarily looking for results of running the MMLU evaluation on modern large language models. I have been able to find some data at https://github.com/EleutherAI/lm-evaluation-harness/tree/master/results and will be asking the maintainers if and when they can provide any additional data.
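For readers who want a per-task breakdown from result files like those, here is a minimal Python sketch. It assumes each file is JSON with a top-level "results" object keyed by task name, that MMLU subtasks carry the prefix "hendrycksTest-", and that each entry has an "acc" field. Those layout details, the sample values, and the helper names are assumptions for illustration, so check them against the actual files before relying on this.

```python
import json

# Hypothetical sample mimicking the assumed file layout (not real scores).
SAMPLE = """
{
  "results": {
    "hendrycksTest-abstract_algebra": {"acc": 0.25},
    "hendrycksTest-anatomy": {"acc": 0.5},
    "arc_easy": {"acc": 0.75}
  }
}
"""

MMLU_PREFIX = "hendrycksTest-"  # assumed task-name prefix for MMLU subtasks


def mmlu_by_task(results_json: str, metric: str = "acc") -> dict[str, float]:
    """Return {subtask: score} for every MMLU subtask in one results file."""
    results = json.loads(results_json).get("results", {})
    return {
        name[len(MMLU_PREFIX):]: scores[metric]
        for name, scores in results.items()
        if name.startswith(MMLU_PREFIX) and metric in scores
    }


def macro_average(per_task: dict[str, float]) -> float:
    """Unweighted mean over subtasks, so every task counts equally."""
    if not per_task:
        raise ValueError("no MMLU subtasks found")
    return sum(per_task.values()) / len(per_task)


per_task = mmlu_by_task(SAMPLE)
for task, score in sorted(per_task.items()):
    print(f"{task}: {score:.3f}")
print(f"macro average: {macro_average(per_task):.3f}")
```

Keeping the per-task dictionary separate from the average makes it easy to see which subtasks a reported MMLU number actually covers.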

helm

Mentions: 2 | Stars: 1,654 | Activity: 9.8 | Language: Python

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287). (by stanford-crfm)

Looking at their GitHub repo, it also seems that the MMLU result comes from just those 5 tasks rather than all of them: https://github.com/stanford-crfm/helm/issues/1335


NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


This page summarizes the projects mentioned and recommended in the original post on /r/datasets. Post date: 6 Jul 2023
