SaaSHub helps you find the best software and product alternatives
Lm-evaluation-harness Alternatives
Similar projects and alternatives to lm-evaluation-harness
-
textgen
Open-source desktop app for local LLMs. Text, vision, tool-calling, OpenAI/Anthropic-compatible API. 100% private.
-
transformers
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
-
gpt-neo
Discontinued. An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
-
google-search-results-nodejs
SerpApi client library for Node.js. Previously: Google Search Results Node.js.
-
gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
-
BIG-bench
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
lm-evaluation-harness discussion
lm-evaluation-harness reviews and mentions
-
What is an LLM evaluation harness? A deep dive into lm-eval-harness
EleutherAI started the project in 2020 as a unified way to reproduce published LLM benchmark numbers. It's now at v0.4.12 (May 2026), ships with 200+ tasks spanning reasoning, knowledge, coding, math, multilingual, and long-context benchmarks, and supports a long list of model backends: Hugging Face transformers, vLLM, SGLang, GPT-NeoX, Megatron-DeepSpeed, plus API endpoints for OpenAI, Anthropic, and a few others.
- EleutherAI / Lm-Evaluation-Harness
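As a rough illustration of how a backend and a set of tasks are selected, the sketch below only assembles an `lm_eval` command line and prints it; nothing is executed. It assumes the `--model`, `--model_args`, `--tasks`, `--batch_size`, `--num_fewshot`, and `--output_path` flags documented for the v0.4 series (newer releases may have changed the CLI, so check `lm_eval --help`). The helper `build_lm_eval_command` is a hypothetical name, and the model and tasks are example choices.

```python
import shlex


def build_lm_eval_command(model, model_args, tasks, num_fewshot=None,
                          batch_size="auto", output_path=None):
    """Assemble an lm_eval invocation as an argument list (nothing is run)."""
    cmd = [
        "lm_eval",
        "--model", model,  # backend name, e.g. "hf" or "vllm"
        "--model_args", ",".join(f"{k}={v}" for k, v in model_args.items()),
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


# Example choices only: a small Hugging Face model and two common tasks.
cmd = build_lm_eval_command(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-160m"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would start the evaluation, assuming the package is installed and the model can be downloaded.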
-
War Story: We Migrated from Hugging Face Inference API to Self-Hosted LLMs and Cut Latency by 60%
Quantization is the single biggest lever for reducing self-hosted LLM latency and cost, but not all quantization methods are created equal. In our migration, we tested 12 quantization schemes for Llama 3 8B: FP16 (baseline), INT8, AWQ 4-bit, GPTQ 4-bit, Q4_K_M, Q5_K_S, and 6 others. We measured three metrics for each: inference latency (p99), accuracy (on our internal transaction categorization dataset of 10k labeled samples), and memory usage per GPU. AWQ 4-bit delivered the best balance: 1.8x faster inference than FP16, 2.3x lower memory usage (allowing us to fit the model on 8xA100s with 20% headroom), and only 0.7% accuracy loss vs FP16. GPTQ 4-bit was 12% slower than AWQ for our workload, and Q4_K_M had 2.1% accuracy loss, which was unacceptable for fintech compliance. The key mistake we made early on was assuming "4-bit quantization" was a commodity: each scheme has different tradeoffs for attention layers, KV cache handling, and activation quantization. Always run a benchmark on your actual workload before choosing. We used the lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) to automate accuracy testing, and the vLLM benchmark script (included earlier) for latency. Never trust vendor-provided quantization benchmarks: they rarely match real-world production workloads with variable prompt lengths and concurrent requests. If you're using larger models like Llama 3 70B, FP8 quantization is now production-ready in vLLM 0.4.3 and delivers near-FP16 accuracy with 30% lower memory usage.
- Tech Trend Blog list over 200 blogs
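The selection rule described above (reject any scheme whose accuracy loss is too high, then take the fastest of what remains) can be sketched as follows. This is an illustration, not the authors' tooling: `QuantResult` and `pick_scheme` are hypothetical names, the 1.0% threshold is an assumption, and the Q4_K_M speedup is a placeholder because the write-up does not report it. Only the AWQ figures (1.8x, 0.7%) and the Q4_K_M accuracy loss (2.1%) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantResult:
    name: str
    speedup_vs_fp16: float    # higher is faster
    accuracy_loss_pct: float  # percentage points lost relative to FP16


def pick_scheme(results, max_accuracy_loss_pct):
    """Return the fastest scheme whose accuracy loss is within budget."""
    eligible = [r for r in results if r.accuracy_loss_pct <= max_accuracy_loss_pct]
    if not eligible:
        raise ValueError("no scheme satisfies the accuracy budget")
    return max(eligible, key=lambda r: r.speedup_vs_fp16)


results = [
    QuantResult("FP16", 1.0, 0.0),       # baseline
    QuantResult("AWQ 4-bit", 1.8, 0.7),  # figures quoted in the write-up
    QuantResult("Q4_K_M", 2.0, 2.1),     # speedup is a placeholder, not measured data
]

best = pick_scheme(results, max_accuracy_loss_pct=1.0)  # threshold is an assumption
print(best.name)
```

Treating accuracy as a hard constraint rather than one term in a weighted score mirrors the compliance framing in the text: a faster scheme that misses the budget is never chosen.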
-
Ask HN: Who is hiring? (September 2025)
I was previously a SWE intern at AWS India (and was extended a return offer) on the Q for Business team, where I worked on integrating bidirectional streaming with TTS models to enable a low-latency voice mode in our application. I am also an active contributor to Pickle's Glass [ https://github.com/pickle-com/glass ] and HyprNote [ https://www.linkedin.com/company/hyprnote/ ].
My previous experience includes:
Experience: WorldQuant (quant research), AstraZeneca (ML engineer for drug discovery using LLMs), TransHumanity (AI engineering for traffic), Alma [ https://www.linkedin.com/company/tryalma/ ] (AI engineering to simplify the visa process for immigration), Univ. of Missouri (biomedical GraphRAG research).
Competitions: Top 0.016% in Amazon ML Challenge 2024 (12th/75,000 teams), ranked 11th/50,000 teams in Amazon HackOn 2024, and UC Berkeley AGENTX-2025.
Open Source: EleutherAI's LMEvaluationHarness [ https://github.com/EleutherAI/lm-evaluation-harness ], HuggingFace's nanoVLM [ https://github.com/huggingface/nanoVLM ] (integrating metrics and GRPO training techniques), Pickle's Glass (added search functionality, adding knowledge base support for enterprises), HyprNote [ https://github.com/fastrepl/hyprnote ] (removing silent responses and quantizing models for effective summarization).
I've attached all my profile links.
-
Revival Hijacking: How Deleted PyPI Packages Become Threats
Okay, let’s go back to the exploitation. I took the most-downloaded deleted packages. Some of them I validated manually via GitHub search, and I even found dependency confusion opportunities in popular projects, which were fixed after I messaged the maintainers. Others I picked by download count alone, without validation.
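From the defender's side, a minimal sketch of the same idea is to check a requirements file against a list of names known to have been deleted from PyPI. The function names and the example package name below are hypothetical, the deleted-name list is assumed to come from some external source, and the parsing is deliberately simplistic (it ignores URLs and environment markers). Name normalization follows PEP 503.

```python
import re


def normalize(name):
    """PEP 503 normalization: lowercase, runs of -, _ and . become one hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line):
    """Extract the normalized project name from one requirements.txt line."""
    line = line.split("#", 1)[0].strip()    # drop trailing comments
    if not line or line.startswith("-"):   # skip blanks and options like "-r"
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return normalize(match.group(0)) if match else None


def flag_deleted(requirements_text, deleted_names):
    """Return requirement names that appear in the deleted-name list."""
    deleted = {normalize(n) for n in deleted_names}
    flagged = []
    for line in requirements_text.splitlines():
        name = requirement_name(line)
        if name in deleted and name not in flagged:
            flagged.append(name)
    return flagged


# Hypothetical data: "Example_Removed.Pkg" stands in for a deleted project.
requirements = "requests>=2.31\nexample-removed-pkg==1.0  # pinned\n-r other.txt\n"
flagged = flag_deleted(requirements, ["Example_Removed.Pkg"])
print(flagged)
```

Normalizing both sides matters because `Example_Removed.Pkg` and `example-removed-pkg` resolve to the same project on the index.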
-
Complete Large Language Model (LLM) Learning Roadmap
Resource: EleutherAI
-
Generative AI: A Personal Deep Dive – My Notes and Insights Part-2
A framework for few-shot evaluation of language models.
-
Mistral AI Launches New 8x22B Moe Model
The easiest way is to use vLLM (https://github.com/vllm-project/vllm) to run it on a couple of A100s, and you can benchmark it using this library (https://github.com/EleutherAI/lm-evaluation-harness)
-
Show HN: Times faster LLM evaluation with Bayesian optimization
Fair question.
"Evaluate" refers to the phase after training, where you check whether the training went well.
Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're finetuning on a small domain-specific subset)!
There are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple computers, but none of them takes advantage of the fact that many evaluation queries might be similar; they all evaluate on every given query. And that's where this project might come in handy.
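To make the "many evaluation queries might be similar" point concrete, here is a deliberately simple sketch. It is not the Bayesian optimization approach of the project being discussed; it swaps in greedy near-duplicate grouping by token overlap, scores one representative per group, and weights that score by group size. All names, the 0.6 threshold, and the toy scorer are assumptions, and the estimate is only reasonable if similar queries tend to be answered equally well.

```python
def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pick_representatives(queries, threshold=0.6):
    """Greedily group similar queries; return (representative_index, group_size) pairs."""
    reps = []  # each entry: [index, token set, group size]
    for i, query in enumerate(queries):
        t = tokens(query)
        for rep in reps:
            if jaccard(t, rep[1]) >= threshold:
                rep[2] += 1  # close enough to an existing representative
                break
        else:
            reps.append([i, t, 1])  # no match: this query starts a new group
    return [(idx, size) for idx, _, size in reps]


def estimate_accuracy(queries, is_correct, threshold=0.6):
    """Score only the representatives and weight each by its group size."""
    reps = pick_representatives(queries, threshold)
    total = sum(size for _, size in reps)
    return sum(size * is_correct(queries[i]) for i, size in reps) / total


queries = [
    "what is the capital of france",
    "what is the capital of france today",
    "compute 17 times 23",
    "please compute 17 times 23",
]
calls = []


def toy_scorer(query):
    """Stand-in for an expensive model call plus grading."""
    calls.append(query)
    return "capital" in query


estimate = estimate_accuracy(queries, toy_scorer)
print(estimate, len(calls))
```

In this toy run the scorer is called twice for four queries; the saving on a real benchmark depends entirely on how redundant its queries are.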
-
Stats
EleutherAI/lm-evaluation-harness is an open source project licensed under the MIT License, which is an OSI-approved license.
The primary programming language of lm-evaluation-harness is Python.
Popular Comparisons
- lm-evaluation-harness VS opencompass
- lm-evaluation-harness VS BIG-bench
- lm-evaluation-harness VS gpt-neo
- lm-evaluation-harness VS vllm
- lm-evaluation-harness VS gpt-neox
- lm-evaluation-harness VS transformers
- lm-evaluation-harness VS StableLM
- lm-evaluation-harness VS llama.cpp
- lm-evaluation-harness VS koboldcpp
- lm-evaluation-harness VS aitextgen