lm-evaluation-harness

A framework for few-shot evaluation of language models. (by EleutherAI)

Lm-evaluation-harness Alternatives

Similar projects and alternatives to lm-evaluation-harness

  1. llama.cpp

    LLM inference in C/C++

  3. textgen

    Open-source desktop app for local LLMs. Text, vision, tool-calling, OpenAI/Anthropic-compatible API. 100% private.

  4. transformers

    🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

  5. materials

    214 lm-evaluation-harness VS materials

    Bonus materials, exercises, and example projects for our Python tutorials

  6. koboldcpp

    Run GGUF models easily with a KoboldAI UI. One File. Zero Install.

  7. vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

  8. gpt-neo

    (Discontinued) An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.

  9. ggml

    Tensor library for machine learning

  10. google-search-results-nodejs

    SerpApi client library for Node.js. Previously: Google Search Results Node.js.

  11. gpt-neox

    An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

  12. BIG-bench

    Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

  13. mesh-transformer-jax

    Model parallel transformers in JAX and Haiku

  14. StableLM

    43 lm-evaluation-harness VS StableLM

    StableLM: Stability AI Language Models

  15. mach-nix

    Create highly reproducible python environments

  16. aitextgen

    A robust Python tool for text-based AI training and generation using GPT-2.

  17. WireViz

    Easily document cables and wiring harnesses.

  18. allennlp

    (Discontinued) An open-source NLP research library, built on PyTorch.

  19. kiri

    Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models. (by kiri-ai)

  20. dm-haiku

    JAX-based neural network library

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a better lm-evaluation-harness alternative or higher similarity.

lm-evaluation-harness reviews and mentions

Posts with mentions or reviews of lm-evaluation-harness. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2026-06-03.
  • What is an LLM evaluation harness? A deep dive into lm-eval-harness
    3 projects | dev.to | 3 Jun 2026
    EleutherAI started the project in 2020 as a unified way to reproduce published LLM benchmark numbers. It's now at v0.4.12 (May 2026), ships with 200+ tasks spanning reasoning, knowledge, coding, math, multilingual, and long-context benchmarks, and supports a long list of model backends: Hugging Face transformers, vLLM, SGLang, GPT-NeoX, Megatron-DeepSpeed, plus API endpoints for OpenAI, Anthropic, and a few others.
  • EleutherAI / Lm-Evaluation-Harness
    1 project | news.ycombinator.com | 13 May 2026
  • War Story: We Migrated from Hugging Face Inference API to Self-Hosted LLMs and Cut Latency by 60%
    2 projects | dev.to | 27 Apr 2026
    Quantization is the single biggest lever for reducing self-hosted LLM latency and cost, but not all quantization methods are created equal. In our migration, we tested 12 quantization schemes for Llama 3 8B: FP16 (baseline), INT8, AWQ 4-bit, GPTQ 4-bit, Q4_K_M, Q5_K_S, and 6 others. We measured three metrics for each: inference latency (p99), accuracy (on our internal transaction categorization dataset of 10k labeled samples), and memory usage per GPU. AWQ 4-bit delivered the best balance: 1.8x faster inference than FP16, 2.3x lower memory usage (allowing us to fit the model on 8xA100s with 20% headroom), and only 0.7% accuracy loss vs FP16. GPTQ 4-bit was 12% slower than AWQ for our workload, and Q4_K_M had 2.1% accuracy loss, which was unacceptable for fintech compliance. The key mistake we made early on was assuming "4-bit quantization" was a commodity: each scheme has different tradeoffs for attention layers, KV cache handling, and activation quantization. Always run a benchmark on your actual workload before choosing. We used the lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) to automate accuracy testing, and the vLLM benchmark script (included earlier) for latency. Never trust vendor-provided quantization benchmarks: they rarely match real-world production workloads with variable prompt lengths and concurrent requests. If you're using larger models like Llama 3 70B, FP8 quantization is now production-ready in vLLM 0.4.3 and delivers near-FP16 accuracy with 30% lower memory usage.
  • Tech Trend Blog list over 200 blogs
    4 projects | dev.to | 6 Sep 2025
  • Ask HN: Who is hiring? (September 2025)
    12 projects | news.ycombinator.com | 1 Sep 2025
    I am a prev:SWE Intern at AWS India(was extended return offer), in the Q for Business Team, where I work on integration of BidirectionalStreaming using TTS models to integrate low latency voice mode usage in our application; and an active contributor to Pickle's Glass[ https://github.com/pickle-com/glass ], and HyprNote[ https://www.linkedin.com/company/hyprnote/ ].

    My previous experiences include:

    Experience: WorldQuant (Quant Research), AstraZeneca (ML engineer for drug discovery using LLMs), TransHumanity(AI Engineering solving Traffic), Alma [ https://www.linkedin.com/company/tryalma/ ] (AI engineering to simplify the visa process for immigration), Univ. of Missouri (biomedical GraphRAG research).

    Competitions: Top 0.016% in Amazon ML Challenge 2024 (12th/75,000 teams) and ranked 11th/50,000 teams in Amazon HackOn 2024, UC Berkeley AGENTX-2025.

    Open Source: EleutherAI's LMEvaluationHarness[ https://github.com/EleutherAI/lm-evaluation-harness ], HuggingFace's nanoVLM( https://github.com/huggingface/nanoVLM )[integrating metrics and GRPO training techniques], Pickle's Glass[added search functionality, adding knowledge base support for enterprises] , HyprNote[ https://github.com/fastrepl/hyprnote ] [removing silent responses and quantizing models for effective summarization]

    I've attached all my profile links, hereby.

  • Revival Hijacking: How Deleted PyPI Packages Become Threats
    3 projects | dev.to | 2 Aug 2025
    Okay, let’s go back to the exploitation. I took most downloadable deleted packages. Some of them I validated manually via GitHub search, and even found possibilities of dependency confusion in popular projects that were fixed after messaging them. Some of them I took only by their download count without validation.
  • Complete Large Language Model (LLM) Learning Roadmap
    14 projects | dev.to | 11 Apr 2025
    Resource: EleutherAI
  • Generative AI: A Personal Deep Dive – My Notes and Insights Part-2
    7 projects | dev.to | 21 Dec 2024
    A framework for few-shot evaluation of language models.
  • Mistral AI Launches New 8x22B Moe Model
    4 projects | news.ycombinator.com | 9 Apr 2024
    The easiest way is to use vllm (https://github.com/vllm-project/vllm) to run it on a couple of A100s, and you can benchmark it using this library (https://github.com/EleutherAI/lm-evaluation-harness)
  • Show HN: Times faster LLM evaluation with Bayesian optimization
    6 projects | news.ycombinator.com | 13 Feb 2024
    Fair question.

    "Evaluate" refers to the phase after training, where you check whether the training worked.

    Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're finetuning on a small domain-specific subset)!

    So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple computers, but none of them takes advantage of the fact that many evaluation queries might be similar; they all evaluate on every given query. And that's where this project might come in handy.

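
Several of the posts above describe benchmarking a model with lm-evaluation-harness. As a minimal sketch, assuming the `lm_eval` command-line entry point and the flag names documented in the project's README (check `lm_eval --help` for your installed version, since they may change), one run could be assembled as follows. The helper `build_lm_eval_command`, the model `EleutherAI/pythia-160m`, and the task `hellaswag` are illustrative choices, not taken from the posts.

```python
import shlex


def build_lm_eval_command(backend, pretrained, tasks, num_fewshot=0, batch_size=8):
    """Return the argv list for one lm-evaluation-harness run.

    Hypothetical helper for illustration; the flag names are assumed to
    match the `lm_eval` CLI as documented in the project's README.
    """
    return [
        "lm_eval",
        "--model", backend,                          # e.g. "hf" or "vllm"
        "--model_args", f"pretrained={pretrained}",  # model identifier or local path
        "--tasks", ",".join(tasks),                  # comma-separated task names
        "--num_fewshot", str(num_fewshot),           # few-shot examples per prompt
        "--batch_size", str(batch_size),
    ]


# Illustrative values only: a small Hugging Face model and a single task.
command = build_lm_eval_command("hf", "EleutherAI/pythia-160m", ["hellaswag"])
print(shlex.join(command))
# prints: lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks hellaswag --num_fewshot 0 --batch_size 8
```

The sketch only builds and prints the command. Passing `command` to `subprocess.run` would launch the evaluation, provided the harness is installed and the model can be downloaded.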

Stats

Basic lm-evaluation-harness repo stats
Mentions: 42
Stars: 12,818
Activity: 9.4
Last commit: 7 days ago
