Top 23 Evaluation Open-Source Projects
write-you-a-haskell
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
I highly recommend https://github.com/sdiehl/write-you-a-haskell as it is very developer-friendly. It's not complete, but it really gets the gears turning and will set you up for writing your own Hindley-Milner style type checker.
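To give a flavor of what that entails, here is a minimal sketch of the unification step at the heart of a Hindley-Milner checker, written in Python for brevity (illustrative only, not code from the tutorial, which is in Haskell):

```python
# Minimal sketch of Hindley-Milner style unification (illustrative only).
# Types are tuples: ("var", name), ("con", name), or ("fun", arg_t, ret_t).

def apply(t, subst):
    """Walk a type, replacing bound type variables with their bindings."""
    if t[0] == "var" and t[1] in subst:
        return apply(subst[t[1]], subst)
    if t[0] == "fun":
        return ("fun", apply(t[1], subst), apply(t[2], subst))
    return t

def occurs(name, t, subst):
    """Occurs check: does variable `name` appear inside type `t`?"""
    t = apply(t, subst)
    if t[0] == "var":
        return t[1] == name
    if t[0] == "fun":
        return occurs(name, t[1], subst) or occurs(name, t[2], subst)
    return False

def bind(name, t, subst):
    if t[0] == "var" and t[1] == name:
        return subst
    if occurs(name, t, subst):
        raise TypeError("infinite type")
    return {**subst, name: t}

def unify(t1, t2, subst):
    """Return a substitution making t1 and t2 equal, or raise TypeError."""
    t1, t2 = apply(t1, subst), apply(t2, subst)
    if t1 == t2:
        return subst
    if t1[0] == "var":
        return bind(t1[1], t2, subst)
    if t2[0] == "var":
        return bind(t2[1], t1, subst)
    if t1[0] == "fun" and t2[0] == "fun":
        subst = unify(t1[1], t2[1], subst)   # unify argument types
        return unify(t1[2], t2[2], subst)    # then result types
    raise TypeError(f"cannot unify {t1} with {t2}")

# Applying `id : a -> a` to an Int: unify (a -> a) with (Int -> b)
s = unify(("fun", ("var", "a"), ("var", "a")),
          ("fun", ("con", "Int"), ("var", "b")), {})
print(s)  # {'a': ('con', 'Int'), 'b': ('con', 'Int')}
```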
opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13
Fair question.
Evaluation refers to the phase after training that checks whether the trained model is any good.
Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small domain-specific subset)!
So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation; however, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries may be similar, and all of them evaluate on every given query. That's where this project might come in handy.
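To make the idea concrete, here is a hypothetical sketch (every name below is illustrative, not this project's actual API) of evaluating only one representative per group of near-duplicate queries:

```python
# Illustrative sketch: exploit similarity between evaluation queries so the
# slow, expensive model is called once per group of near-duplicates.
# `run_model` and `score` are hypothetical stand-ins for your LLM and metric.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap text-overlap similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_queries(queries, threshold=0.85):
    """Greedily group queries whose overlap with a cluster seed exceeds threshold."""
    clusters = []
    for q in queries:
        for cluster in clusters:
            if similarity(q, cluster[0]) >= threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters

def evaluate_with_dedup(queries, run_model, score):
    """Call the model once per cluster and reuse the score for its members."""
    scores = {}
    for cluster in cluster_queries(queries):
        s = score(run_model(cluster[0]))  # one slow LLM call per cluster
        for q in cluster:
            scores[q] = s                 # extrapolated to near-duplicates
    return scores
```

The project itself uses Bayesian optimization rather than greedy text clustering to choose which queries to evaluate, but the saving comes from the same observation.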
uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and gives insights on how to resolve them.
Currently seeking feedback for the tool. Would love it if you could check it out at: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb
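A minimal usage sketch, adapted from the UpTrain README (signatures may differ across versions, and the API key is a placeholder):

```python
# Adapted from the UpTrain README; check the docs for current signatures.
from uptrain import EvalLLM, Evals

data = [{
    "question": "Which is the most popular global sport?",
    "context": "Football (soccer) is played by an estimated 250 million players worldwide.",
    "response": "Football is the most popular sport in the world.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_COMPLETENESS],
)
print(results)  # per-row grades for each configured check
```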
pycm
PyCM is a multi-class confusion matrix library written in Python for post-classification model evaluation.
Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07
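A minimal example of the core API (per the PyCM docs; output details vary by version):

```python
from pycm import ConfusionMatrix

# Actual vs. predicted labels for a 3-class problem
cm = ConfusionMatrix(actual_vector=[2, 0, 2, 2, 0, 1],
                     predict_vector=[0, 0, 2, 2, 0, 2])
print(cm.classes)      # [0, 1, 2]
print(cm.Overall_ACC)  # overall accuracy (4/6 here)
print(cm)              # full matrix plus per-class statistics
```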
LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Project mention: A Survey on Evaluation of Large Language Models | news.ycombinator.com | 2023-07-18
lispy
Project mention: Sapling: A highly experimental vi-inspired editor where you edit code, not text | news.ycombinator.com | 2024-02-04
alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Alpaca Eval is open source and was developed by the same team who trained the Alpaca model, afaik. It is not like what you said in the other comment.
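AlpacaEval is primarily driven from the command line (`alpaca_eval --model_outputs outputs.json`, per the README); the rough Python sketch below follows the same flow, with the exact signature of `evaluate` being an assumption that may differ by version:

```python
# Hedged sketch: follows the shape of the AlpacaEval README; the exact
# signature of `evaluate` is an assumption and may differ by version.
from alpaca_eval import evaluate

# outputs.json: a list of {"instruction": ..., "output": ...} records
df_leaderboard, annotations = evaluate(
    model_outputs="outputs.json",          # your model's generations
    annotators_config="alpaca_eval_gpt4",  # GPT-4 as the automatic judge
)
print(df_leaderboard)                      # win rate vs. the reference outputs
```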
semantic-kitti-api
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
ExpressionEvaluator
A simple math and pseudo-C# expression evaluator in a single C# file. Can also execute small C#-like scripts.
long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
Project mention: An Open Source Tool for Multimodal Fact Verification | news.ycombinator.com | 2024-04-06
Isn't this similar to the DeepMind paper on long-form factuality posted a few days ago?
https://arxiv.org/abs/2403.18802
https://github.com/google-deepmind/long-form-factuality/tree...
Eval-Expression.NET
C# Eval Expression | Evaluate, compile, and execute C# code and expressions at runtime.
errant
ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
Project mention: Given the rise of LLMs, is a toolkit like ERRANT still relevant? | /r/LanguageTechnology | 2023-12-10
ERRANT automatically annotates parallel English sentences with error type information.
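The core API, per the ERRANT README (requires spaCy and an installed English model):

```python
import errant

annotator = errant.load('en')  # loads spaCy; needs an English model installed
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')

for e in annotator.annotate(orig, cor):
    # Each edit carries the original span, the correction, and an error type
    print(e.o_str, '->', e.c_str, '|', e.type)  # e.g. are -> is | R:VERB:SVA
```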
ranx
A blazing-fast Python library for ranking evaluation, comparison, and fusion.
Ranx is a great library for mixing results from different sources.
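A small fusion example in the style of the ranx README (method names and metrics follow its docs; exact defaults may vary by version):

```python
# In the style of the ranx README; exact defaults may vary by version.
from ranx import Qrels, Run, evaluate, fuse

qrels = Qrels({"q_1": {"doc_a": 1, "doc_b": 2}})                  # relevance judgements
bm25 = Run({"q_1": {"doc_a": 0.8, "doc_b": 0.4}}, name="bm25")    # lexical retriever
dense = Run({"q_1": {"doc_a": 0.3, "doc_b": 0.9}}, name="dense")  # neural retriever

combined = fuse(runs=[bm25, dense], method="rrf")  # reciprocal rank fusion
print(evaluate(qrels, combined, ["ndcg@10", "map@10"]))
```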
Evaluation-related posts
- An Open Source Tool for Multimodal Fact Verification
- Show HN: Times faster LLM evaluation with Bayesian optimization
- Given the rise of LLMs, is a toolkit like ERRANT still relevant?
- UltraLM-13B reaches top of AlpacaEval leaderboard
- [P] AlpacaEval : An Automatic Evaluator for Instruction-following Language Models
- evalidate - Safe evaluation of untrusted user-supplied python expression
- @initminal/run - Safe & fast code eval in the browser with modern ESM features, dynamic module injection and more...
Index
What are some of the best open-source Evaluation projects? This list will help you:
| # | Project | Stars |
|---|---------|-------|
| 1 | awesome-semantic-segmentation | 10,220 |
| 2 | govaluate | 3,529 |
| 3 | write-you-a-haskell | 3,304 |
| 4 | klipse | 3,088 |
| 5 | opencompass | 2,403 |
| 6 | promptbench | 1,954 |
| 7 | uptrain | 1,951 |
| 8 | evaluate | 1,803 |
| 9 | EvalAI | 1,673 |
| 10 | avalanche | 1,654 |
| 11 | pycm | 1,428 |
| 12 | LLM-eval-survey | 1,206 |
| 13 | lispy | 1,183 |
| 14 | alpaca_eval | 1,058 |
| 15 | torch-fidelity | 870 |
| 16 | semantic-kitti-api | 722 |
| 17 | gval | 696 |
| 18 | ExpressionEvaluator | 562 |
| 19 | long-form-factuality | 428 |
| 20 | Eval-Expression.NET | 423 |
| 21 | simpleeval | 420 |
| 22 | errant | 410 |
| 23 | ranx | 325 |