Top 23 Python Evaluation Projects
- opencompass: OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- uptrain: UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root cause analysis on failure cases, and gives insights on how to resolve them.
- semantic-kitti-api: SemanticKITTI API for visualizing the dataset, processing data, and evaluating results.
- long-form-factuality: Benchmarking long-form factuality in large language models; original code for the paper "Long-form factuality in large language models".
- errant: ERRor ANnotation Toolkit; automatically extracts and classifies grammatical errors in parallel original and corrected sentences.
- generative-evaluation-prdc: Code base for the precision, recall, density, and coverage (PRDC) metrics for generative models (ICML 2020); see the usage sketch after this list.
- FActScore: A package to evaluate the factuality of long-form generation; original implementation of the EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation".
- ChatGPT_for_IE: Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness.
- precision-recall-distributions: Assessing Generative Models via Precision and Recall (official repository).
- django-access: Django-Access, the application introducing dynamic evaluation-based instance-level (row-level) access rights control for Django.
- BooookScore: A package to generate summaries of long-form text and evaluate the coherence of these summaries; official package for the ICLR 2024 paper "BooookScore: A systematic exploration of book-length summarization in the era of LLMs".
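For the generative-evaluation-prdc entry above, computing the four metrics looks roughly like the sketch below, following the usage shown in the repository's README; the random arrays are placeholder stand-ins for real and generated feature embeddings, and keyword names may differ between package versions.

```python
import numpy as np
from prdc import compute_prdc  # pip install prdc

# Placeholder features; in practice these come from an embedding network
# applied to real samples and to samples drawn from the generative model.
real_features = np.random.normal(size=(1000, 64))
fake_features = np.random.normal(size=(1000, 64))

metrics = compute_prdc(real_features=real_features,
                       fake_features=fake_features,
                       nearest_k=5)

# Expected keys: 'precision', 'recall', 'density', 'coverage'
print(metrics)
```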
Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13

Fair question. "Evaluate" refers to the phase after training that checks whether the training went well. Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small, domain-specific subset)!

So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries may be similar; they all evaluate every given query. And that's where this project might come in handy.
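The idea in that last point (skip model calls for queries that are near-duplicates of ones already evaluated) can be sketched in a few lines. This is only an illustrative heuristic using token overlap, not the actual selection strategy of the project being discussed:

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two prompts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def select_representatives(queries, threshold=0.8):
    """Keep one representative per group of near-duplicate queries."""
    kept = []
    for q in queries:
        if all(token_jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept


queries = [
    "Summarize the plot of Hamlet.",
    "Summarize the plot of Hamlet briefly.",
    "Translate 'good morning' into French.",
]
# Only the surviving representatives would be sent to the (slow, expensive) LLM.
print(select_representatives(queries))
```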
Currently seeking feedback on the tool. Would love it if you could check it out at: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb
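For context, running UpTrain's preconfigured checks over a batch of responses looks roughly like the following sketch, adapted from the quick-start in the UpTrain README; the API key, sample data, and exact check names are assumptions that may differ across versions.

```python
from uptrain import EvalLLM, Evals  # pip install uptrain

# Hypothetical sample; real data would come from your application logs.
data = [{
    "question": "Which is the most popular global sport?",
    "context": "Football is played by roughly 250 million players worldwide.",
    "response": "Football is the most popular sport globally.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE],
)
print(results)  # per-row scores and explanations for each check
```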
Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07
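PyCM itself is a confusion-matrix statistics library; its core usage is only a few lines, as in this minimal sketch (attribute names should be double-checked against the PyCM docs):

```python
from pycm import ConfusionMatrix  # pip install pycm

actual  = [0, 1, 2, 2, 0, 1]
predict = [0, 2, 2, 2, 0, 0]

cm = ConfusionMatrix(actual_vector=actual, predict_vector=predict)

print(cm.Overall_ACC)  # overall accuracy
print(cm.table)        # raw confusion matrix as a nested dict
print(cm)              # full per-class and overall statistics report
```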
Project mention: An Open Source Tool for Multimodal Fact Verification | news.ycombinator.com | 2024-04-06

Isn't this similar to the DeepMind paper on long-form factuality posted a few days ago?
https://arxiv.org/abs/2403.18802
https://github.com/google-deepmind/long-form-factuality/tree...
Project mention: Given the rise of LLMs, is a toolkit like ERRANT still relevant? | /r/LanguageTechnology | 2023-12-10

ERRANT automatically annotates parallel English sentences with error type information.
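That annotation step looks roughly like the sketch below, following the usage shown in the ERRANT README; it assumes a spaCy English model is installed, and attribute names may vary by version.

```python
import errant  # pip install errant; also requires a spaCy English model

annotator = errant.load("en")

# Parse the original (erroneous) and corrected sentences.
orig = annotator.parse("This are gramatical sentence .")
cor = annotator.parse("This is a grammatical sentence .")

# Align the two parses and classify each edit with an error type (e.g. R:VERB:SVA).
for edit in annotator.annotate(orig, cor):
    print(edit.o_str, "->", edit.c_str, "|", edit.type)
```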
Ranx is a great library for mixing results from different sources.
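As a rough illustration of that "mixing" (rank fusion), here is a minimal sketch based on my reading of the ranx README; the run data and the norm/method arguments are assumptions and should be checked against the library's docs.

```python
from ranx import Run, fuse  # pip install ranx

# Two hypothetical retrieval runs: query id -> {doc id: score}.
run_bm25 = Run({"q1": {"d1": 12.3, "d2": 9.1}, "q2": {"d3": 7.7}})
run_dense = Run({"q1": {"d2": 0.92, "d4": 0.81}, "q2": {"d3": 0.66}})

# Normalize the scores and combine the runs into a single ranking.
combined = fuse(runs=[run_bm25, run_dense], norm="min-max", method="sum")
# `combined` is itself a Run that can be evaluated or saved like any other.
```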
Looks like a slight modification of FActScore [1], but instead of using Wikipedia as a verification source, they use Google Search. They also claim to include a wider range of topics. That said, FActScore allows you to use whatever knowledge source and topics you want [2].
[1]: https://arxiv.org/abs/2305.14251
[2]: https://github.com/shmsw25/FActScore?tab=readme-ov-file#to-u...
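For reference, scoring generations with FActScore looks roughly like the snippet below, adapted from the repository's README; the key file, topics, and the custom-knowledge-source call (the feature referenced in [2]) are placeholders/assumptions and may differ between versions.

```python
from factscore.factscorer import FactScorer

fs = FactScorer(openai_key="api.key")  # path to a file containing an OpenAI key

# Optionally register your own knowledge source instead of the default Wikipedia dump,
# per the README section linked in [2] (the names and paths here are hypothetical).
# fs.register_knowledge_source("my_kb", data_path="my_kb.jsonl", db_path="my_kb.db")

topics = ["Marie Curie"]  # the entity each generation is about
generations = ["Marie Curie was a physicist and chemist who pioneered research on radioactivity."]

out = fs.get_score(topics, generations, gamma=10)
print(out["score"])  # fraction of atomic facts supported by the knowledge source
```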
Leaderboard: https://github.com/allenai/CommonGen-Eval?tab=readme-ov-file...
Project mention: Evaluating faithfulness and content selection of LLMs in book-length summaries | news.ycombinator.com | 2024-04-09

With a link to https://arxiv.org/pdf/2310.00785.pdf - which then links to another GitHub repository, https://github.com/lilakk/BooookScore, which has a bunch of prompts in https://github.com/lilakk/BooookScore/tree/main/prompts
Which makes me think that this original paper isn't evaluating LLMs so much as it's evaluating that one particular prompting technique for long summaries.
Gemini Pro 1.5 has a 1M-token context length, which should remove the need for weird hierarchical summary tricks. I wonder how well it would score?
Project mention: Call for open source devs to build the future of healthcare! | /r/github | 2023-06-30
Python Evaluation related posts
- An Open Source Tool for Multimodal Fact Verification
- Show HN: Times faster LLM evaluation with Bayesian optimization
- Given the rise of LLMs, is a toolkit like ERRANT still relevant?
- evalidate - Safe evaluation of untrusted user-supplied Python expressions
- [D] The MMSegmentation library from OpenMMLab appears to return the wrong results when computing basic image segmentation metrics such as the Jaccard index (IoU, intersection-over-union). It appears to compute recall (sensitivity) instead of IoU, which artificially inflates the performance metrics.
- [D] Can we use Ray for distributed training on Vertex AI? Can someone provide examples of this? Also, which dataframe libraries did you use for training machine learning models on huge datasets (100 GB+), since pandas can't handle huge data?
- Need help with a data science project
-
A note from our sponsor - InfluxDB
www.influxdata.com | 4 May 2024
Index
What are some of the best open-source Evaluation projects in Python? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | opencompass | 2,559 |
2 | promptbench | 2,061 |
3 | uptrain | 1,999 |
4 | evaluate | 1,819 |
5 | EvalAI | 1,688 |
6 | avalanche | 1,674 |
7 | pycm | 1,430 |
8 | torch-fidelity | 872 |
9 | semantic-kitti-api | 725 |
10 | long-form-factuality | 443 |
11 | simpleeval | 423 |
12 | errant | 410 |
13 | ranx | 344 |
14 | rexmex | 276 |
15 | generative-evaluation-prdc | 234 |
16 | FActScore | 215 |
17 | ChatGPT_for_IE | 134 |
18 | precision-recall-distributions | 95 |
19 | CommonGen-Eval | 79 |
20 | django-access | 76 |
21 | BooookScore | 67 |
22 | cyclops | 66 |
23 | ice-score | 61 |