Top 23 Python Evaluation Projects
- opencompass: OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- uptrain: UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root cause analysis on failure cases, and gives insights on how to resolve them.
- semantic-kitti-api: SemanticKITTI API for visualizing the dataset, processing data, and evaluating results.
- long-form-factuality: Benchmarking long-form factuality in large language models; original code for the paper "Long-form factuality in large language models".
- errant: ERRor ANnotation Toolkit; automatically extracts and classifies grammatical errors in parallel original and corrected sentences.
- generative-evaluation-prdc: Code base for the precision, recall, density, and coverage (PRDC) metrics for generative models (ICML 2020); see the usage sketch after this list.
- FActScore: A package to evaluate the factuality of long-form generation; original implementation of the EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation".
- ChatGPT_for_IE: Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness.
- precision-recall-distributions: Assessing Generative Models via Precision and Recall (official repository).
- django-access: Django-Access, the application introducing dynamic evaluation-based instance-level (row-level) access rights control for Django.
- BooookScore: A package to generate summaries of long-form text and evaluate the coherence of these summaries; official package for the ICLR 2024 paper "BooookScore: A systematic exploration of book-length summarization in the era of LLMs".
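For the generative-evaluation-prdc entry above, computing the four metrics looks roughly like the sketch below, following the usage shown in the repository's README; the random arrays are placeholder stand-ins for real and generated feature embeddings, and keyword names may differ between package versions.

```python
import numpy as np
from prdc import compute_prdc  # pip install prdc

# Placeholder features; in practice these come from an embedding network
# applied to real samples and to samples drawn from the generative model.
real_features = np.random.normal(size=(1000, 64))
fake_features = np.random.normal(size=(1000, 64))

metrics = compute_prdc(real_features=real_features,
                       fake_features=fake_features,
                       nearest_k=5)

# Expected keys: 'precision', 'recall', 'density', 'coverage'
print(metrics)
```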
Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13

Fair question. "Evaluate" refers to the phase after training that checks whether the training went well. Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small, domain-specific subset)!

So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries may be similar; they all evaluate every given query. And that's where this project might come in handy.
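The idea in that last point (skip model calls for queries that are near-duplicates of ones already evaluated) can be sketched in a few lines. This is only an illustrative heuristic using token overlap, not the actual selection strategy of the project being discussed:

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two prompts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def select_representatives(queries, threshold=0.8):
    """Keep one representative per group of near-duplicate queries."""
    kept = []
    for q in queries:
        if all(token_jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept


queries = [
    "Summarize the plot of Hamlet.",
    "Summarize the plot of Hamlet briefly.",
    "Translate 'good morning' into French.",
]
# Only the surviving representatives would be sent to the (slow, expensive) LLM.
print(select_representatives(queries))
```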
Currently seeking feedback on the tool. Would love it if you could check it out at: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb
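For context, running UpTrain's preconfigured checks over a batch of responses looks roughly like the following sketch, adapted from the quick-start in the UpTrain README; the API key, sample data, and exact check names are assumptions that may differ across versions.

```python
from uptrain import EvalLLM, Evals  # pip install uptrain

# Hypothetical sample; real data would come from your application logs.
data = [{
    "question": "Which is the most popular global sport?",
    "context": "Football is played by roughly 250 million players worldwide.",
    "response": "Football is the most popular sport globally.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE],
)
print(results)  # per-row scores and explanations for each check
```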
Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07
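PyCM itself is a confusion-matrix statistics library; its core usage is only a few lines, as in this minimal sketch (attribute names should be double-checked against the PyCM docs):

```python
from pycm import ConfusionMatrix  # pip install pycm

actual  = [0, 1, 2, 2, 0, 1]
predict = [0, 2, 2, 2, 0, 0]

cm = ConfusionMatrix(actual_vector=actual, predict_vector=predict)

print(cm.Overall_ACC)  # overall accuracy
print(cm.table)        # raw confusion matrix as a nested dict
print(cm)              # full per-class and overall statistics report
```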
Project mention: An Open Source Tool for Multimodal Fact Verification | news.ycombinator.com | 2024-04-06

Isn't this similar to the DeepMind paper on long-form factuality posted a few days ago?
https://arxiv.org/abs/2403.18802
https://github.com/google-deepmind/long-form-factuality/tree...
Project mention: Given the rise of LLMs, is a toolkit like ERRANT still relevant? | /r/LanguageTechnology | 2023-12-10

ERRANT automatically annotates parallel English sentences with error type information.
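That annotation step looks roughly like the sketch below, following the usage shown in the ERRANT README; it assumes a spaCy English model is installed, and attribute names may vary by version.

```python
import errant  # pip install errant; also requires a spaCy English model

annotator = errant.load("en")

# Parse the original (erroneous) and corrected sentences.
orig = annotator.parse("This are gramatical sentence .")
cor = annotator.parse("This is a grammatical sentence .")

# Align the two parses and classify each edit with an error type (e.g. R:VERB:SVA).
for edit in annotator.annotate(orig, cor):
    print(edit.o_str, "->", edit.c_str, "|", edit.type)
```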
Ranx is a great library for mixing results from different sources.
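As a rough illustration of that "mixing" (rank fusion), here is a minimal sketch based on my reading of the ranx README; the run data and the norm/method arguments are assumptions and should be checked against the library's docs.

```python
from ranx import Run, fuse  # pip install ranx

# Two hypothetical retrieval runs: query id -> {doc id: score}.
run_bm25 = Run({"q1": {"d1": 12.3, "d2": 9.1}, "q2": {"d3": 7.7}})
run_dense = Run({"q1": {"d2": 0.92, "d4": 0.81}, "q2": {"d3": 0.66}})

# Normalize the scores and combine the runs into a single ranking.
combined = fuse(runs=[run_bm25, run_dense], norm="min-max", method="sum")
# `combined` is itself a Run that can be evaluated or saved like any other.
```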
Looks like a slight modification of FActScore [1], but instead of using Wikipedia as a verification source, they use Google Search. They also claim to include a wider range of topics. That said, FActScore allows you to use whatever knowledge source and topics you want [2].
[1]: https://arxiv.org/abs/2305.14251
[2]: https://github.com/shmsw25/FActScore?tab=readme-ov-file#to-u...
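For reference, scoring generations with FActScore looks roughly like the snippet below, adapted from the repository's README; the key file, topics, and the custom-knowledge-source call (the feature referenced in [2]) are placeholders/assumptions and may differ between versions.

```python
from factscore.factscorer import FactScorer

fs = FactScorer(openai_key="api.key")  # path to a file containing an OpenAI key

# Optionally register your own knowledge source instead of the default Wikipedia dump,
# per the README section linked in [2] (the names and paths here are hypothetical).
# fs.register_knowledge_source("my_kb", data_path="my_kb.jsonl", db_path="my_kb.db")

topics = ["Marie Curie"]  # the entity each generation is about
generations = ["Marie Curie was a physicist and chemist who pioneered research on radioactivity."]

out = fs.get_score(topics, generations, gamma=10)
print(out["score"])  # fraction of atomic facts supported by the knowledge source
```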
Leaderboard: https://github.com/allenai/CommonGen-Eval?tab=readme-ov-file...
Project mention: Evaluating faithfulness and content selection of LLMs in book-length summaries | news.ycombinator.com | 2024-04-09

With a link to https://arxiv.org/pdf/2310.00785.pdf - which then links to another GitHub repository, https://github.com/lilakk/BooookScore, which has a bunch of prompts in https://github.com/lilakk/BooookScore/tree/main/prompts
Which makes me think that this original paper isn't evaluating LLMs so much as it's evaluating that one particular prompting technique for long summaries.
Gemini Pro 1.5 has a 1M-token context length, which should remove the need for weird hierarchical summary tricks. I wonder how well it would score?
Project mention: Call for open source devs to build the future of healthcare! | /r/github | 2023-06-30
Python Evaluation related posts
- An Open Source Tool for Multimodal Fact Verification
- Show HN: Times faster LLM evaluation with Bayesian optimization
- Given the rise of LLMs, is a toolkit like ERRANT still relevant?
- evalidate - Safe evaluation of untrusted user-supplied Python expressions
- [D] The MMSegmentation library from OpenMMLab appears to return the wrong results when computing basic image segmentation metrics such as the Jaccard index (IoU, intersection-over-union). It appears to compute recall (sensitivity) instead of IoU, which artificially inflates the performance metrics.
- [D] Can we use Ray for distributed training on Vertex AI? Can someone provide examples of this? Also, which dataframe libraries did you use for training machine learning models on huge datasets (100 GB+), since pandas can't handle huge data?
- Need help with a data science project
-
A note from our sponsor - InfluxDB
www.influxdata.com | 4 May 2024
Index
What are some of the best open-source Evaluation projects in Python? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | opencompass | 2,559 |
2 | promptbench | 2,061 |
3 | uptrain | 1,999 |
4 | evaluate | 1,819 |
5 | EvalAI | 1,688 |
6 | avalanche | 1,674 |
7 | pycm | 1,430 |
8 | torch-fidelity | 872 |
9 | semantic-kitti-api | 725 |
10 | long-form-factuality | 443 |
11 | simpleeval | 423 |
12 | errant | 410 |
13 | ranx | 344 |
14 | rexmex | 276 |
15 | generative-evaluation-prdc | 234 |
16 | FActScore | 215 |
17 | ChatGPT_for_IE | 134 |
18 | precision-recall-distributions | 95 |
19 | CommonGen-Eval | 79 |
20 | django-access | 76 |
21 | BooookScore | 67 |
22 | cyclops | 66 |
23 | ice-score | 61 |