Python Evaluation

Open-source Python projects categorized as Evaluation

Top 23 Python Evaluation Projects

  • opencompass

    OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.

  • Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13

    Fair question.

    Evaluation refers to the phase after training where you check whether the training actually worked.

    Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small, domain-specific subset)!

    So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation; however, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them take advantage of the fact that many evaluation queries may be similar; they all evaluate on every given query. And that's where this project might come in handy.

  • promptbench

    A unified evaluation framework for large language models

  • Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13


  • uptrain

    UpTrain is an open-source, unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and offers insights on how to resolve them.

  • Project mention: Evaluation of OpenAI Assistants | dev.to | 2024-04-09

    Currently seeking feedback on the tool. Would love it if you could check it out at: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb

  • evaluate

    🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
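
    A minimal usage sketch, assuming the `accuracy` metric is available to load by name (the toy labels below are made up):

    ```python
    # Requires: pip install evaluate
    import evaluate

    # Load a metric by name from the Hugging Face Hub.
    accuracy = evaluate.load("accuracy")

    # Compare model predictions against reference labels (toy data).
    results = accuracy.compute(
        predictions=[0, 1, 1, 0],
        references=[0, 1, 0, 0],
    )
    print(results)  # {'accuracy': 0.75}
    ```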

  • EvalAI

    Evaluating state of the art in AI

  • avalanche

    Avalanche: an End-to-End Library for Continual Learning based on PyTorch.

  • pycm

    Multi-class confusion matrix library in Python

  • Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07
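
    A small sketch of the core API on toy label vectors; the attribute accessed at the end is assumed to be one of the overall statistics the library exposes:

    ```python
    # Requires: pip install pycm
    from pycm import ConfusionMatrix

    # Build a multi-class confusion matrix from actual and predicted labels (toy data).
    actual = [2, 0, 2, 2, 0, 1]
    predict = [0, 0, 2, 2, 0, 2]
    cm = ConfusionMatrix(actual_vector=actual, predict_vector=predict)

    print(cm)              # prints the matrix plus overall and per-class statistics
    print(cm.Overall_ACC)  # individual statistics are also exposed as attributes
    ```
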
  • torch-fidelity

    High-fidelity performance metrics for generative models in PyTorch
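
    A rough sketch of computing metrics between two sets of images; the folder paths below are placeholders:

    ```python
    # Requires: pip install torch-fidelity
    import torch_fidelity

    # Compute common generative-model metrics between generated and reference images
    # (the folder paths below are placeholders).
    metrics = torch_fidelity.calculate_metrics(
        input1="path/to/generated_images",
        input2="path/to/real_images",
        isc=True,  # Inception Score
        fid=True,  # Frechet Inception Distance
        kid=True,  # Kernel Inception Distance
    )
    print(metrics)
    ```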

  • semantic-kitti-api

    SemanticKITTI API for visualizing dataset, processing data, and evaluating results.

  • long-form-factuality

    Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".

  • Project mention: An Open Source Tool for Multimodal Fact Verification | news.ycombinator.com | 2024-04-06

    Isn't this similar to the Deepmind paper on long form factuality posted a few days ago?

    https://arxiv.org/abs/2403.18802

    https://github.com/google-deepmind/long-form-factuality/tree...

  • simpleeval

    Simple Safe Sandboxed Extensible Expression Evaluator for Python
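
    A quick sketch of the expression evaluator, including how a restricted set of names and functions can be exposed to an untrusted expression:

    ```python
    # Requires: pip install simpleeval
    from simpleeval import simple_eval

    # Evaluate a plain arithmetic expression from an untrusted source.
    print(simple_eval("21 + 21"))  # 42

    # Expose only a restricted set of variables and functions to the expression.
    print(simple_eval(
        "x + squared(y)",
        names={"x": 2, "y": 3},
        functions={"squared": lambda n: n * n},
    ))  # 11
    ```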

  • errant

    ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.

  • Project mention: Given the rise of LLMs, is a toolkit like ERRANT still relevant? | /r/LanguageTechnology | 2023-12-10

    ERRANT automatically annotates parallel English sentences with error type information.
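
    A minimal sketch of the Python API, assuming the English spaCy model it relies on is installed; the sentence pair below is made up and pre-tokenized:

    ```python
    # Requires: pip install errant  (plus the spaCy English model it relies on)
    import errant

    annotator = errant.load("en")

    # Parse an original sentence and its corrected counterpart (toy, pre-tokenized text).
    orig = annotator.parse("This are a grammatical sentence .")
    cor = annotator.parse("This is a grammatical sentence .")

    # Extract and classify the edits between the two parses.
    for edit in annotator.annotate(orig, cor):
        print(edit)
    ```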

  • ranx

    ⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

  • Project mention: Sparse Vectors in Qdrant: Pure Vector-based Hybrid Search | dev.to | 2024-02-19

    Ranx is a great library for mixing results from different sources.
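
    A brief sketch of evaluating and fusing runs; the query, document IDs, and scores below are made up:

    ```python
    # Requires: pip install ranx
    from ranx import Qrels, Run, evaluate, fuse

    # Relevance judgements and two retrieval runs, keyed by query id and doc id (toy data).
    qrels = Qrels({"q_1": {"doc_a": 1, "doc_b": 2}})
    run_1 = Run({"q_1": {"doc_a": 0.9, "doc_b": 0.8, "doc_c": 0.3}})
    run_2 = Run({"q_1": {"doc_b": 0.7, "doc_c": 0.6}})

    # Score a single run against the judgements...
    print(evaluate(qrels, run_1, ["ndcg@10", "map@10"]))

    # ...or fuse several runs into one (as in hybrid search) and score the result.
    combined = fuse(runs=[run_1, run_2], method="rrf")
    print(evaluate(qrels, combined, "ndcg@10"))
    ```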

  • rexmex

    A general purpose recommender metrics library for fair evaluation.

  • generative-evaluation-prdc

    Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
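
    A short sketch, assuming the code is installed as the `prdc` package (as its README suggests); the random feature arrays are placeholders for real embeddings such as Inception features:

    ```python
    # Requires: pip install prdc
    import numpy as np
    from prdc import compute_prdc

    # Feature vectors for real and generated samples (random placeholders here;
    # in practice these would be embeddings such as Inception features).
    real_features = np.random.normal(size=(1000, 64))
    fake_features = np.random.normal(size=(1000, 64))

    metrics = compute_prdc(real_features=real_features,
                           fake_features=fake_features,
                           nearest_k=5)
    print(metrics)  # dict with precision, recall, density, and coverage
    ```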

  • FActScore

    A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"

  • Project mention: Long-form factuality in large language models | news.ycombinator.com | 2024-04-06

    Looks like a slight modification of FActScore [1], but instead of using Wikipedia as a verification source, they use Google Search. They also claim to include a wider range of topics. That said, FActScore allows you to use whatever knowledge source and topics you want [2].

    [1]: https://arxiv.org/abs/2305.14251

    [2]: https://github.com/shmsw25/FActScore?tab=readme-ov-file#to-u...

  • ChatGPT_for_IE

    Evaluating ChatGPT’s Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

  • precision-recall-distributions

    Assessing Generative Models via Precision and Recall (official repository)

  • CommonGen-Eval

    Evaluating LLMs with CommonGen-Lite

  • Project mention: Evaluating LLMs with CommonGen-Lite | news.ycombinator.com | 2024-01-08

    Leaderboard: https://github.com/allenai/CommonGen-Eval?tab=readme-ov-file...

  • django-access

    Django-Access: an application introducing dynamic, evaluation-based instance-level (row-level) access-rights control for Django

  • BooookScore

    A package to generate summaries of long-form text and evaluate the coherence of these summaries. Official package for our ICLR 2024 paper, "BooookScore: A systematic exploration of book-length summarization in the era of LLMs".

  • Project mention: Evaluating faithfulness and content selection of LLMs in book-length summaries | news.ycombinator.com | 2024-04-09

    With a link to https://arxiv.org/pdf/2310.00785.pdf - which then links to another GitHub repository, https://github.com/lilakk/BooookScore which has a bunch of prompts in https://github.com/lilakk/BooookScore/tree/main/prompts

    Which makes me think that this original paper isn't evaluating LLMs so much as it's evaluating that one particular prompting technique for long summaries.

    Gemini Pro 1.5 has a 1M-token context length, which should remove the need for weird hierarchical summary tricks. I wonder how well it would score?

  • cyclops

    Toolkit for health AI implementation (by VectorInstitute)

  • Project mention: Call for open source devs to build the future of healthcare! | /r/github | 2023-06-30
  • ice-score

    [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Evaluation related posts

  • An Open Source Tool for Multimodal Fact Verification

    7 projects | news.ycombinator.com | 6 Apr 2024
  • Show HN: Times faster LLM evaluation with Bayesian optimization

    6 projects | news.ycombinator.com | 13 Feb 2024
  • Given the rise of LLMs, is a toolkit like ERRANT still relevant?

    1 project | /r/LanguageTechnology | 10 Dec 2023
  • evalidate - Safe evaluation of untrusted user-supplied python expression

    2 projects | /r/Python | 30 May 2023
  • [D] The MMSegmentation library from OpenMMLab appears to return the wrong results when computing basic image segmentation metrics such as the Jaccard index (IoU - intersection-over-union). It appears to compute recall (sensitivity) instead of IoU, which artificially inflates the performance metrics.

    2 projects | /r/MachineLearning | 6 Mar 2023
  • [D] Can we use Ray for distributed training on vertex ai ? Can someone provide me examples for the same ? Also which dataframe libraries you guys used for training machine learning models on huge datasets (100 gb+) (because pandas can't handle huge data).

    1 project | /r/MLQuestions | 9 Feb 2023
  • Need help with a data science project

    1 project | /r/learnmachinelearning | 30 Jan 2023

Index

What are some of the best open-source Evaluation projects in Python? This list will help you:

| # | Project | Stars |
|---|---------|-------|
| 1 | opencompass | 2,559 |
| 2 | promptbench | 2,061 |
| 3 | uptrain | 1,999 |
| 4 | evaluate | 1,819 |
| 5 | EvalAI | 1,688 |
| 6 | avalanche | 1,674 |
| 7 | pycm | 1,430 |
| 8 | torch-fidelity | 872 |
| 9 | semantic-kitti-api | 725 |
| 10 | long-form-factuality | 443 |
| 11 | simpleeval | 423 |
| 12 | errant | 410 |
| 13 | ranx | 344 |
| 14 | rexmex | 276 |
| 15 | generative-evaluation-prdc | 234 |
| 16 | FActScore | 215 |
| 17 | ChatGPT_for_IE | 134 |
| 18 | precision-recall-distributions | 95 |
| 19 | CommonGen-Eval | 79 |
| 20 | django-access | 76 |
| 21 | BooookScore | 67 |
| 22 | cyclops | 66 |
| 23 | ice-score | 61 |
