Top 11 Python evaluation-metric Projects

OCTIS

7 681 6.0 Python

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
image-similarity-measures

3 516 4.4 Python

Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
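
To make two of the listed metrics concrete, here is a minimal NumPy sketch of RMSE and PSNR using their textbook definitions. It is not the image-similarity-measures API: the function names `rmse` and `psnr`, the `max_value` parameter, and the toy 2x2 images are assumptions chosen for illustration.

```python
import numpy as np


def rmse(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Root-mean-square error between two images of identical shape."""
    # Cast to float first so that uint8 subtraction cannot wrap around.
    ref = reference.astype(np.float64)
    cand = candidate.astype(np.float64)
    return float(np.sqrt(np.mean((ref - cand) ** 2)))


def psnr(reference: np.ndarray, candidate: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; max_value is the assumed pixel range."""
    err = rmse(reference, candidate)
    if err == 0:
        return float("inf")  # identical images
    return float(20.0 * np.log10(max_value / err))


# Hypothetical 2x2 grayscale images, for illustration only.
original = np.zeros((2, 2), dtype=np.uint8)
degraded = np.full((2, 2), 3, dtype=np.uint8)

print("RMSE:", rmse(original, degraded))
print("PSNR (dB):", psnr(original, degraded))
```

Lower RMSE and higher PSNR both indicate that the candidate is closer to the reference. The structural metrics in the list (SSIM, FSIM, and so on) need more machinery than this sketch covers.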

Project mention: Using VAE for image compression | /r/StableDiffusion | 2023-05-04

Speaking of math, using this library -- https://github.com/up42/image-similarity-measures -- I computed the following for these images vs the original image:

agentops

1 474 9.5 Python

Python SDK for agent evals and observability

Project mention: DeepEval – Unit Testing for LLMs | news.ycombinator.com | 2023-08-16

COMET

3 400 8.0 Python

A Neural Framework for MT Evaluation (by Unbabel)
ranx

1 344 6.0 Python

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

Project mention: Sparse Vectors in Qdrant: Pure Vector-based Hybrid Search | dev.to | 2024-02-19

Ranx is a great library for mixing results from different sources.
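
"Mixing results from different sources" is commonly done with rank fusion. The sketch below shows one widely used method, reciprocal rank fusion (RRF), in plain Python. It does not use ranx and is not its API: the function name, the constant `k=60`, and the document IDs are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Higher fused scores come first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense and a sparse retriever.
dense_results = ["d1", "d2", "d3"]
sparse_results = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)
```

Documents that rank well in both lists (here `d1` and `d3`) rise above documents that appear in only one of them.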

continuous-eval

3 311 8.4 Python

Open-Source Evaluation for GenAI Application Pipelines

Project mention: Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps | news.ycombinator.com | 2024-03-08

Hi HN, we are the founders of Relari, the company behind continuous-eval (https://github.com/relari-ai/continuous-eval), an evaluation framework that lets you test your GenAI systems at the component level, pinpointing issues where they originate.
We experienced the need for this when we were building a copilot for bankers. Our RAG pipeline blew up in complexity as we added components: a query classifier (to triage user intent), multiple retrievers (to grab information from different sources), filtering LLM (to rerank / compress context), a calculator agent (to call financial functions) and finally the synthesizer LLM that gives the answer. Ensuring reliability became more difficult with each of these we added.
When a bad response was detected by our answer evaluator, we had to backtrack multiple steps to understand which component(s) made a mistake. But this quickly became unscalable beyond a few samples.
I did my Ph.D. in fault detection for autonomous vehicles, and I see a strong parallel between the complexity of autonomous driving software and today's LLM pipelines. In self-driving systems, sensors, perception, prediction, planning, and control modules are all chained together. To ensure system-level safety, we use granular metrics to measure the performance of each module individually. When the vehicle makes an unexpected decision, we use these metrics to pinpoint the problem to a specific component. Only then can we make targeted improvements, systematically.
Based on this thinking, we developed the first version of continuous-eval for ourselves. Since then we’ve made it more flexible to fit various types of GenAI pipelines. Continuous-eval allows you to describe (programmatically) your pipeline and modules, and select metrics for each module. We developed 30+ metrics to cover retrieval, text generation, code generation, classification, agent tool use, etc. We now have a number of companies using us to test complex pipelines like finance copilots, enterprise search, coding agents, etc.
As an example, one customer was trying to understand why their RAG system did poorly on trend analysis queries. Through continuous-eval, they realized that the “retriever” component was retrieving 80%+ of all relevant chunks, but the “reranker” component, which filters out “irrelevant” context, was dropping that to below 50%. This enabled them to fix the problem, in their case by skipping the reranker for certain queries.
We’ve also built ensemble metrics that do a surprisingly good job of predicting user feedback. Users often rate LLM-generated answers by giving a thumbs up/down about how good the answer was. We train our custom metrics on this user data, and then use those metrics to generate thumbs up/down ratings on future LLM answers. The results turn out to be 90% aligned with what the users say. This gives developers a feedback loop from production data to offline testing and development. Some customers have found this to be our most unique advantage.
Lastly, to make the most out of evaluation, you should use a diverse dataset—ideally with ground truth labels for comprehensive and consistent assessment. Because ground truth labels are costly and time-consuming to curate manually, we also have a synthetic data generation pipeline that allows you to get started quickly. Try it here (https://www.relari.ai/#synthetic_data_demo)
What’s been your experience testing and iterating LLM apps? Please let us know your thoughts and feedback on our approaches (modular framework, leveraging user feedback, testing with synthetic data).
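
The component-level diagnosis described in the post (a retriever recalling most relevant chunks, and a reranker then dropping many of them) can be illustrated by measuring recall after each stage. This is a plain-Python sketch, not the continuous-eval API: the `recall` helper, the stage names, and the chunk IDs are hypothetical.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant chunk IDs that appear in `retrieved`."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical ground truth and per-stage outputs for a single query.
relevant_chunks = {"c1", "c2", "c3", "c4", "c5"}
stage_outputs = {
    "retriever": ["c1", "c2", "c3", "c4", "x1", "x2"],
    "reranker": ["c1", "c2", "x1"],
}

# Measuring each stage separately shows where relevant context is lost.
stage_recall = {
    stage: recall(chunks, relevant_chunks)
    for stage, chunks in stage_outputs.items()
}

for stage, value in stage_recall.items():
    print(f"{stage}: recall = {value:.2f}")
```

In this toy data the retriever keeps four of the five relevant chunks while the reranker keeps only two, so the drop is attributable to the reranker rather than to retrieval.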

generative-evaluation-prdc

2 234 0.0 Python

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
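
As a rough guide to what these four metrics measure, the NumPy sketch below follows the usual k-nearest-neighbour formulation: each sample gets a ball whose radius is the distance to its k-th nearest neighbour in its own set. It is written from those definitions and is not the repository's code. The function names, the default `k=5`, and the use of closed balls (`<=`) are assumptions, and both sets must contain more than `k` samples.

```python
import numpy as np


def pairwise_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def kth_neighbor_radius(x: np.ndarray, k: int) -> np.ndarray:
    """Distance from each row of x to its k-th nearest neighbour within x."""
    dists = pairwise_distances(x, x)
    # After sorting, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest other point.
    return np.sort(dists, axis=1)[:, k]


def prdc(real: np.ndarray, fake: np.ndarray, k: int = 5) -> dict:
    """Precision, recall, density, and coverage between two feature sets."""
    real_radii = kth_neighbor_radius(real, k)
    fake_radii = kth_neighbor_radius(fake, k)
    dists = pairwise_distances(real, fake)  # shape (n_real, n_fake)

    in_real_ball = dists <= real_radii[:, None]  # fake j inside ball of real i
    in_fake_ball = dists <= fake_radii[None, :]  # real i inside ball of fake j

    return {
        # Share of fake samples that fall inside at least one real ball.
        "precision": float(in_real_ball.any(axis=0).mean()),
        # Share of real samples that fall inside at least one fake ball.
        "recall": float(in_fake_ball.any(axis=1).mean()),
        # Average number of real balls containing each fake sample, over k.
        "density": float(in_real_ball.sum(axis=0).mean() / k),
        # Share of real samples whose nearest fake sample is inside their ball.
        "coverage": float((dists.min(axis=1) <= real_radii).mean()),
    }


# Hypothetical 2-D features; real use would pass embeddings of images.
rng = np.random.default_rng(0)
real_features = rng.normal(size=(20, 2))
print(prdc(real_features, real_features.copy(), k=3))
print(prdc(real_features, real_features + 1000.0, k=3))
```

When the generated set matches the real set, precision, recall, and coverage all reach 1; when it lies far away, all four metrics fall to 0.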
tonic_validate

6 199 9.5 Python

Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

Project mention: Validating the RAG Performance of Amazon Titan vs. Cohere Using Amazon Bedrock | news.ycombinator.com | 2024-02-09

I tried out Amazon Bedrock, and used Tonic Validate to do a head-to-head comparison of very simple RAG systems built using the embedding and text models available in Amazon Bedrock. I compared Amazon Titan's embedding and text models to Cohere's embedding and text models in RAG systems that employ Amazon Bedrock Knowledge Bases as the vector db and retrieval components of the system.
The code for the comparison is in this jupyter notebook https://github.com/TonicAI/tonic_validate/blob/main/examples...
Let me know what you think, and your experiences building RAG with Amazon Bedrock!

precision-recall-distributions

1 95 0.0 Python

Assessing Generative Models via Precision and Recall (official repository)
ctc-gen-eval

3 93 1.3 Python

EMNLP 2021 - CTC: A Unified Framework for Evaluating Natural Language Generation
tvallogging

1 6 5.9 Python

A tool for evaluating and tracking your RAG experiments. This repo contains the Python SDK for logging to Tonic Validate.

Project mention: Show HN: Tonic Validate Logging – an open-sourced SDK and convenient UI | news.ycombinator.com | 2023-10-31

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python evaluation-metrics related posts

Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps
1 project | news.ycombinator.com | 8 Mar 2024
Show HN: Ellipsis – Automatic pull request reviews
5 projects | news.ycombinator.com | 27 Feb 2024
Validating the RAG Performance of Amazon Titan vs. Cohere Using Amazon Bedrock
1 project | news.ycombinator.com | 9 Feb 2024
Tonic.ai and LlamaIndex join forces to help developers build RAG systems
1 project | news.ycombinator.com | 19 Jan 2024
Evaluating Rag Parameters Using Tvalmetrics
1 project | news.ycombinator.com | 1 Nov 2023
Show HN: Tonic Validate Logging – an open-sourced SDK and convenient UI
3 projects | news.ycombinator.com | 31 Oct 2023
Using VAE for image compression
1 project | /r/StableDiffusion | 4 May 2023

Index

What are some of the best open-source evaluation-metric projects in Python? This list will help you:

	Project	Stars
1	OCTIS	681
2	image-similarity-measures	516
3	agentops	474
4	COMET	400
5	ranx	344
6	continuous-eval	311
7	generative-evaluation-prdc	234
8	tonic_validate	199
9	precision-recall-distributions	95
10	ctc-gen-eval	93
11	tvallogging	6