Top 23 Evaluation Open-Source Projects
-
write-you-a-haskell
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
-
opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
-
uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root-cause analysis on failure cases, and offer insights on how to resolve them.
-
LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
-
alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
-
semantic-kitti-api
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
-
ExpressionEvaluator
A Simple Math and Pseudo C# Expression Evaluator in One C# File. Can also execute small C#-like scripts.
-
long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
-
Eval-Expression.NET
C# Eval Expression | Evaluate, Compile, and Execute C# code and expression at runtime.
-
errant
ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
I highly recommend https://github.com/sdiehl/write-you-a-haskell as it is very developer friendly. It's not complete, but it really gets the gears turning and will set you up for writing your own Hindley-Milner-style type checker.
Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13
Fair question.
Evaluation refers to the phase after training that checks whether the training went well.
Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small domain-specific subset)!
So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries may be similar; they all try to evaluate every given query. That's where this project might come in handy.
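The core observation above — that many evaluation queries are near-duplicates, so the expensive model call only needs to run once per distinct query — can be sketched in a few lines. This is a hypothetical illustration, not code from any of the projects listed; the names `CachedEvaluator` and `normalize` are made up, and a real system would use a smarter similarity measure than whitespace/case normalization.

```python
def normalize(query: str) -> str:
    """Collapse trivial differences (case, extra whitespace) so
    near-duplicate queries share a single cache entry."""
    return " ".join(query.lower().split())

class CachedEvaluator:
    """Wraps an expensive model call and skips queries already seen."""

    def __init__(self, model_fn):
        self.model_fn = model_fn  # the slow LLM call (stubbed below)
        self.cache = {}
        self.calls = 0            # how many real model calls were made

    def evaluate(self, query: str):
        key = normalize(query)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.model_fn(query)
        return self.cache[key]

# Stub standing in for a slow LLM scoring call.
evaluator = CachedEvaluator(lambda q: len(q))

queries = ["What is 2+2?", "what is 2+2?", "  What is 2+2? ", "Name a prime."]
scores = [evaluator.evaluate(q) for q in queries]
print(evaluator.calls)  # only 2 distinct queries actually hit the model
```

Four queries produce only two model calls; with realistic benchmark suites, where prompts are templated and heavily repetitive, the savings can be much larger.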
Currently seeking feedback on the tool. Would love it if you could check it out at: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb
Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07
Project mention: A Survey on Evaluation of Large Language Models | news.ycombinator.com | 2023-07-18
Project mention: Sapling: A highly experimental vi-inspired editor where you edit code, not text | news.ycombinator.com | 2024-02-04
Alpaca Eval is open source and was developed by the same team that trained the Alpaca model, AFAIK. It is not like what you said in the other comment.
Project mention: An Open Source Tool for Multimodal Fact Verification | news.ycombinator.com | 2024-04-06
Isn't this similar to the DeepMind paper on long-form factuality posted a few days ago?
https://arxiv.org/abs/2403.18802
https://github.com/google-deepmind/long-form-factuality/tree...
Project mention: Given the rise of LLMs, is a toolkit like ERRANT still relevant? | /r/LanguageTechnology | 2023-12-10
ERRANT automatically annotates parallel English sentences with error type information.
Ranx is a great library for mixing results from different sources.
Evaluation related posts
-
An Open Source Tool for Multimodal Fact Verification
-
Show HN: Times faster LLM evaluation with Bayesian optimization
-
Given the rise of LLMs, is a toolkit like ERRANT still relevant?
-
UltraLM-13B reaches top of AlpacaEval leaderboard
-
[P] AlpacaEval : An Automatic Evaluator for Instruction-following Language Models
-
evalidate - Safe evaluation of untrusted user-supplied python expression
-
@initminal/run - Safe & fast code eval in the browser with modern ESM features, dynamic module injection and more...
-
Index
What are some of the best open-source Evaluation projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | awesome-semantic-segmentation | 10,220 |
2 | govaluate | 3,542 |
3 | write-you-a-haskell | 3,304 |
4 | klipse | 3,088 |
5 | opencompass | 2,559 |
6 | promptbench | 2,061 |
7 | uptrain | 1,999 |
8 | evaluate | 1,819 |
9 | EvalAI | 1,683 |
10 | avalanche | 1,666 |
11 | pycm | 1,429 |
12 | LLM-eval-survey | 1,229 |
13 | lispy | 1,184 |
14 | alpaca_eval | 1,103 |
15 | torch-fidelity | 872 |
16 | semantic-kitti-api | 723 |
17 | gval | 698 |
18 | ExpressionEvaluator | 562 |
19 | long-form-factuality | 435 |
20 | Eval-Expression.NET | 428 |
21 | simpleeval | 423 |
22 | errant | 410 |
23 | ranx | 344 |