Top 23 Evaluation Open-Source Projects
write-you-a-haskell
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
I highly recommend https://github.com/sdiehl/write-you-a-haskell as it is very developer-friendly. It's not complete, but it really gets the gears turning and will set you up for writing your own Hindley-Milner style type checker.
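To give a flavor of what that entails, here is a minimal sketch of the unification step at the heart of a Hindley-Milner checker, written in Python for brevity (illustrative only, not code from the tutorial, which is in Haskell):

```python
# Minimal sketch of Hindley-Milner style unification (illustrative only).
# Types are tuples: ("var", name), ("con", name), or ("fun", arg_t, ret_t).

def apply(t, subst):
    """Walk a type, replacing bound type variables with their bindings."""
    if t[0] == "var" and t[1] in subst:
        return apply(subst[t[1]], subst)
    if t[0] == "fun":
        return ("fun", apply(t[1], subst), apply(t[2], subst))
    return t

def occurs(name, t, subst):
    """Occurs check: does variable `name` appear inside type `t`?"""
    t = apply(t, subst)
    if t[0] == "var":
        return t[1] == name
    if t[0] == "fun":
        return occurs(name, t[1], subst) or occurs(name, t[2], subst)
    return False

def bind(name, t, subst):
    if t[0] == "var" and t[1] == name:
        return subst
    if occurs(name, t, subst):
        raise TypeError("infinite type")
    return {**subst, name: t}

def unify(t1, t2, subst):
    """Return a substitution making t1 and t2 equal, or raise TypeError."""
    t1, t2 = apply(t1, subst), apply(t2, subst)
    if t1 == t2:
        return subst
    if t1[0] == "var":
        return bind(t1[1], t2, subst)
    if t2[0] == "var":
        return bind(t2[1], t1, subst)
    if t1[0] == "fun" and t2[0] == "fun":
        subst = unify(t1[1], t2[1], subst)   # unify argument types
        return unify(t1[2], t2[2], subst)    # then result types
    raise TypeError(f"cannot unify {t1} with {t2}")

# Applying `id : a -> a` to an Int: unify (a -> a) with (Int -> b)
s = unify(("fun", ("var", "a"), ("var", "a")),
          ("fun", ("con", "Int"), ("var", "b")), {})
print(s)  # {'a': ('con', 'Int'), 'b': ('con', 'Int')}
```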
opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13
Fair question.
Evaluation refers to the phase after training that checks whether the trained model is any good.
Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small domain-specific subset)!
So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation; however, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries may be similar, and all of them evaluate on every given query. That's where this project might come in handy.
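To make the idea concrete, here is a hypothetical sketch (every name below is illustrative, not this project's actual API) of evaluating only one representative per group of near-duplicate queries:

```python
# Illustrative sketch: exploit similarity between evaluation queries so the
# slow, expensive model is called once per group of near-duplicates.
# `run_model` and `score` are hypothetical stand-ins for your LLM and metric.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap text-overlap similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_queries(queries, threshold=0.85):
    """Greedily group queries whose overlap with a cluster seed exceeds threshold."""
    clusters = []
    for q in queries:
        for cluster in clusters:
            if similarity(q, cluster[0]) >= threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters

def evaluate_with_dedup(queries, run_model, score):
    """Call the model once per cluster and reuse the score for its members."""
    scores = {}
    for cluster in cluster_queries(queries):
        s = score(run_model(cluster[0]))  # one slow LLM call per cluster
        for q in cluster:
            scores[q] = s                 # extrapolated to near-duplicates
    return scores
```

The project itself uses Bayesian optimization rather than greedy text clustering to choose which queries to evaluate, but the saving comes from the same observation.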
uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root-cause analysis on failure cases, and gives insights on how to resolve them.
Currently seeking feedback for the tool. Would love it if you could check it out at: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb
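A minimal usage sketch, adapted from the UpTrain README (signatures may differ across versions, and the API key is a placeholder):

```python
# Adapted from the UpTrain README; check the docs for current signatures.
from uptrain import EvalLLM, Evals

data = [{
    "question": "Which is the most popular global sport?",
    "context": "Football (soccer) is played by an estimated 250 million players worldwide.",
    "response": "Football is the most popular sport in the world.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_COMPLETENESS],
)
print(results)  # per-row grades for each configured check
```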
pycm
PyCM is a multi-class confusion matrix library written in Python for post-classification model evaluation.
Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07
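A minimal example of the core API (per the PyCM docs; output details vary by version):

```python
from pycm import ConfusionMatrix

# Actual vs. predicted labels for a 3-class problem
cm = ConfusionMatrix(actual_vector=[2, 0, 2, 2, 0, 1],
                     predict_vector=[0, 0, 2, 2, 0, 2])
print(cm.classes)      # [0, 1, 2]
print(cm.Overall_ACC)  # overall accuracy (4/6 here)
print(cm)              # full matrix plus per-class statistics
```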
LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Project mention: A Survey on Evaluation of Large Language Models | news.ycombinator.com | 2023-07-18
lispy
Project mention: Sapling: A highly experimental vi-inspired editor where you edit code, not text | news.ycombinator.com | 2024-02-04
alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Alpaca Eval is open source and was developed by the same team who trained the Alpaca model, afaik. It is not like what you said in the other comment.
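AlpacaEval is primarily driven from the command line (`alpaca_eval --model_outputs outputs.json`, per the README); the rough Python sketch below follows the same flow, with the exact signature of `evaluate` being an assumption that may differ by version:

```python
# Hedged sketch: follows the shape of the AlpacaEval README; the exact
# signature of `evaluate` is an assumption and may differ by version.
from alpaca_eval import evaluate

# outputs.json: a list of {"instruction": ..., "output": ...} records
df_leaderboard, annotations = evaluate(
    model_outputs="outputs.json",          # your model's generations
    annotators_config="alpaca_eval_gpt4",  # GPT-4 as the automatic judge
)
print(df_leaderboard)                      # win rate vs. the reference outputs
```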
semantic-kitti-api
SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
ExpressionEvaluator
A simple math and pseudo-C# expression evaluator in a single C# file. Can also execute small C#-like scripts.
long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
Project mention: An Open Source Tool for Multimodal Fact Verification | news.ycombinator.com | 2024-04-06
Isn't this similar to the DeepMind paper on long-form factuality posted a few days ago?
https://arxiv.org/abs/2403.18802
https://github.com/google-deepmind/long-form-factuality/tree...
Eval-Expression.NET
C# Eval Expression | Evaluate, compile, and execute C# code and expressions at runtime.
errant
ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
Project mention: Given the rise of LLMs, is a toolkit like ERRANT still relevant? | /r/LanguageTechnology | 2023-12-10
ERRANT automatically annotates parallel English sentences with error type information.
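The core API, per the ERRANT README (requires spaCy and an installed English model):

```python
import errant

annotator = errant.load('en')  # loads spaCy; needs an English model installed
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')

for e in annotator.annotate(orig, cor):
    # Each edit carries the original span, the correction, and an error type
    print(e.o_str, '->', e.c_str, '|', e.type)  # e.g. are -> is | R:VERB:SVA
```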
ranx
A blazing-fast Python library for ranking evaluation, comparison, and fusion.
Ranx is a great library for mixing results from different sources.
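A small fusion example in the style of the ranx README (method names and metrics follow its docs; exact defaults may vary by version):

```python
# In the style of the ranx README; exact defaults may vary by version.
from ranx import Qrels, Run, evaluate, fuse

qrels = Qrels({"q_1": {"doc_a": 1, "doc_b": 2}})                  # relevance judgements
bm25 = Run({"q_1": {"doc_a": 0.8, "doc_b": 0.4}}, name="bm25")    # lexical retriever
dense = Run({"q_1": {"doc_a": 0.3, "doc_b": 0.9}}, name="dense")  # neural retriever

combined = fuse(runs=[bm25, dense], method="rrf")  # reciprocal rank fusion
print(evaluate(qrels, combined, ["ndcg@10", "map@10"]))
```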
Evaluation-related posts
- An Open Source Tool for Multimodal Fact Verification
- Show HN: Times faster LLM evaluation with Bayesian optimization
- Given the rise of LLMs, is a toolkit like ERRANT still relevant?
- UltraLM-13B reaches top of AlpacaEval leaderboard
- [P] AlpacaEval : An Automatic Evaluator for Instruction-following Language Models
- evalidate - Safe evaluation of untrusted user-supplied python expression
- @initminal/run - Safe & fast code eval in the browser with modern ESM features, dynamic module injection and more...
Index
What are some of the best open-source Evaluation projects? This list will help you:
| # | Project | Stars |
|---|---------|-------|
| 1 | awesome-semantic-segmentation | 10,220 |
| 2 | govaluate | 3,529 |
| 3 | write-you-a-haskell | 3,304 |
| 4 | klipse | 3,088 |
| 5 | opencompass | 2,403 |
| 6 | promptbench | 1,954 |
| 7 | uptrain | 1,951 |
| 8 | evaluate | 1,803 |
| 9 | EvalAI | 1,673 |
| 10 | avalanche | 1,654 |
| 11 | pycm | 1,428 |
| 12 | LLM-eval-survey | 1,206 |
| 13 | lispy | 1,183 |
| 14 | alpaca_eval | 1,058 |
| 15 | torch-fidelity | 870 |
| 16 | semantic-kitti-api | 722 |
| 17 | gval | 696 |
| 18 | ExpressionEvaluator | 562 |
| 19 | long-form-factuality | 428 |
| 20 | Eval-Expression.NET | 423 |
| 21 | simpleeval | 420 |
| 22 | errant | 410 |
| 23 | ranx | 325 |