Python Benchmark

Open-source Python projects categorized as Benchmark

Top 23 Python Benchmark Projects

  • fashion-mnist

    A MNIST-like fashion product database. Benchmark :point_down:

    Project mention: Logistic Regression for Image Classification Using OpenCV | news.ycombinator.com | 2023-12-31

    In this case there's no advantage to using logistic regression on an image other than the novelty. Logistic regression is excellent for feature explainability, but you can't explain anything from raw image pixels.

    Traditional (non-deep-learning) classification algorithms such as SVMs and Random Forests perform a lot better on MNIST, up to 97% accuracy compared to the 88% from logistic regression in this post. Check the original MNIST benchmarks here: http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#
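
    A minimal sketch of that comparison with scikit-learn (a sketch only: the OpenML dataset name, the hyperparameters, and the subsample size are assumptions made to keep the example short and fast):

    ```python
    # Hedged sketch: compare logistic regression against two classical
    # classifiers on Fashion-MNIST, assuming the dataset is published on
    # OpenML under the name "Fashion-MNIST".
    from sklearn.datasets import fetch_openml
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0  # scale pixel values to [0, 1]

    # Subsample so the RBF-kernel SVM finishes quickly.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=10_000, test_size=2_000, random_state=0
    )

    for name, clf in [
        ("logistic regression", LogisticRegression(max_iter=1000)),
        ("rbf svm", SVC(kernel="rbf", C=10, gamma="scale")),
        ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ]:
        clf.fit(X_tr, y_tr)
        print(name, clf.score(X_te, y_te))
    ```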

  • tianshou

    An elegant PyTorch deep reinforcement learning library.

    Project mention: Is it better to not use the Target Update Frequency in Double DQN or depends on the application? | /r/reinforcementlearning | 2023-07-05

    The tianshou implementation I found at https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/dqn.py is DQN by default.
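
    For context, this is roughly where the target-network interval shows up when constructing that policy. A hedged sketch only: the parameter names follow my reading of the linked code and may differ between tianshou releases.

    ```python
    # Hedged sketch, not a verified recipe: tianshou's DQNPolicy takes the
    # target-network sync interval as a constructor argument.
    import torch
    from tianshou.policy import DQNPolicy
    from tianshou.utils.net.common import Net

    state_shape, action_shape = (4,), 2  # e.g. CartPole-v1 sizes
    net = Net(state_shape, action_shape, hidden_sizes=[128, 128])
    optim = torch.optim.Adam(net.parameters(), lr=1e-3)

    policy = DQNPolicy(
        model=net,
        optim=optim,
        discount_factor=0.99,
        estimation_step=3,        # n-step return
        target_update_freq=320,   # sync the target network every 320 updates;
                                  # 0 disables the separate target network
        is_double=True,           # Double DQN target computation
    )
    ```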

  • mmpose

    OpenMMLab Pose Estimation Toolbox and Benchmark.

  • ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python

    Project mention: ANN Benchmarks | news.ycombinator.com | 2024-01-25
  • Baichuan2

    A series of large language models developed by Baichuan Intelligent Technology

    Project mention: Baichuan 2 | news.ycombinator.com | 2023-10-12
  • mmaction2

    OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark

  • Baichuan-13B

    A 13B large language model developed by Baichuan Intelligent Technology

    Project mention: Baichuan IA de China | /r/techieHugui | 2023-07-22
  • opencompass

    OpenCompass is an LLM evaluation platform, supporting a wide range of models (InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13

    Fair question.

    Evaluation refers to the phase after training where you check whether the training actually worked.

    Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small domain-specific subset)!

    So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation; however, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries can be similar; they all evaluate every given query. And that's where this project might come in handy.
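
    As a purely illustrative toy (not how opencompass or the frameworks linked above actually work), the kind of query deduplication hinted at in that last sentence looks like this:

    ```python
    # Toy sketch: deduplicate near-identical evaluation prompts before calling
    # the model, so repeated queries cost one call instead of many.
    from functools import lru_cache

    def call_llm(prompt: str) -> str:
        # Stand-in for a real model call (an API client, vLLM, an HF pipeline, ...).
        return f"answer to: {prompt}"

    def normalize(query: str) -> str:
        # Crude similarity proxy: lowercase and collapse whitespace. A real
        # system might instead cluster queries by embedding distance.
        return " ".join(query.lower().split())

    @lru_cache(maxsize=None)
    def _cached_answer(normalized_query: str) -> str:
        return call_llm(normalized_query)

    def evaluate(queries: list[str]) -> list[str]:
        return [_cached_answer(normalize(q)) for q in queries]

    print(evaluate(["What is 2+2?", "what is  2+2?", "Capital of France?"]))
    # Only two model calls are made for the three queries.
    ```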

  • promptbench

    A unified evaluation framework for large language models

    Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13

  • logparser

    A machine learning toolkit for log parsing [ICSE'19, DSN'16]

    Project mention: Log2row: A tool that detects, extracts templates, and structures logs | news.ycombinator.com | 2023-10-06

    You use GPT-4 to extract log patterns; does it really need an LLM? There are more traditional approaches, such as https://github.com/logpai/logparser
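
    For a sense of what the traditional approach does, here is a deliberately simplified sketch of log template extraction by masking variable fields. It illustrates the general idea behind tools like Drain in that toolkit; it is not the library's own API.

    ```python
    # Simplified illustration of regex-based log template mining: mask variable
    # fields, then group lines by the remaining template.
    import re
    from collections import Counter

    MASKS = [
        (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
        (re.compile(r"\b\d+\b"), "<NUM>"),
    ]

    def template_of(line: str) -> str:
        for pattern, token in MASKS:
            line = pattern.sub(token, line)
        return line.strip()

    logs = [
        "Connection from 10.0.0.1 port 22",
        "Connection from 10.0.0.7 port 2222",
        "Disk usage at 91 percent",
    ]
    print(Counter(template_of(line) for line in logs))
    # Counter({'Connection from <IP> port <NUM>': 2, 'Disk usage at <NUM> percent': 1})
    ```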

  • beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

    Project mention: On building a semantic search engine | news.ycombinator.com | 2024-01-06

    The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
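
    A sketch of evaluating a dense retriever on one BEIR dataset, following the pattern in the project's README as I recall it (module paths, the download URL template, and the model name may have drifted in newer releases):

    ```python
    # Hedged sketch of a BEIR evaluation run; names follow the README pattern
    # and may differ between versions.
    from beir import util
    from beir.datasets.data_loader import GenericDataLoader
    from beir.retrieval import models
    from beir.retrieval.evaluation import EvaluateRetrieval
    from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
    data_path = util.download_and_unzip(url, "datasets")
    corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

    retriever = EvaluateRetrieval(
        DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16),
        score_function="dot",
    )
    results = retriever.retrieve(corpus, queries)
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    print(ndcg)
    ```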

  • py-motmetrics

    :bar_chart: Benchmark multiple object trackers (MOT) in Python

  • mteb

    MTEB: Massive Text Embedding Benchmark

    Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

    RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:

    - Chunking can interfere with context boundaries

    - Content vectors can differ vastly from question vectors; to handle this you have to use hypothetical embeddings (generate artificial questions and store them; see the sketch after this comment)

    - Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical question embeddings, metadata)

    - RAG will fail miserably with requests like "summarize the whole document"

    - To my knowledge, OpenAI embeddings don't perform well; use an embedding model that is optimized for question answering or information retrieval and supports multiple languages. Also look into Instructor embeddings: https://github.com/embeddings-benchmark/mteb

    1 https://github.com/underlines/awesome-marketing-datascience/...
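
    A minimal sketch of the hypothetical-question-embedding idea from the comment above, using sentence-transformers. generate_questions() stands in for an LLM call, and the model name is only an example:

    ```python
    # Illustrative sketch: index each chunk under questions it could answer as
    # well as under its own text, then retrieve by cosine similarity.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def generate_questions(chunk: str) -> list[str]:
        # Placeholder: in practice an LLM writes a few questions this chunk answers.
        return [f"What is described here: {chunk[:40]}?"]

    chunks = [
        "The billing service retries failed charges three times before alerting.",
        "Uploads larger than 5 GB must use the multipart endpoint.",
    ]

    index = []  # (embedding, source chunk) pairs
    for chunk in chunks:
        for text in (chunk, *generate_questions(chunk)):
            index.append((model.encode(text, normalize_embeddings=True), chunk))

    def retrieve(query: str, k: int = 1) -> list[str]:
        q = model.encode(query, normalize_embeddings=True)
        ranked = sorted(index, key=lambda item: -float(np.dot(item[0], q)))
        return [chunk for _, chunk in ranked[:k]]

    print(retrieve("How big can a single upload be?"))
    ```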

  • pytest-benchmark

    py.test fixture for benchmarking code

    Project mention: Pinpoint performance regressions with CI-Integrated differential profiling | dev.to | 2023-10-23

    pytest-benchmark
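
    For reference, the fixture is used roughly like this (the test and the function being timed are illustrative):

    ```python
    # Minimal pytest-benchmark usage: the `benchmark` fixture calls the function
    # repeatedly, records timings, and returns the function's result.
    def fib(n: int) -> int:
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    def test_fib_20(benchmark):
        result = benchmark(fib, 20)
        assert result == 6765
    ```

    Running pytest with the plugin installed then prints a timing table (min/mean/stddev) per benchmark.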

  • tapnet

    Tracking Any Point (TAP)

    Project mention: Meta AI releases CoTracker, a model for tracking any points (pixels) on a video | news.ycombinator.com | 2023-08-29

    Neat. It's mentioned on Facebook's page, but here is Google's version of point tracking: https://deepmind-tapir.github.io

  • smac

    SMAC: The StarCraft Multi-Agent Challenge

  • InternVideo

    Video Foundation Models & Data for Multimodal Understanding

    Project mention: [Demo] Watch Videos with ChatGPT | /r/ChatGPT | 2023-04-19

    Thanks for your interest! If you have any ideas to make the given demo more user-friendly, please do not hesitate to share them with us. We are open to discussing relevant ideas about video foundation models or other topics. We have made some progress in these areas (InternVideo, VideoMAE v2, UMT, and more). We believe that user-level intelligent video understanding is on the horizon with current LLMs, computing power, and video data.

  • Monocular-Depth-Estimation-Toolbox

    Monocular Depth Estimation Toolbox based on MMSegmentation.

  • asv

    Airspeed Velocity: A simple Python benchmarking tool with web-based reporting

    Project mention: git-appraise – Distributed Code Review for Git | news.ycombinator.com | 2023-08-10

    > All these workflows are a derivation of the source in the repository and keeping them close together has a great aesthetic.

    I agree. Version control is a great enabler, so using it to track "sources" other than just code can be useful. A couple of tools I like to use:

    - Artemis, for tracking issues http://www.chriswarbo.net/blog/2017-06-14-artemis.html

    - ASV, for tracking benchmark results https://github.com/airspeed-velocity/asv (I use this for non-Python projects via my asv-nix plugin http://www.chriswarbo.net/projects/nixos/asv_benchmarking.ht... )
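
    For reference, an asv benchmark is roughly a Python file of time_/mem_-prefixed functions or methods that asv runs across commits (the file and suite names here are illustrative; asv quickstart generates the accompanying config):

    ```python
    # benchmarks/bench_text.py -- an airspeed velocity benchmark file.
    # asv discovers attributes prefixed with time_, mem_, peakmem_, etc.,
    # runs them per commit, and renders the history in its web report.

    class JoinSuite:
        def setup(self):
            # Called before each benchmark in this suite.
            self.parts = ["x"] * 10_000

        def time_str_join(self):
            "".join(self.parts)

        def peakmem_doubled_parts(self):
            [p * 2 for p in self.parts]
    ```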

  • evalplus

    EvalPlus for rigorous evaluation of LLM-synthesized code

    Project mention: The AI Reproducibility Crisis in GPT-3.5/GPT-4 Research | news.ycombinator.com | 2023-08-25

    *Further Reading*:

    - [GPT-4's decline over time (HackerNews)](https://news.ycombinator.com/item?id=36786407)

    - [GPT-4 downgrade discussions (OpenAI Forums)](https://community.openai.com/t/gpt-4-has-been-severely-downg...)

    - [Behavioral changes in ChatGPT (arXiv)](https://arxiv.org/abs/2307.09009)

    - [Zero-Shot Replication Effort (Github)](https://github.com/emrgnt-cmplxty/zero-shot-replication)

    - [Inconsistencies in GPT-4 HumanEval (Github)](https://github.com/evalplus/evalplus/issues/15)

    - [Early experiments with GPT-4 (arXiv)](https://arxiv.org/abs/2303.12712)

    - [GPT-4 Technical Report (arXiv)](https://arxiv.org/abs/2303.08774)

  • pyperformance

    Python Performance Benchmark Suite

  • benchmark

    TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. (by pytorch)

    Project mention: PyTorch Primitives in WebGPU for the Browser | news.ycombinator.com | 2023-05-19

    >What's a fair benchmark?

    The absolute gold-standard benchmarks are https://github.com/pytorch/benchmark

  • ADBench

    Official implementation of "ADBench: Anomaly Detection Benchmark", NeurIPS 2022.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-13.

Index

What are some of the best open-source Benchmark projects in Python? This list will help you:

Rank  Project  Stars
1 fashion-mnist 11,439
2 tianshou 7,356
3 mmpose 4,937
4 ann-benchmarks 4,547
5 Baichuan2 3,879
6 mmaction2 3,863
7 Baichuan-13B 2,954
8 opencompass 2,336
9 promptbench 1,954
10 logparser 1,420
11 beir 1,357
12 py-motmetrics 1,321
13 mteb 1,314
14 pytest-benchmark 1,187
15 tapnet 1,030
16 smac 992
17 InternVideo 890
18 Monocular-Depth-Estimation-Toolbox 847
19 asv 834
20 evalplus 833
21 pyperformance 815
22 benchmark 774
23 ADBench 767