Python Benchmark

Open-source Python projects categorized as Benchmark

Top 23 Python Benchmark Projects

  • fashion-mnist

    A MNIST-like fashion product database. Benchmark :point_down:

    Project mention: Logistic Regression for Image Classification Using OpenCV | news.ycombinator.com | 2023-12-31

    In this case there's no advantage to using logistic regression on an image other than the novelty. Logistic regression is excellent for feature explainability, but you can't explain anything from raw image pixels.

    Traditional (non-deep-learning) classification algorithms such as SVMs and Random Forests perform a lot better on MNIST, up to 97% accuracy compared to the 88% from logistic regression in this post. Check the original MNIST benchmarks here: http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#
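
    A minimal sketch of that comparison with scikit-learn (a sketch only: the OpenML dataset name, the hyperparameters, and the subsample size are assumptions made to keep the example short and fast):

    ```python
    # Hedged sketch: compare logistic regression against two classical
    # classifiers on Fashion-MNIST, assuming the dataset is published on
    # OpenML under the name "Fashion-MNIST".
    from sklearn.datasets import fetch_openml
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0  # scale pixel values to [0, 1]

    # Subsample so the RBF-kernel SVM finishes quickly.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=10_000, test_size=2_000, random_state=0
    )

    for name, clf in [
        ("logistic regression", LogisticRegression(max_iter=1000)),
        ("rbf svm", SVC(kernel="rbf", C=10, gamma="scale")),
        ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ]:
        clf.fit(X_tr, y_tr)
        print(name, clf.score(X_te, y_te))
    ```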

  • tianshou

    An elegant PyTorch deep reinforcement learning library.

    Project mention: Is it better to not use the Target Update Frequency in Double DQN or depends on the application? | /r/reinforcementlearning | 2023-07-05

    The tianshou implementation I found at https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/dqn.py is DQN by default.
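
    For context, this is roughly where the target-network interval shows up when constructing that policy. A hedged sketch only: the parameter names follow my reading of the linked code and may differ between tianshou releases.

    ```python
    # Hedged sketch, not a verified recipe: tianshou's DQNPolicy takes the
    # target-network sync interval as a constructor argument.
    import torch
    from tianshou.policy import DQNPolicy
    from tianshou.utils.net.common import Net

    state_shape, action_shape = (4,), 2  # e.g. CartPole-v1 sizes
    net = Net(state_shape, action_shape, hidden_sizes=[128, 128])
    optim = torch.optim.Adam(net.parameters(), lr=1e-3)

    policy = DQNPolicy(
        model=net,
        optim=optim,
        discount_factor=0.99,
        estimation_step=3,        # n-step return
        target_update_freq=320,   # sync the target network every 320 updates;
                                  # 0 disables the separate target network
        is_double=True,           # Double DQN target computation
    )
    ```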

  • mmpose

    OpenMMLab Pose Estimation Toolbox and Benchmark.

  • ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python

    Project mention: ANN Benchmarks | news.ycombinator.com | 2024-01-25
  • Baichuan2

    A series of large language models developed by Baichuan Intelligent Technology

    Project mention: Baichuan 2 | news.ycombinator.com | 2023-10-12
  • mmaction2

    OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark

  • Baichuan-13B

    A 13B large language model developed by Baichuan Intelligent Technology

    Project mention: Baichuan IA de China | /r/techieHugui | 2023-07-22
  • opencompass

    OpenCompass is an LLM evaluation platform, supporting a wide range of models (InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13

    Fair question.

    Evaluation refers to the phase after training where you check whether the training actually worked.

    Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small domain-specific subset)!

    So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation; however, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries can be similar; they all evaluate every given query. And that's where this project might come in handy.
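
    As a purely illustrative toy (not how opencompass or the frameworks linked above actually work), the kind of query deduplication hinted at in that last sentence looks like this:

    ```python
    # Toy sketch: deduplicate near-identical evaluation prompts before calling
    # the model, so repeated queries cost one call instead of many.
    from functools import lru_cache

    def call_llm(prompt: str) -> str:
        # Stand-in for a real model call (an API client, vLLM, an HF pipeline, ...).
        return f"answer to: {prompt}"

    def normalize(query: str) -> str:
        # Crude similarity proxy: lowercase and collapse whitespace. A real
        # system might instead cluster queries by embedding distance.
        return " ".join(query.lower().split())

    @lru_cache(maxsize=None)
    def _cached_answer(normalized_query: str) -> str:
        return call_llm(normalized_query)

    def evaluate(queries: list[str]) -> list[str]:
        return [_cached_answer(normalize(q)) for q in queries]

    print(evaluate(["What is 2+2?", "what is  2+2?", "Capital of France?"]))
    # Only two model calls are made for the three queries.
    ```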

  • promptbench

    A unified evaluation framework for large language models

    Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13

  • logparser

    A machine learning toolkit for log parsing [ICSE'19, DSN'16]

    Project mention: Log2row: A tool that detects, extracts templates, and structures logs | news.ycombinator.com | 2023-10-06

    You use GPT-4 to extract log patterns; does it really need an LLM? There are more traditional approaches, such as https://github.com/logpai/logparser
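
    For a sense of what the traditional approach does, here is a deliberately simplified sketch of log template extraction by masking variable fields. It illustrates the general idea behind tools like Drain in that toolkit; it is not the library's own API.

    ```python
    # Simplified illustration of regex-based log template mining: mask variable
    # fields, then group lines by the remaining template.
    import re
    from collections import Counter

    MASKS = [
        (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
        (re.compile(r"\b\d+\b"), "<NUM>"),
    ]

    def template_of(line: str) -> str:
        for pattern, token in MASKS:
            line = pattern.sub(token, line)
        return line.strip()

    logs = [
        "Connection from 10.0.0.1 port 22",
        "Connection from 10.0.0.7 port 2222",
        "Disk usage at 91 percent",
    ]
    print(Counter(template_of(line) for line in logs))
    # Counter({'Connection from <IP> port <NUM>': 2, 'Disk usage at <NUM> percent': 1})
    ```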

  • beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

    Project mention: On building a semantic search engine | news.ycombinator.com | 2024-01-06

    The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
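
    A sketch of evaluating a dense retriever on one BEIR dataset, following the pattern in the project's README as I recall it (module paths, the download URL template, and the model name may have drifted in newer releases):

    ```python
    # Hedged sketch of a BEIR evaluation run; names follow the README pattern
    # and may differ between versions.
    from beir import util
    from beir.datasets.data_loader import GenericDataLoader
    from beir.retrieval import models
    from beir.retrieval.evaluation import EvaluateRetrieval
    from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
    data_path = util.download_and_unzip(url, "datasets")
    corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

    retriever = EvaluateRetrieval(
        DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16),
        score_function="dot",
    )
    results = retriever.retrieve(corpus, queries)
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    print(ndcg)
    ```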

  • py-motmetrics

    :bar_chart: Benchmark multiple object trackers (MOT) in Python

  • mteb

    MTEB: Massive Text Embedding Benchmark

    Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

    RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:

    - Chunking can interfere with context boundaries

    - Content vectors can differ vastly from question vectors; to handle this you have to use hypothetical embeddings (generate artificial questions and store them; see the sketch after this comment)

    - Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical question embeddings, metadata)

    - RAG will fail miserably with requests like "summarize the whole document"

    - To my knowledge, OpenAI embeddings don't perform well; use an embedding model that is optimized for question answering or information retrieval and supports multiple languages. Also look into Instructor embeddings: https://github.com/embeddings-benchmark/mteb

    1 https://github.com/underlines/awesome-marketing-datascience/...
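
    A minimal sketch of the hypothetical-question-embedding idea from the comment above, using sentence-transformers. generate_questions() stands in for an LLM call, and the model name is only an example:

    ```python
    # Illustrative sketch: index each chunk under questions it could answer as
    # well as under its own text, then retrieve by cosine similarity.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def generate_questions(chunk: str) -> list[str]:
        # Placeholder: in practice an LLM writes a few questions this chunk answers.
        return [f"What is described here: {chunk[:40]}?"]

    chunks = [
        "The billing service retries failed charges three times before alerting.",
        "Uploads larger than 5 GB must use the multipart endpoint.",
    ]

    index = []  # (embedding, source chunk) pairs
    for chunk in chunks:
        for text in (chunk, *generate_questions(chunk)):
            index.append((model.encode(text, normalize_embeddings=True), chunk))

    def retrieve(query: str, k: int = 1) -> list[str]:
        q = model.encode(query, normalize_embeddings=True)
        ranked = sorted(index, key=lambda item: -float(np.dot(item[0], q)))
        return [chunk for _, chunk in ranked[:k]]

    print(retrieve("How big can a single upload be?"))
    ```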

  • pytest-benchmark

    py.test fixture for benchmarking code

    Project mention: Pinpoint performance regressions with CI-Integrated differential profiling | dev.to | 2023-10-23

    pytest-benchmark
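
    For reference, the fixture is used roughly like this (the test and the function being timed are illustrative):

    ```python
    # Minimal pytest-benchmark usage: the `benchmark` fixture calls the function
    # repeatedly, records timings, and returns the function's result.
    def fib(n: int) -> int:
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    def test_fib_20(benchmark):
        result = benchmark(fib, 20)
        assert result == 6765
    ```

    Running pytest with the plugin installed then prints a timing table (min/mean/stddev) per benchmark.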

  • tapnet

    Tracking Any Point (TAP)

    Project mention: Meta AI releases CoTracker, a model for tracking any points (pixels) on a video | news.ycombinator.com | 2023-08-29

    Neat. It's mentioned on Facebook's page, but here is Google's version of point tracking: https://deepmind-tapir.github.io

  • smac

    SMAC: The StarCraft Multi-Agent Challenge

  • InternVideo

    Video Foundation Models & Data for Multimodal Understanding

    Project mention: [Demo] Watch Videos with ChatGPT | /r/ChatGPT | 2023-04-19

    Thanks for your interest! If you have any ideas to make the given demo more user-friendly, please do not hesitate to share them with us. We are open to discussing relevant ideas about video foundation models or other topics. We have made some progress in these areas (InternVideo, VideoMAE v2, UMT, and more). We believe that user-level intelligent video understanding is on the horizon with current LLMs, computing power, and video data.

  • Monocular-Depth-Estimation-Toolbox

    Monocular Depth Estimation Toolbox based on MMSegmentation.

  • asv

    Airspeed Velocity: A simple Python benchmarking tool with web-based reporting

    Project mention: git-appraise – Distributed Code Review for Git | news.ycombinator.com | 2023-08-10

    > All these workflows are a derivation of the source in the repository and keeping them close together has a great aesthetic.

    I agree. Version control is a great enabler, so using it to track "sources" other than just code can be useful. A couple of tools I like to use:

    - Artemis, for tracking issues http://www.chriswarbo.net/blog/2017-06-14-artemis.html

    - ASV, for tracking benchmark results https://github.com/airspeed-velocity/asv (I use this for non-Python projects via my asv-nix plugin http://www.chriswarbo.net/projects/nixos/asv_benchmarking.ht... )
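
    For reference, an asv benchmark is roughly a Python file of time_/mem_-prefixed functions or methods that asv runs across commits (the file and suite names here are illustrative; asv quickstart generates the accompanying config):

    ```python
    # benchmarks/bench_text.py -- an airspeed velocity benchmark file.
    # asv discovers attributes prefixed with time_, mem_, peakmem_, etc.,
    # runs them per commit, and renders the history in its web report.

    class JoinSuite:
        def setup(self):
            # Called before each benchmark in this suite.
            self.parts = ["x"] * 10_000

        def time_str_join(self):
            "".join(self.parts)

        def peakmem_doubled_parts(self):
            [p * 2 for p in self.parts]
    ```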

  • evalplus

    EvalPlus for rigorous evaluation of LLM-synthesized code

    Project mention: The AI Reproducibility Crisis in GPT-3.5/GPT-4 Research | news.ycombinator.com | 2023-08-25

    *Further Reading*:

    - [GPT-4's decline over time (HackerNews)](https://news.ycombinator.com/item?id=36786407)

    - [GPT-4 downgrade discussions (OpenAI Forums)](https://community.openai.com/t/gpt-4-has-been-severely-downg...)

    - [Behavioral changes in ChatGPT (arXiv)](https://arxiv.org/abs/2307.09009)

    - [Zero-Shot Replication Effort (Github)](https://github.com/emrgnt-cmplxty/zero-shot-replication)

    - [Inconsistencies in GPT-4 HumanEval (Github)](https://github.com/evalplus/evalplus/issues/15)

    - [Early experiments with GPT-4 (arXiv)](https://arxiv.org/abs/2303.12712)

    - [GPT-4 Technical Report (arXiv)](https://arxiv.org/abs/2303.08774)

  • pyperformance

    Python Performance Benchmark Suite

  • benchmark

    TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. (by pytorch)

    Project mention: PyTorch Primitives in WebGPU for the Browser | news.ycombinator.com | 2023-05-19

    >What's a fair benchmark?

    The absolute gold-standard benchmarks are https://github.com/pytorch/benchmark

  • ADBench

    Official implementation of "ADBench: Anomaly Detection Benchmark", NeurIPS 2022.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-13.

Index

What are some of the best open-source Benchmark projects in Python? This list will help you:

Rank  Project  Stars
1 fashion-mnist 11,439
2 tianshou 7,356
3 mmpose 4,937
4 ann-benchmarks 4,547
5 Baichuan2 3,879
6 mmaction2 3,863
7 Baichuan-13B 2,954
8 opencompass 2,336
9 promptbench 1,954
10 logparser 1,420
11 beir 1,357
12 py-motmetrics 1,321
13 mteb 1,314
14 pytest-benchmark 1,187
15 tapnet 1,030
16 smac 992
17 InternVideo 890
18 Monocular-Depth-Estimation-Toolbox 847
19 asv 834
20 evalplus 833
21 pyperformance 815
22 benchmark 774
23 ADBench 767