Top 23 Python Benchmark Projects
-
fashion-mnist
Project mention: Logistic Regression for Image Classification Using OpenCV | news.ycombinator.com | 2023-12-31
In this case there's no advantage to using logistic regression on an image other than the novelty. Logistic regression is excellent for feature explainability, but you can't explain anything from an image.
Traditional classification algorithms that aren't deep learning, such as SVMs and Random Forests, perform a lot better on MNIST: up to 97% accuracy, compared to the 88% from logistic regression in this post. Check the original MNIST benchmarks here: http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#
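The gap the comment describes is easy to reproduce with scikit-learn. The sketch below uses the small built-in 8x8 digits dataset as a stand-in for MNIST (an assumption made for brevity), so the absolute numbers will differ, but the SVM-vs-logistic-regression ordering typically holds:

```python
# Rough comparison of logistic regression vs. an SVM on image classification.
# Uses scikit-learn's built-in 8x8 digits dataset as a small MNIST stand-in.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

for name, clf in [
    ("logistic regression", LogisticRegression(max_iter=5000)),
    ("SVM (RBF kernel)", SVC(kernel="rbf", gamma="scale")),
]:
    clf.fit(X_train, y_train)
    print(f"{name}: {clf.score(X_test, y_test):.3f} test accuracy")
```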
-
ann-benchmarks
Project mention: Using Your Vector Database as a JSON (Or Relational) Datastore | news.ycombinator.com | 2024-04-23
Off the top of my head, pgvector only supports two index types, and those run in memory only. It doesn't support GPU indexing or disk-based indexing, and it has no separation between queries and insertions.
Also, from the different people I've talked to, they struggle to scale past 100K-1M vectors.
You can also have a look yourself from a performance perspective: https://ann-benchmarks.com/
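For context on what sites like ann-benchmarks.com actually measure, the sketch below computes recall@k for an "approximate" search against exact brute-force search. It uses plain NumPy with random vectors rather than the ann-benchmarks harness, and the random-subset stand-in for an index will score poorly on purpose, which is exactly what the metric is meant to expose:

```python
# Illustration of the recall@k metric that ANN benchmarks report:
# how many of the true nearest neighbours an approximate index returns.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 64)).astype("float32")
queries = rng.normal(size=(100, 64)).astype("float32")
k = 10

def exact_top_k(q):
    dists = np.linalg.norm(data - q, axis=1)
    return set(np.argsort(dists)[:k])

def approximate_top_k(q, candidate_fraction=0.2):
    # Stand-in for an ANN index: only search a random subset of the data.
    candidates = rng.choice(len(data), size=int(len(data) * candidate_fraction), replace=False)
    dists = np.linalg.norm(data[candidates] - q, axis=1)
    return set(candidates[np.argsort(dists)[:k]])

recalls = [len(exact_top_k(q) & approximate_top_k(q)) / k for q in queries]
print(f"mean recall@{k}: {np.mean(recalls):.2f}")
```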
-
opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Project mention: Show HN: Times faster LLM evaluation with Bayesian optimization | news.ycombinator.com | 2024-02-13
Fair question.
Evaluation refers to the phase after training where you check whether the training actually worked.
Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small, domain-specific subset)!
So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries might be similar; they all evaluate every given query. That's where this project might come in handy.
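The observation about redundant queries can be sketched in a few lines. This is not how the linked Show HN project works (it uses Bayesian optimization), and `run_model` below is a hypothetical stub standing in for whatever slow LLM call a real harness would make; the point is only that deduplicating near-identical prompts cuts the number of expensive calls:

```python
# Sketch: group near-identical evaluation queries so the expensive model
# call runs once per group instead of once per query. `run_model` is a stub.
from collections import defaultdict

def run_model(prompt: str) -> str:
    # Hypothetical stand-in for the slow LLM call a real harness would make.
    return prompt.strip().lower()[::-1]

def normalize(prompt: str) -> str:
    # Very crude similarity key: casefold and collapse whitespace.
    return " ".join(prompt.casefold().split())

def evaluate(prompts):
    groups = defaultdict(list)
    for i, p in enumerate(prompts):
        groups[normalize(p)].append(i)

    answers = [None] * len(prompts)
    for indices in groups.values():
        result = run_model(prompts[indices[0]])  # one model call per group
        for i in indices:
            answers[i] = result
    print(f"{len(prompts)} queries -> {len(groups)} model calls")
    return answers

evaluate(["What is 2+2?", "what is 2+2?", "  What is 2+2? ", "Capital of France?"])
```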
-
logparser
Project mention: Log2row: A tool that detects, extracts templates, and structures logs | news.ycombinator.com | 2023-10-06
You use GPT-4 to extract log patterns; does it really need an LLM? There are more traditional approaches, such as https://github.com/logpai/logparser
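As a rough idea of what the non-LLM route looks like, the sketch below masks obviously variable tokens (IPs, hex ids, numbers) with regexes so identical event types collapse to one template. Real parsers such as those in the logparser repo are considerably more sophisticated; this only illustrates the concept:

```python
# Toy log-template extraction: mask variable fields so identical event
# types collapse to one template. Real parsers (e.g. Drain) do much more.
import re
from collections import Counter

MASKS = [
    (re.compile(r"\d{1,3}(?:\.\d{1,3}){3}"), "<IP>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]

def to_template(line: str) -> str:
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

logs = [
    "Accepted password for user42 from 10.0.0.7 port 51234",
    "Accepted password for user99 from 10.0.0.9 port 40112",
    "Disk usage at 91 percent on /dev/sda1",
]
templates = Counter(to_template(l) for l in logs)
for template, count in templates.most_common():
    print(count, template)
```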
-
beir
A heterogeneous benchmark for information retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
Project mention: Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance! | dev.to | 2024-08-29
The source code for these experiments is open-source and utilizes beir-qdrant, an integration of Qdrant with the BeIR library. While this package is not officially maintained by the Qdrant team, it may prove useful for those interested in experimenting with various Qdrant configurations to see how they impact retrieval quality. All experiments were conducted using Qdrant in exact search mode, ensuring the results are not influenced by approximate search.
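BeIR-style experiments boil down to ranking documents per query and scoring the ranking against relevance judgments. The sketch below computes nDCG@k by hand for one toy query, as an illustration of the metric rather than of the BeIR or beir-qdrant APIs (the document ids and judgments are made up):

```python
# Hand-rolled nDCG@k over a toy ranking, the kind of score a BeIR-style
# evaluation reports per dataset. Not the BeIR API, just the metric.
import math

def dcg(relevances):
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

# Relevance judgments for one query (doc id -> graded relevance).
qrels = {"d1": 2, "d4": 1, "d7": 1}
ranking = ["d4", "d1", "d9", "d7", "d2"]  # what the retriever returned
print(f"nDCG@10 = {ndcg_at_k(ranking, qrels):.3f}")
```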
-
pytest-benchmark
Project mention: Pinpoint performance regressions with CI-Integrated differential profiling | dev.to | 2023-10-23
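The linked post is about catching regressions in CI; pytest-benchmark's core usage is a fixture that repeatedly times a callable, roughly as sketched below (the sorting function being timed is only an illustrative stand-in). Saved results from earlier runs can then be compared against new ones, which is what makes the CI regression workflow possible.

```python
# test_sort_bench.py -- run with `pytest` after installing pytest-benchmark.
# The `benchmark` fixture calls the function repeatedly, records timings,
# and returns the function's own return value for normal assertions.
import random

def sort_many(n=10_000):
    data = [random.random() for _ in range(n)]
    return sorted(data)

def test_sort_many(benchmark):
    result = benchmark(sort_many)
    assert len(result) == 10_000
```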
-
benchmark
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. (by pytorch)
If you're the author, unfortunately I have to say that the blog post is not well written: it's misinformed about some of its claims and has a repugnant, click-baity title. You're getting the attention and clicks, but probably losing a lot of trust among people. I didn't engage by choice, but out of a duty to respond to FUD.
> > torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature
> That was one of my major points - I don't think leaning on torch.compile is the best idea. A compiler would inherently place restrictions that you have to work around.
There are plenty of compilers that place restrictions that you barely notice. gcc, clang, nvcc -- they're fairly flexible, and "dynamic". Adding constraints doesn't mean you have to give up on important flexibility.
> This is not dynamic, nor flexible - and it flies in the face of torch's core philosophies just so they can offer more performance to the big labs using PyTorch. For various reasons, I dislike pandering to the rich guy instead of being an independent, open-source entity.
I think this is an assumption you've made largely without evidence, and I'm not entirely sure what your point is. The way torch.compile is measured for success publicly (even in the announcement blog post and conference keynote, link https://pytorch.org/get-started/pytorch-2.0/ ) is by measuring on a bunch of popular PyTorch-based GitHub repos in the wild, plus popular HuggingFace models and the TIMM vision benchmark. They're curated here: https://github.com/pytorch/benchmark . Your claim that it's mainly to favor large labs is pretty puzzling.
torch.compile is both dynamic and flexible because: 1. it supports dynamic shapes, and 2. it allows incremental compilation (you don't need to compile the parts that you wish to keep in uncompilable Python, probably because they use random arbitrary Python packages, etc.). There is a trade-off between dynamism, flexibility, and performance, i.e. more dynamic and flexible means there isn't enough information to extract better performance, but that's an acceptable trade-off when you need the flexibility to express your ideas more than you need the speed.
> XLA's GPU support is great, it's compatible across different hardware, and it's optimized and mature. In short, it's a great alternative to the often buggy torch.compile stack - if you fix the torch integration.
If you are an XLA maximalist, that's fine. I am not. There isn't evidence to prove out either opinion. PyTorch will never be nicely compatible with XLA as long as XLA has significant constraints that are incompatible with PyTorch's user-experience model. The PyTorch devs have given clear, written-down feedback to the XLA project on what it would take for XLA+PyTorch to get better; it's been a few years, and the XLA project prioritizes other things.
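For readers who haven't used the API being debated, a minimal sketch is below, assuming a PyTorch 2.x install; the tiny model is only illustrative. Compilation happens lazily on first call, and `dynamic=True` asks the compiler to generalize over varying input shapes, which is the "dynamic shapes" point above.

```python
# Minimal torch.compile usage (PyTorch 2.x): the model stays an ordinary
# nn.Module, and unsupported Python constructs fall back to eager execution.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)
compiled = torch.compile(model, dynamic=True)

for batch in (8, 32):  # different batch sizes served by one compiled model
    x = torch.randn(batch, 64)
    print(compiled(x).shape)
```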
-
PDEBench
Project mention: [P] LagrangeBench: A Lagrangian Fluid Mechanics Benchmarking Suite | /r/MachineLearning | 2023-12-11
LagrangeBench is a machine learning benchmarking library for CFD particle problems based on JAX. It is designed to evaluate and develop learned particle models (e.g. graph neural networks) on challenging physical problems. To our knowledge it's the first benchmark for this specific set of problems. Our work was inspired by the grid-based benchmarks of PDEBench and PDEArena, and we propose it as a Lagrangian alternative.
-
tape
Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (by songlab-cal)
Python Benchmark related posts
- Python Performance Benchmark Suite
- Any* Embedding Model Can Become a Late Interaction Model - If You Give It a Chance!
- PyTorch is dead. Long live Jax
- Show HN: Open-source LLM provider price comparison
- Show HN: PyBench 2.0 – Python benchmark tool inspired by Geekbench
- Using Your Vector Database as a JSON (Or Relational) Datastore
- PullRequestBenchmark Challenge: Can AI Replace Your Dev Team?
Index
What are some of the best open-source Benchmark projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | fashion-mnist | 11,619 |
2 | mmpose | 5,547 |
3 | ann-benchmarks | 4,840 |
4 | mmaction2 | 4,134 |
5 | Baichuan2 | 4,072 |
6 | opencompass | 3,669 |
7 | Baichuan-13B | 2,980 |
8 | promptbench | 2,347 |
9 | logparser | 1,549 |
10 | beir | 1,543 |
11 | py-motmetrics | 1,368 |
12 | InternVideo | 1,275 |
13 | pytest-benchmark | 1,232 |
14 | evalplus | 1,133 |
15 | smac | 1,066 |
16 | Monocular-Depth-Estimation-Toolbox | 897 |
17 | asv | 860 |
18 | pyperformance | 850 |
19 | benchmark | 843 |
20 | ADBench | 824 |
21 | PDEBench | 714 |
22 | py-frameworks-bench | 710 |
23 | tape | 635 |