XNNPACK vs gemm-benchmark

| | XNNPACK | gemm-benchmark |
|---|---|---|
| Mentions | 8 | 6 |
| Stars | 1,700 | 8 |
| Growth | 1.6% | - |
| Activity | 9.9 | 3.5 |
| Last commit | 5 days ago | 6 months ago |
| Language | C | Rust |
| License | GNU General Public License v3.0 or later | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
XNNPACK
- Xnnpack: High-efficiency floating-point neural network inference operators
- Can a NPU be used for vectors?
- Performance critical ML: How viable is Rust as an alternative to C++
Why are you writing your own inference code in C++ or Rust instead of using some kind of established framework like XNNPACK?
- [P] Pure C/C++ port of OpenAI's Whisper
- [Discussion] Is XNNPACK a part of mediapipe? or should be additionally configured with mediapipe?
XNNPACK - https://github.com/google/XNNPACK
- WebAssembly Techniques to Speed Up Matrix Multiplication by 120x
- Prediction: Macs won't see many new games, no matter how powerful their hardware is
Ok, concrete example time! At work, we're going to be using some software which includes XNNPACK, which is a library of highly-optimised operations for doing neural-network inference. This is the sort of thing where people have gone in and specifically tuned for performance, and nope, there's no attempt at all made to have code which is different for Intel/AMD or Apple/Other ARM. What they target is elements of the ISA, like NEON (i.e. ARM SIMD) and SSE, AVX etc. on x86(-64). And Wasm SIMD for Wasm.
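The ISA-targeting approach described in that comment can be sketched with compile-time feature guards: the code branches on feature macros like `__ARM_NEON` and `__SSE__`, never on CPU vendor or model. This is a minimal illustration of the pattern, not actual XNNPACK source; the function name `vec_add_f32` is made up.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
  #include <arm_neon.h>
#elif defined(__SSE__) || defined(_M_X64)
  #include <xmmintrin.h>
#endif

/* Element-wise float addition, with the SIMD path chosen from ISA
   feature macros (NEON vs. SSE) rather than from the CPU maker. */
void vec_add_f32(const float *a, const float *b, float *out, size_t n) {
  size_t i = 0;
#if defined(__ARM_NEON)
  for (; i + 4 <= n; i += 4) {   /* NEON: 4 floats per instruction */
    vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
  }
#elif defined(__SSE__) || defined(_M_X64)
  for (; i + 4 <= n; i += 4) {   /* SSE: 4 floats per instruction */
    _mm_storeu_ps(out + i,
                  _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
  }
#endif
  for (; i < n; ++i) {           /* scalar tail, also the fallback path */
    out[i] = a[i] + b[i];
  }
}
```

Real kernels in this style add further paths (AVX, AVX-512, Wasm SIMD) behind the corresponding feature macros, which is why they run well on any chip that implements the ISA extension, Apple or otherwise.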
- Where are Nvidia's DLSS models stored and how big are they?
It's quite simple. https://github.com/google/XNNPACK for example.
gemm-benchmark
- Running Stable Diffusion in 260MB of RAM
And PyTorch on the M1 (without Metal) uses the fast AMX matrix multiplication units (through the Accelerate Framework). The matrix multiplication on the M1 is on par with ~10 threads/cores of Ryzen 5900X.
[1] https://github.com/danieldk/gemm-benchmark#example-results
- Ask HN: What is an AI chip and how does it work?
Apple Silicon Macs have special matrix multiplication units (AMX) that can do matrix multiplication fast and with low energy requirements [1]. These AMX units can often beat matrix multiplication on AMD/Intel CPUs (especially those without a very large number of cores). Since a lot of linear algebra code uses matrix multiplication and using the AMX units is only a matter of linking against Accelerate (for its BLAS interface), a lot of software that uses BLAS is faster on Apple Silicon Macs.
That said, the GPUs in your M1 Mac are faster than the AMX units and any reasonably modern NVIDIA GPU will wipe the floor with the AMX units or Apple Silicon GPUs in raw compute. However, a lot of software does not use CUDA by default and for small problem sets AMX units or CPUs with just AVX can be faster because they don't incur the cost of data transfers from main memory to GPU memory and vice versa.
[1] Benchmarks:
https://github.com/danieldk/gemm-benchmark#example-results
https://explosion.ai/blog/metal-performance-shaders (scroll down a bit for AMX and MPS numbers)
- Apple previews Live Speech, Personal Voice, and more new accessibility features
- How to Get 1.5 TFlops of FP32 Performance on a Single M1 CPU Core
Yes, there is one per core cluster. The title is a bit misleading: it suggests that going from one to two or three cores will scale linearly, when in fact it won't be much faster. See here for sgemm benchmarks for everything from the M1 to the M1 Ultra and 1 to 16 threads:
https://github.com/danieldk/gemm-benchmark#1-to-16-threads
- WebAssembly Techniques to Speed Up Matrix Multiplication by 120x
There's always been a tradeoff in writing code between developer experience and taking full advantage of what the hardware is capable of. That "waste" in execution efficiency is often worth it for the sake of representing helpful abstractions and generally helping developer productivity.
The GFLOP/s figure is 1/28th of what you'd get using the native Accelerate framework on M1 Macs [1]. I am all for powerful abstractions, but not using native APIs for this (even if it's just the browser calling Accelerate in some way) is just a huge waste of everyone's CPU cycles and electricity.
[1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads
What are some alternatives?
ncnn - ncnn is a high-performance neural network inference framework optimized for the mobile platform
rknn-toolkit
cpuid2cpuflags - Tool to generate CPU_FLAGS_* for your CPU
OnnxStream - Lightweight inference library for ONNX files, written in C++. It can run SDXL on a RPI Zero 2 but also Mistral 7B on desktops and servers.
wasmblr - C++ WebAssembly assembler in a single header file
armnn - Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
Genann - simple neural network library in ANSI C
Pytorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
ruby-fann - Ruby library for interfacing with FANN (Fast Artificial Neural Network)
piper - A fast, local neural text to speech system
HIP-CPU - An implementation of HIP that works on CPUs, across OSes.
DOOM - DOOM Open Source Release