| | armnn | gemm-benchmark |
|---|---|---|
| Mentions | 2 | 6 |
| Stars | 1,128 | 8 |
| Growth | 2.3% | - |
| Activity | 9.2 | 3.5 |
| Last commit | 3 days ago | 6 months ago |
| Language | C++ | Rust |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
armnn
-
LeCun: Qualcomm working with Meta to run Llama-2 on mobile devices
Like ARM? https://github.com/ARM-software/armnn
Optimization for this workload has arguably been in progress for decades. Modern AVX instructions can be found in laptops that are a decade old now, and most big inferencing projects are built around SIMD or GPU shaders. Unless your computer ships with onboard Nvidia hardware, there's usually not much difference in inferencing performance.
-
Apple previews Live Speech, Personal Voice, and more new accessibility features
Yes, and generic multicore ARM CPUs can run ARM's standard compute library regardless of their hardware: https://github.com/ARM-software/armnn
Plus, the benchmark you've linked to is comparing CPU accelerated code to the notoriously crippled MKL execution. A more appropriate comparison would test Apple's AMX units against the Ryzen's SIMD-optimized inferencing.
gemm-benchmark
-
Running Stable Diffusion in 260MB of RAM
And PyTorch on the M1 (without Metal) uses the fast AMX matrix multiplication units (through the Accelerate framework). Matrix multiplication on the M1 is on par with ~10 threads/cores of a Ryzen 5900X [1].
[1] https://github.com/danieldk/gemm-benchmark#example-results
-
Ask HN: What is an AI chip and how does it work?
Apple Silicon Macs have special matrix multiplication units (AMX) that can do matrix multiplication fast and with low energy requirements [1]. These AMX units can often beat matrix multiplication on AMD/Intel CPUs (especially those without a very large number of cores). Since a lot of linear algebra code uses matrix multiplication and using the AMX units is only a matter of linking against Accelerate (for its BLAS interface), a lot of software that uses BLAS is faster on Apple Silicon Macs.
That said, the GPUs in your M1 Mac are faster than the AMX units and any reasonably modern NVIDIA GPU will wipe the floor with the AMX units or Apple Silicon GPUs in raw compute. However, a lot of software does not use CUDA by default and for small problem sets AMX units or CPUs with just AVX can be faster because they don't incur the cost of data transfers from main memory to GPU memory and vice versa.
[1] Benchmarks:
https://github.com/danieldk/gemm-benchmark#example-results
https://explosion.ai/blog/metal-performance-shaders (scroll down a bit for AMX and MPS numbers)
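The trade-off described in the comment above (a faster GPU versus the cost of moving data to it) can be sketched with a small cost model. This is only an illustration: the function names, the throughput and bandwidth figures, and the fixed launch overhead are all hypothetical placeholders, not measurements from the linked benchmarks.

```python
def sgemm_cost(n: int) -> tuple[int, int]:
    """Return (flops, bytes_moved) for an n x n float32 matrix multiply.

    Each of the n*n outputs needs n multiplies and n adds (2*n**3 flops).
    We assume A and B are copied to the device and C is copied back.
    """
    flops = 2 * n**3
    bytes_moved = 3 * n * n * 4  # three matrices, 4 bytes per float32
    return flops, bytes_moved


def cpu_time(n: int, cpu_gflops: float) -> float:
    """Seconds for the multiply on the CPU/AMX side: no transfers needed."""
    flops, _ = sgemm_cost(n)
    return flops / (cpu_gflops * 1e9)


def gpu_time(n: int, gpu_gflops: float, bandwidth_gb_s: float,
             launch_overhead_s: float) -> float:
    """Seconds on the GPU side: fixed overhead + transfers + compute."""
    flops, bytes_moved = sgemm_cost(n)
    transfer = bytes_moved / (bandwidth_gb_s * 1e9)
    compute = flops / (gpu_gflops * 1e9)
    return launch_overhead_s + transfer + compute


def faster_device(n: int, cpu_gflops: float, gpu_gflops: float,
                  bandwidth_gb_s: float, launch_overhead_s: float) -> str:
    """Name the device with the lower modeled time for an n x n multiply."""
    t_cpu = cpu_time(n, cpu_gflops)
    t_gpu = gpu_time(n, gpu_gflops, bandwidth_gb_s, launch_overhead_s)
    return "cpu" if t_cpu <= t_gpu else "gpu"


if __name__ == "__main__":
    # Hypothetical hardware figures, chosen only to show the crossover.
    params = dict(cpu_gflops=1000.0, gpu_gflops=10000.0,
                  bandwidth_gb_s=16.0, launch_overhead_s=20e-6)
    for n in (64, 256, 1024, 4096):
        print(n, faster_device(n, **params))
```

Under these made-up parameters the small multiplies favor the CPU side and the large ones favor the GPU; where the crossover falls on real hardware depends entirely on the actual throughput, bandwidth, and overhead.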
-
Apple previews Live Speech, Personal Voice, and more new accessibility features
-
How to Get 1.5 TFlops of FP32 Performance on a Single M1 CPU Core
Yes, there is one per core cluster. The title is a bit misleading, because it suggests that performance will scale linearly when going to two or three cores, but it won't be much faster. See here for sgemm benchmarks for everything from the M1 to M1 Ultra and 1 to 16 threads:
https://github.com/danieldk/gemm-benchmark#1-to-16-threads
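One way to read a thread-scaling table like the one linked above is to convert throughput into speedup and parallel efficiency. The helper names below are hypothetical, and the numbers in the usage lines are placeholders rather than results from the benchmark.

```python
def speedup(single_thread_gflops: float, multi_thread_gflops: float) -> float:
    """How many times faster the multi-threaded run is than one thread."""
    return multi_thread_gflops / single_thread_gflops


def parallel_efficiency(single_thread_gflops: float,
                        multi_thread_gflops: float,
                        threads: int) -> float:
    """Speedup divided by thread count; 1.0 would mean linear scaling."""
    return speedup(single_thread_gflops, multi_thread_gflops) / threads


if __name__ == "__main__":
    # Placeholder throughput figures, for illustration only.
    one_thread, two_threads = 1000.0, 1100.0
    print(f"speedup: {speedup(one_thread, two_threads):.2f}x")
    print(f"efficiency: {parallel_efficiency(one_thread, two_threads, 2):.2f}")
```

An efficiency well below 1.0 at two or three threads is what you would expect if the threads share a single matrix unit per core cluster, as the comment says.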
-
WebAssembly Techniques to Speed Up Matrix Multiplication by 120x
There's always been a tradeoff in writing code between developer experience and taking full advantage of what the hardware is capable of. That "waste" in execution efficiency is often worth it for the sake of representing helpful abstractions and generally helping developer productivity.
The GFLOP/s is 1/28th of what you'd get when using the native Accelerate framework on M1 Macs [1]. I am all in for powerful abstractions, but not using native APIs for this (even if it's just the browser calling Accelerate in some way) is just a huge waste of everyone's CPU cycles and electricity.
[1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads
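For readers unfamiliar with how a GFLOP/s figure for matrix multiplication is obtained, here is a minimal sketch: count 2·m·n·k floating-point operations for the multiply and divide by the elapsed time. It uses a deliberately slow pure-Python triple loop, so its result says nothing about Accelerate or WebAssembly; the function names are hypothetical and this is not the method used by gemm-benchmark.

```python
import time


def gemm_flops(m: int, n: int, k: int) -> int:
    """Flops in C = A @ B with A (m x k) and B (k x n): one multiply
    and one add per inner-loop step."""
    return 2 * m * n * k


def naive_matmul(a, b):
    """Reference triple-loop multiply on lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


def measure_gflops(size: int = 64) -> float:
    """Time one size x size multiply and convert to GFLOP/s."""
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    start = time.perf_counter()
    naive_matmul(a, b)
    elapsed = time.perf_counter() - start
    return gemm_flops(size, size, size) / elapsed / 1e9


if __name__ == "__main__":
    print(f"naive pure-Python matmul: {measure_gflops():.4f} GFLOP/s")
```

Comparing two implementations then comes down to dividing their GFLOP/s figures for the same matrix size, which is how a ratio like the one quoted above would be computed.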
What are some alternatives?
iNeural - A library for creating Artificial Neural Networks, for use in Machine Learning and Deep Learning algorithms.
XNNPACK - High-efficiency floating-point neural network inference operators for mobile, server, and Web
llama - Inference code for Llama models
rknn-toolkit
piper - A fast, local neural text to speech system
OnnxStream - Lightweight inference library for ONNX files, written in C++. It can run SDXL on a RPI Zero 2 but also Mistral 7B on desktops and servers.
DeepSpeech - DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Pytorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration