sgemm_riscv
This project records the process of optimizing SGEMM (single-precision general matrix multiplication) on the RISC-V platform.
this sgemm progress is pretty impressive. i assume 'version 1' is one of the lower lines on the first graph? he has a graph lower down that shows it getting under 300 megaflops, closer to 50 megaflops for most problem sizes. getting up to 2 gigaflops is seriously impressive; that's 50% of the theoretical peak throughput, which is really hard to reach on anything that isn't a vector supercomputer
probably worth archiving this link: https://github.com/Zhao-Dongyu/sgemm_riscv
i was astounded yesterday to find that gcc 12.2.0 fails at some basic optimizations on risc-v. (all of the following is with -Os)
this trivial subroutine
int sumarray(int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
I don't believe there is any official documentation on this, but https://github.com/JuliaLang/julia/pull/49430, for example, added prefetching to the marking phase of Julia's GC, which saw speedups on x86 but not on M1.
yep. they have a neural engine that is separate from the CPU and GPU that does really fast matmuls https://github.com/hollance/neural-engine. it's basically completely undocumented.
I am talking about the matrix/vector coprocessor (AMX). You can find some reverse-engineered documentation here: https://github.com/corsix/amx
On M3 a single matrix block can achieve ~1 TFLOP/s on DGEMM; I assume it will be closer to 4 TFLOP/s for SGEMM. The Max variants have two such blocks. I didn't do precise benchmarking myself, but switching Python/R matrix libraries to use Apple's BLAS results in a 5-6x perf improvement on matrix-heavy code for me.
There is a recent update to BLIS, an alternative to BLAS, that includes a number of RISC-V performance optimizations.
https://github.com/flame/blis/pull/737