-
A curious thing about Swift: after https://github.com/attractivechaos/plb2/pull/23, the matrix multiplication example is comparable to C and Rust. However, I don’t see a way to idiomatically optimise the sudoku example, whose main overhead is allocating several arrays each time solve() is called. Apparently, in Swift there is no such thing as static array allocation. That’s very unfortunate.
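For contrast, this is roughly what static scratch storage looks like in C: the arrays live in static storage and are reset, not reallocated, on each call. (The `solve()` shape and bookkeeping arrays here are hypothetical, not the actual plb2 sudoku solver.)

```c
#include <string.h>

/* Hypothetical scratch arrays for a sudoku solver, placed in static
   storage so calling solve() repeatedly allocates nothing. */
static int row_used[9];
static int col_used[9];
static int box_used[9];

int solve(const int *grid)
{
    /* Reset the scratch state instead of reallocating it each call. */
    memset(row_used, 0, sizeof row_used);
    memset(col_used, 0, sizeof col_used);
    memset(box_used, 0, sizeof box_used);
    /* ... actual backtracking search elided ... */
    (void)grid;
    return 0;
}
```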
-
-
laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers (by mratsim)
Note: the theoretical peak limit is hardcoded and assumes my previous machine, an i9-9980XE.
It may be that your BLAS library is not named libopenblas.so; you can change that here: https://github.com/mratsim/laser/blob/master/benchmarks/thir...
The implementation is in this folder: https://github.com/mratsim/laser/tree/master/laser/primitive...
In particular, tiling, cache and register optimization: https://github.com/mratsim/laser/blob/master/laser/primitive...
AVX512 code generator: https://github.com/mratsim/laser/blob/master/laser/primitive...
And a generic Scalar/SSE/AVX/AVX2/AVX512 microkernel generator (Nim macros that generate code at compile time): https://github.com/mratsim/laser/blob/master/laser/primitive...
I'll come back later with details on how to use my custom HPC threadpool Weave instead of OpenMP (https://github.com/mratsim/weave/tree/master/benchmarks/matm...)
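The tiling / cache-blocking idea referenced above can be sketched in plain C. This is only an illustration of the technique, not Laser's generated Nim kernels, and the block size is an arbitrary assumption rather than a tuned value:

```c
#include <stddef.h>

/* Cache-blocked (tiled) matrix multiplication: C += A * B, all n x n,
   row-major. BLK is a tunable assumption, not Laser's actual tile size. */
#define BLK 64

void matmul_tiled(size_t n, const float *A, const float *Bm, float *C)
{
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t kk = 0; kk < n; kk += BLK)
            for (size_t jj = 0; jj < n; jj += BLK)
                /* Work on one BLK x BLK tile at a time so the touched
                   pieces of A and B stay resident in cache. */
                for (size_t i = ii; i < ii + BLK && i < n; i++)
                    for (size_t k = kk; k < kk + BLK && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BLK && j < n; j++)
                            C[i * n + j] += a * Bm[k * n + j];
                    }
}
```

Real kernels add register blocking and SIMD on top of this loop structure, which is what the microkernel generator linked above does at compile time.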
-
weave
A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead (by mratsim)
-
I should have mentioned somewhere: I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want thread parallelism outside code like this anyway.
As for the inner loop not being well optimized... the disassembly looks like basically the same thing as OpenBLAS. There's disassembly in the comments of that file showing what code it generates; I'd love to know what you think is lacking! The only differences between the one I linked and this are prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...
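For readers unfamiliar with the prefetching being discussed, here is a rough illustration using the GCC/Clang builtin. The prefetch distance is an arbitrary guess for illustration, not dsharlet's tuned value:

```c
#include <stddef.h>

/* Inner loop of a row update (c += a * b) with an explicit software
   prefetch hint. The distance of 8 floats ahead is an assumption made
   for illustration; real kernels tune it to the machine. */
void saxpy_row(size_t n, float a, const float *restrict b,
               float *restrict c)
{
    for (size_t j = 0; j < n; j++) {
#if defined(__GNUC__)
        /* Hint: b[j + 8] will be read soon, keep it in all cache levels.
           Prefetching past the end of the array is harmless. */
        __builtin_prefetch(&b[j + 8], 0, 3);
#endif
        c[j] += a * b[j];
    }
}
```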
-
Or use BenchmarkDotNet ( https://github.com/dotnet/BenchmarkDotNet ), which, among other measures taken to get an accurate benchmark, does JIT warmup outside of the measured runs.
-
So I actually tested your code: https://gist.github.com/bjourne/c2d0db48b2e50aaadf884e4450c6...
On my machine, single-threaded OpenBLAS multiplies two single-precision 4096x4096 matrices in 0.95 seconds; your code takes over 30 seconds. For comparison, my own matrix multiplication code (https://github.com/bjourne/c-examples/blob/master/libraries/...), run in single-threaded mode, takes 0.89 seconds. That actually beats OpenBLAS, though OpenBLAS retakes the lead on larger matrices once multi-threading is added.
-
-
First, we can use Laser, which was my initial BLAS experiment in 2019. At the time, notably, OpenBLAS didn't properly use the AVX512 VPUs (see this thread in BLIS: https://github.com/flame/blis/issues/352 ). It has made progress since then; still, on my current laptop, performance is in the same range.
Reproduction:
Related posts
-
Can't compile llama-cpp-python with CLBLAST
-
Why does working with a transposed tensor not make the following operations less performant?
-
[New linear algebra library] monolish: MONOlithic LInear equation Solvers for Highly-parallel architecture
-
Improve performance with SIMD intrinsics
-
Mandelbrot Set Explorer