-
A curious thing about Swift: after https://github.com/attractivechaos/plb2/pull/23, the matrix multiplication example is comparable to C and Rust. However, I don’t see a way to idiomatically optimise the sudoku example, whose main overhead is allocating several arrays each time solve() is called. Apparently, in Swift there is no such thing as static array allocation. That’s very unfortunate.
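For contrast, this is roughly what static scratch storage looks like in C: the arrays live in static storage and are reset, not reallocated, on each call. (The `solve()` shape and bookkeeping arrays here are hypothetical, not the actual plb2 sudoku solver.)

```c
#include <string.h>

/* Hypothetical scratch arrays for a sudoku solver, placed in static
   storage so calling solve() repeatedly allocates nothing. */
static int row_used[9];
static int col_used[9];
static int box_used[9];

int solve(const int *grid)
{
    /* Reset the scratch state instead of reallocating it each call. */
    memset(row_used, 0, sizeof row_used);
    memset(col_used, 0, sizeof col_used);
    memset(box_used, 0, sizeof box_used);
    /* ... actual backtracking search elided ... */
    (void)grid;
    return 0;
}
```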
-
-
laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers (by mratsim)
Note: the theoretical peak limit is hardcoded and assumes my previous machine, an i9-9980XE.
It may be that your BLAS library is not named libopenblas.so; you can change that here: https://github.com/mratsim/laser/blob/master/benchmarks/thir...
The implementation is in this folder: https://github.com/mratsim/laser/tree/master/laser/primitive...
In particular, tiling, cache and register optimization: https://github.com/mratsim/laser/blob/master/laser/primitive...
AVX512 code generator: https://github.com/mratsim/laser/blob/master/laser/primitive...
And a generic Scalar/SSE/AVX/AVX2/AVX512 microkernel generator (Nim macros that generate code at compile time): https://github.com/mratsim/laser/blob/master/laser/primitive...
I'll come back later with details on how to use my custom HPC threadpool Weave instead of OpenMP (https://github.com/mratsim/weave/tree/master/benchmarks/matm...)
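The tiling / cache-blocking idea referenced above can be sketched in plain C. This is only an illustration of the technique, not Laser's generated Nim kernels, and the block size is an arbitrary assumption rather than a tuned value:

```c
#include <stddef.h>

/* Cache-blocked (tiled) matrix multiplication: C += A * B, all n x n,
   row-major. BLK is a tunable assumption, not Laser's actual tile size. */
#define BLK 64

void matmul_tiled(size_t n, const float *A, const float *Bm, float *C)
{
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t kk = 0; kk < n; kk += BLK)
            for (size_t jj = 0; jj < n; jj += BLK)
                /* Work on one BLK x BLK tile at a time so the touched
                   pieces of A and B stay resident in cache. */
                for (size_t i = ii; i < ii + BLK && i < n; i++)
                    for (size_t k = kk; k < kk + BLK && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BLK && j < n; j++)
                            C[i * n + j] += a * Bm[k * n + j];
                    }
}
```

Real kernels add register blocking and SIMD on top of this loop structure, which is what the microkernel generator linked above does at compile time.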
-
weave
A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead (by mratsim)
-
I should have mentioned somewhere: I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want thread parallelism outside code like this anyway.
As for the inner loop not being well optimized... the disassembly looks like basically the same thing as OpenBLAS. There's disassembly in the comments of that file showing what code it generates; I'd love to know what you think is lacking! The only differences between the one I linked and this are prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...
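For readers unfamiliar with the prefetching being discussed, here is a rough illustration using the GCC/Clang builtin. The prefetch distance is an arbitrary guess for illustration, not dsharlet's tuned value:

```c
#include <stddef.h>

/* Inner loop of a row update (c += a * b) with an explicit software
   prefetch hint. The distance of 8 floats ahead is an assumption made
   for illustration; real kernels tune it to the machine. */
void saxpy_row(size_t n, float a, const float *restrict b,
               float *restrict c)
{
    for (size_t j = 0; j < n; j++) {
#if defined(__GNUC__)
        /* Hint: b[j + 8] will be read soon, keep it in all cache levels.
           Prefetching past the end of the array is harmless. */
        __builtin_prefetch(&b[j + 8], 0, 3);
#endif
        c[j] += a * b[j];
    }
}
```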
-
Or use BenchmarkDotNet ( https://github.com/dotnet/BenchmarkDotNet ), which, among other measures taken to get an accurate benchmark, does JIT warmup outside of the measured runs.
-
So I actually tested your code: https://gist.github.com/bjourne/c2d0db48b2e50aaadf884e4450c6...
On my machine, single-threaded OpenBLAS multiplies two single-precision 4096x4096 matrices in 0.95 seconds; your code takes over 30 seconds. For comparison, my own matrix multiplication code (https://github.com/bjourne/c-examples/blob/master/libraries/...), run in single-threaded mode, takes 0.89 seconds. That actually beats OpenBLAS, though OpenBLAS retakes the lead on larger matrices once multi-threading is added.
-
-
First, we can use Laser, which was my initial BLAS experiment in 2019. At the time, notably, OpenBLAS didn't properly use the AVX512 VPUs (see this thread in BLIS: https://github.com/flame/blis/issues/352 ). It has made progress since then; still, on my current laptop, performance is in the same range.
Reproduction:
Related posts
-
Can't compile llama-cpp-python with CLBLAST
-
Why does working with a transposed tensor not make the following operations less performant?
-
[New linear algebra library] monolish: MONOlithic LInear equation Solvers for Highly-parallel architecture
-
Improve performance with SIMD intrinsics
-
Mandelbrot Set Explorer