Benchmarking 20 programming languages on N-queens and matrix multiplication

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • plb2

    A programming language benchmark

  • A curious thing about Swift: after https://github.com/attractivechaos/plb2/pull/23, the matrix multiplication example is comparable to C and Rust. However, I don’t see a way to idiomatically optimise the sudoku example, whose main overhead is allocating several arrays each time solve() is called. Apparently, in Swift there is no such thing as static array allocation. That’s very unfortunate.
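
    For readers unfamiliar with what "static array allocation" buys here, the sketch below shows the idea in C (not plb2's actual sudoku code; the buffer sizes and the solve() signature are illustrative): fixed-size scratch buffers live on the stack or in static storage, so repeated calls to solve() allocate nothing on the heap, whereas a Swift Array is a heap-allocated, reference-counted buffer.

    ```c
    #include <string.h>

    /* Illustration only (hypothetical solver, not plb2's code): the scratch
     * buffers below occupy stack or static storage, so calling solve()
     * repeatedly performs no heap allocation at all. */
    static int solve(const int puzzle[81], int solution[81])
    {
        int candidates[81];          /* stack storage, no malloc */
        static int scratch[81 * 9];  /* static storage, allocated once */

        memcpy(solution, puzzle, 81 * sizeof(int));
        memset(candidates, 0, sizeof(candidates));
        memset(scratch, 0, sizeof(scratch));

        /* ... actual search elided ... */
        return 0;
    }
    ```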

  • laser

    The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers (by mratsim)

  • Note: the theoretical peak limit is hardcoded and was computed for my previous machine, an i9-9980XE.

    It may be that your BLAS library is not named libopenblas.so; you can change that here: https://github.com/mratsim/laser/blob/master/benchmarks/thir...

    Implementation is in this folder: https://github.com/mratsim/laser/tree/master/laser/primitive...

    In particular, tiling, cache, and register optimization: https://github.com/mratsim/laser/blob/master/laser/primitive...

    AVX512 code generator: https://github.com/mratsim/laser/blob/master/laser/primitive...

    And a generic Scalar/SSE/AVX/AVX2/AVX512 microkernel generator (these are Nim macros that generate code at compile time): https://github.com/mratsim/laser/blob/master/laser/primitive...

    I'll come back later with details on how to use my custom HPC threadpool Weave instead of OpenMP (https://github.com/mratsim/weave/tree/master/benchmarks/matm...)
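
    The links above point at laser's Nim implementation. As a rough, language-agnostic illustration of the tiling idea those files implement (a minimal C sketch, not laser's kernel; the tile size is arbitrary, and real kernels also pack panels and use SIMD register blocking):

    ```c
    #include <stddef.h>

    /* Cache-blocked C[i][j] += A[i][k] * B[k][j] for row-major n x n matrices.
     * Working on TILE x TILE blocks keeps the data touched by the three inner
     * loops resident in cache instead of streaming whole rows/columns. */
    enum { TILE = 64 };  /* illustrative; tuned per cache level in practice */

    void matmul_blocked(const float *A, const float *B, float *C, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            float a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }
    ```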

  • weave

    A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead (by mratsim)

  • array

    C++ multidimensional arrays in the spirit of the STL

  • I should have mentioned somewhere that I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want the thread parallelism outside code like this anyway.

    As for the inner loop not being well optimized: the disassembly looks like basically the same thing as OpenBLAS. There's disassembly in the comments of that file showing what code it generates; I'd love to know what you think is lacking! The only differences between the one I linked and this one are prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...
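
    For context, the prefetching difference mentioned above usually looks something like this generic C sketch (not the code from dsharlet/array): while accumulating the current chunk, __builtin_prefetch (a GCC/Clang builtin) hints the CPU to start fetching the data the next iteration will read.

    ```c
    #include <stddef.h>

    /* Generic software-prefetch illustration. __builtin_prefetch takes
     * (address, 0 = read / 1 = write, temporal locality 0-3); prefetching
     * slightly past the end of an array is a harmless hint. */
    float dot_with_prefetch(const float *a, const float *b, size_t n)
    {
        float acc = 0.0f;
        for (size_t k = 0; k < n; k += 16) {
            __builtin_prefetch(a + k + 16, 0, 3);
            __builtin_prefetch(b + k + 16, 0, 3);
            for (size_t u = k; u < k + 16 && u < n; u++)
                acc += a[u] * b[u];
        }
        return acc;
    }
    ```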

  • BenchmarkDotNet

    Powerful .NET library for benchmarking

  • Or use BenchmarkDotNet (https://github.com/dotnet/BenchmarkDotNet), which, among other things it does to get an accurate benchmark, performs JIT warmup outside of the measurement.
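
    The warmup-then-measure pattern it automates is language-independent. A minimal hand-rolled sketch of the same idea, written here in C for illustration (BenchmarkDotNet additionally handles statistics, outlier detection, and process isolation; the workload below is a placeholder):

    ```c
    #include <stdio.h>
    #include <time.h>

    static volatile double sink;

    /* Placeholder for the code being benchmarked. */
    static void workload(void)
    {
        double acc = 0.0;
        for (int i = 0; i < 1000000; i++)
            acc += i * 0.5;
        sink = acc;  /* keep the loop from being optimized away */
    }

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        for (int i = 0; i < 5; i++)   /* warmup runs: not measured */
            workload();

        double t0 = now_sec();
        for (int i = 0; i < 20; i++)  /* measured runs */
            workload();
        printf("mean per run: %.6f s\n", (now_sec() - t0) / 20.0);
        return 0;
    }
    ```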

  • related_post_gen

    Data Processing benchmark featuring Rust, Go, Swift, Zig, Julia etc.

  • There is one for data processing here: https://github.com/jinyus/related_post_gen

  • c-examples

    Example C code

  • So I actually tested your code: https://gist.github.com/bjourne/c2d0db48b2e50aaadf884e4450c6...

    On my machine, single-threaded OpenBLAS multiplies two single-precision 4096x4096 matrices in 0.95 seconds. Your code takes over 30 seconds. For comparison, my own matrix multiplication code (https://github.com/bjourne/c-examples/blob/master/libraries/...) run in single-threaded mode takes 0.89 seconds, which actually beats OpenBLAS, though OpenBLAS retakes the lead for larger matrices once multi-threading is added.
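
    For anyone who wants to reproduce a comparison like this, a minimal single-threaded OpenBLAS timing in C looks roughly like the sketch below (assuming OpenBLAS's cblas.h, which exposes openblas_set_num_threads, and linking with -lopenblas; the matrix size matches the 4096x4096 single-precision case above, and the initialization values are arbitrary):

    ```c
    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Time one single-precision 4096x4096 GEMM with OpenBLAS pinned to one
     * thread. Build with something like: cc -O2 sgemm_bench.c -lopenblas */
    int main(void)
    {
        const int n = 4096;
        float *A = malloc(sizeof(float) * n * n);
        float *B = malloc(sizeof(float) * n * n);
        float *C = calloc((size_t)n * n, sizeof(float));
        for (int i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; }

        openblas_set_num_threads(1);  /* compare one thread to one thread */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("sgemm %dx%d: %.3f s\n", n, n,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
        free(A); free(B); free(C);
        return 0;
    }
    ```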

  • blis

    BLAS-like Library Instantiation Software Framework

  • First, we can use Laser, which was my initial BLAS experiment in 2019. At the time, OpenBLAS in particular didn't properly use the AVX512 VPUs (see the thread in BLIS: https://github.com/flame/blis/issues/352). It has since made progress; still, on my current laptop, performance is in the same range.

    Reproduction:
