From slow to SIMD: A Go optimization story

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • avo

    Generate x86 Assembly with Go

  • I wonder whether avo could have been useful here?[1] I mention it because it came up the last time we were talking about AVX operations in Go.[2]

    1 = https://github.com/mmcloughlin/avo

    2 = https://news.ycombinator.com/item?id=34465297

  • Apache Arrow

    Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

  • I learned yesterday about Go's assembler (https://go.dev/doc/asm) after browsing how Arrow is implemented for different languages (my experience is mainly C/C++). In https://github.com/apache/arrow/tree/main/go/arrow/math there are a bunch of .s (assembly) files, and I still can't fully comprehend how they work (I guess it'll take more reading). It seems very peculiar.

    The last time I used inline assembly was back in Turbo/Borland Pascal, then a bit in Visual Studio (32-bit), until it got disabled. After that I did a little with gcc, which has a stricter specification (with the former you had to know how the ABI worked; with the latter too, but it was specced out).

    Anyway, I wasn't expecting to find this in Go :) But I guess you can always start with .go code, produce assembly (-S), then optimize it, or find/hire someone to do it.
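    A minimal sketch of that workflow, assuming a plain dot product as the starting point (the function name dot is hypothetical and not taken from Arrow). The comments show two ways to ask the Go toolchain for its generated assembly, which can then serve as a baseline for a hand-written .s file:

```go
// To print the assembly the compiler generates for this package:
//
//	go build -gcflags=-S .
//
// or, for a single file:
//
//	go tool compile -S main.go
package main

import "fmt"

// dot is an ordinary Go implementation, used here only as the
// starting point whose compiler output you would inspect.
func dot(a, b []float32) float32 {
	if len(a) != len(b) {
		panic("dot: slices must have equal length")
	}
	var sum float32
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(dot([]float32{1, 2, 3}, []float32{4, 5, 6}))
}
```

    The listing is in Go's own assembler syntax (the one described at https://go.dev/doc/asm), which is part of why it looks peculiar to readers used to gcc or MSVC inline assembly.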

  • laser

    The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers (by mratsim)

  • It depends.

    You need 2 to 3 accumulators to saturate instruction-level parallelism with a parallel sum reduction. The compiler won't do this on its own, because it only splits a sum across several accumulators when the operation is associative, i.e. (a+b)+c = a+(b+c), which holds for integer addition but not for floating-point addition.

    There is an escape hatch in -ffast-math.

    I have extensive benches on this here: https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...
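    A rough sketch of the idea in Go, not taken from the linked benchmarks: the function names and the choice of four accumulators are assumptions made for illustration. The first function has one long dependency chain, the second keeps independent partial sums that the CPU can advance in parallel, and main shows why a compiler cannot make this change for floats by itself.

```go
package main

import "fmt"

// sumNaive adds strictly left to right, so every addition has to
// wait for the previous one to finish.
func sumNaive(xs []float32) float32 {
	var s float32
	for _, x := range xs {
		s += x
	}
	return s
}

// sumFourAcc keeps four independent partial sums. The additions in
// one loop iteration do not depend on each other.
func sumFourAcc(xs []float32) float32 {
	var s0, s1, s2, s3 float32
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		s0 += xs[i]
		s1 += xs[i+1]
		s2 += xs[i+2]
		s3 += xs[i+3]
	}
	// Fold any leftover elements into the first accumulator.
	for ; i < len(xs); i++ {
		s0 += xs[i]
	}
	return (s0 + s1) + (s2 + s3)
}

func main() {
	xs := make([]float32, 1000)
	for i := range xs {
		xs[i] = 0.5
	}
	fmt.Println(sumNaive(xs), sumFourAcc(xs))

	// Regrouping float additions can change the result, which is
	// why the compiler will not reorder them without permission.
	a, b, c := float32(1e8), float32(-1e8), float32(1)
	fmt.Println((a+b)+c, a+(b+c))
}
```

    Because the two functions group the additions differently, they are not guaranteed to return bit-identical results on arbitrary input. Writing the accumulators by hand is an explicit opt-in to that reordering, which is roughly what -ffast-math grants a C compiler globally.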

  • 1brc

    1๏ธโƒฃ๐Ÿ๐ŸŽ๏ธ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

  • Even manual vectorization is a pain... writing ASM, really?

    Rust has unstable portable SIMD and a few third-party crates, C++ has that as well, C# has stable portable SIMD and a very small BLAS-like library on top of it (hell, it even exercises PackedSIMD when run in a browser), and Java is getting stable Panama vectors some time in the future (though the question of codegen quality stands open given planned changes to the unsafe API).

    Go is uniquely disadvantaged among these. And if that's not enough, you may want to visit the 1BRC challenge discussions and see that Go struggles to get anywhere close to the 2s mark, while both C# and C++ are blazing past it:

    https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...

    https://github.com/gunnarmorling/1brc/discussions/67

  • rust

    Empowering everyone to build reliable and efficient software.

  • Ahh yes, thank you.

    Rust doesn't have a -ffast-math flag, though it is interesting that you passed it directly to LLVM. I am kinda glad that escape hatch doesn't work, to be honest.

    There are currently unstable intrinsics that let you do this, and you seemingly get close to clang codegen with them: https://godbolt.org/z/EEW79Gbxv

    The thread tracking this discusses another attempt at a flag to enable this by turning on the CPU feature directly, but that doesn't seem to affect codegen in this case. https://github.com/rust-lang/rust/issues/21690

    It would be nice to get these intrinsics stabilized, at least.

  • SimSIMD

    Up to 200x Faster Inner Products and Vector Similarity for Python, JavaScript, Rust, and C, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐

  • For other languages (including Node.js, Bun, Rust, Python, etc.) you can have a look at SimSIMD [0], which I have contributed to this year (I made precompiled binaries for Node.js/Bun part of the build process for x86_64 and arm64 on Mac and Linux, and for x86 and x86_64 on Windows).

    [0] https://github.com/ashvardanian/SimSIMD

  • vek

    SIMD Accelerated vector functions for Go

  • I did a similar optimization via https://github.com/viterin/vek as the SIMD version. Some somewhat unscientific calculations showed a 10x improvement staying in float32: https://github.com/stillmatic/gollum/blob/07a9aa35d2517af8cf...

    TBH my takeaway was that it was more useful to use smaller vectors as a representation

  • gollum

    Production grade LLM-ops in Golang (by stillmatic)

  • highway

    Performance-portable, length-agnostic SIMD with runtime dispatch

  • C++ users can enjoy Highway [1].

    [1] https://github.com/google/highway/

  • Halide

    a language for fast, portable data-parallel computation

  • This is a task where Halide (https://halide-lang.org/) could really shine! It separates the logic from the scheduling (unrolling, vectorizing, tiling, caching intermediates, etc.), so every step the author describes in the article is a tunable in Halide. Halide doesn't appear to have bindings for Go, so calling C++ from Go might be the only viable option.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
