I wonder whether avo could have been useful here?[1] I mention it because it came up the last time we were talking about AVX operations in Go.[2]
1 = https://github.com/mmcloughlin/avo
2 = https://news.ycombinator.com/item?id=34465297
I learned yesterday about Go's assembler (https://go.dev/doc/asm) after browsing how Arrow is implemented for different languages (my experience is mainly C/C++): https://github.com/apache/arrow/tree/main/go/arrow/math. There are a bunch of .s ("asm") files, and I'm still not able to comprehend exactly how they work (I guess it'll take more reading); it seems very peculiar.
The last time I used inline assembly was back in Turbo/Borland Pascal, then a bit in Visual Studio (32-bit), until it got disabled. After that I did very little with gcc and its stricter specification (with the former you had to know how the ABI worked; with the latter too, but at least it was specced out).
Anyway, I wasn't expecting to find this in "Go" :) But I guess you can always start with .go code, produce the assembly (-S), then optimize it, or find/hire someone to do it.
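As a rough two-file sketch of how those .s files connect to Go (the package, file names, and function are illustrative, not taken from the Arrow sources): the .go file declares a function with no body, and a sibling .s file implements it in Go's Plan 9 assembler dialect, addressing arguments via the FP pseudo-register.

```
// file: sum_amd64.go
package mymath

// sumInt64 has no Go body; the implementation lives in sum_amd64.s.
//go:noescape
func sumInt64(a []int64) int64

// file: sum_amd64.s (Plan 9 dialect; not GAS or Intel syntax)
//
// #include "textflag.h"
//
// // func sumInt64(a []int64) int64
// // $0-32: no local frame; 24 bytes of args (ptr, len, cap) + 8 for the result.
// TEXT ·sumInt64(SB), NOSPLIT, $0-32
//     MOVQ a_base+0(FP), SI   // slice data pointer
//     MOVQ a_len+8(FP), CX    // slice length
//     XORQ AX, AX             // accumulator = 0
// loop:
//     TESTQ CX, CX
//     JZ    done
//     ADDQ  (SI), AX
//     ADDQ  $8, SI
//     DECQ  CX
//     JMP   loop
// done:
//     MOVQ AX, ret+24(FP)
//     RET
```

The peculiar part is that it is neither GAS nor MASM: symbol names use the middle-dot `·` prefix, and FP/SB are pseudo-registers that the assembler resolves, so offsets like `a_base+0(FP)` track the Go calling convention rather than a raw stack pointer.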
It depends.
You need 2~3 accumulators to saturate instruction-level parallelism with a parallel sum reduction. But the compiler won't do it on its own, because it only creates those when the operation is associative, i.e. (a+b)+c = a+(b+c), which is true for integers but not for floats.
There is an escape hatch in -ffast-math.
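A minimal Go sketch of the multi-accumulator idea (function names are mine, not from any library): the naive loop carries one serial dependency chain, while the unrolled version keeps four independent partial sums that the CPU can execute in parallel. The compiler cannot do this rewrite for floats itself, because the rounding order, and thus potentially the result, changes.

```go
package main

import "fmt"

// sumNaive is the straightforward reduction. Each addition depends on
// the previous one, so it forms a single serial dependency chain.
func sumNaive(xs []float32) float32 {
	var s float32
	for _, x := range xs {
		s += x
	}
	return s
}

// sumUnrolled keeps four independent accumulators so additions can
// overlap across execution units. The result may differ slightly from
// sumNaive because float addition is not associative.
func sumUnrolled(xs []float32) float32 {
	var s0, s1, s2, s3 float32
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		s0 += xs[i]
		s1 += xs[i+1]
		s2 += xs[i+2]
		s3 += xs[i+3]
	}
	s := s0 + s1 + s2 + s3
	for ; i < len(xs); i++ { // leftover tail elements
		s += xs[i]
	}
	return s
}

func main() {
	xs := make([]float32, 1000)
	for i := range xs {
		xs[i] = 1.0
	}
	fmt.Println(sumNaive(xs), sumUnrolled(xs))
}
```

With -ffast-math (or -fassociative-math), a C compiler is allowed to perform exactly this transformation automatically, which is why it is the escape hatch mentioned above.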
I have extensive benches on this here: https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...
Even manual vectorization is a pain... writing ASM, really?
Rust has unstable portable SIMD and a few third-party crates; C++ has that as well; C# has stable portable SIMD and a very small BLAS-like library on top of it (it even exercises packed SIMD when run in a browser); and Java is getting stable Panama vectors some time in the future (though the question of codegen quality stays open given planned changes to the unsafe API).
Go is uniquely disadvantaged among these. And if that's not enough, you may want to visit the 1BRC challenge discussions and see that Go struggles to get anywhere close to the 2s mark while both C# and C++ blaze past it:
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
https://github.com/gunnarmorling/1brc/discussions/67
Ahh yes, thank you.
Rust doesn't have a -ffast-math flag, though it is interesting that you passed it directly to LLVM. I'm kinda glad that escape hatch doesn't work, to be honest.
There are currently unstable intrinsics that let you do this, and you seemingly get close to clang's codegen with them: https://godbolt.org/z/EEW79Gbxv
The tracking thread discusses another attempted flag that would enable this by turning on the CPU feature directly, but that doesn't seem to affect codegen in this case: https://github.com/rust-lang/rust/issues/21690
It would be nice to get these intrinsics stabilized, at least.
For other languages (including Node.js/Bun/Rust/Python, etc.) you can have a look at SimSIMD [0], which I contributed to this year (I made precompiled binaries for Node.js/Bun part of the build process for x86_64 and arm64 on macOS and Linux, and x86 and x86_64 on Windows).
[0] https://github.com/ashvardanian/SimSIMD
I did a similar optimization, using https://github.com/viterin/vek as the SIMD version. Some somewhat unscientific measurements showed a 10x improvement while staying in float32: https://github.com/stillmatic/gollum/blob/07a9aa35d2517af8cf...
TBH my takeaway was that it was more useful to use smaller vectors as the representation.
C++ users can enjoy Highway [1].
[1] https://github.com/google/highway/
This is a task where Halide (https://halide-lang.org/) could really shine! It disconnects logic from scheduling (unrolling, vectorizing, tiling, caching intermediates, etc.), so every step the author describes in the article is a tunable in Halide. Halide doesn't appear to have Go bindings, so calling C++ from Go might be the only viable option.