From slow to SIMD: A Go optimization story

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. avo

    Generate x86 Assembly with Go

    I wonder whether avo could have been useful here?[1] I mention it because it came up the last time we were talking about AVX operations in go.[2]

    [1] https://github.com/mmcloughlin/avo

    [2] https://news.ycombinator.com/item?id=34465297

  3. Apache Arrow

    Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

    I learned yesterday about Go's assembler (https://go.dev/doc/asm) after browsing how Arrow is implemented for different languages (my experience is mainly C/C++). In https://github.com/apache/arrow/tree/main/go/arrow/math there are a bunch of .S ("asm") files, and I still can't quite comprehend how they work (I guess it'll take more reading). It seems very peculiar.

    The last time I used inline assembly was back in Turbo/Borland Pascal, then a bit in Visual Studio (32-bit), until it got disabled. Then I did a little with gcc, which has a stricter specification (with the former you had to know how the ABI worked; with the latter too, but it was specced out).

    Anyway - I wasn't expecting to find this in "Go" :) But I guess you can always start with .go code then produce assembly (-S) then optimize it, or find/hire someone to do it.
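    As a minimal sketch of that "start with .go code, then look at the assembly" workflow: the Dot function below is a hypothetical kernel, not code from the article or from Arrow. The Go toolchain can print the assembly it generates for it with `go build -gcflags=-S`, and `go tool objdump` can disassemble the built binary; both commands are shown in the comments.

```go
package main

import "fmt"

// Dot is a hypothetical scalar kernel used only to have something to inspect.
//
// To see the assembly the Go compiler generates for it, run either of:
//
//	go build -gcflags=-S .             (listing is printed during compilation)
//	go tool objdump -s 'main\.Dot' BIN (BIN is the binary produced by go build)
//
// The listing uses Go's own assembler syntax, described at https://go.dev/doc/asm.
func Dot(a, b []float32) float32 {
	if len(a) != len(b) {
		panic("slices must have equal length")
	}
	var sum float32
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(Dot([]float32{1, 2, 3}, []float32{4, 5, 6}))
}
```

    A hand-written .s file can then replace the hot loop, keeping the Go version as a reference implementation to test against.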

  4. laser

    The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers (by mratsim)

    It depends.

    You need 2~3 accumulators to saturate instruction-level parallelism with a parallel sum reduction. But the compiler won't do it because it only creates those when the operation is associative, i.e. (a+b)+c = a+(b+c), which is true for integers but not for floats.

    There is an escape hatch in -ffast-math.

    I have extensive benches on this here: https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...
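    The multiple-accumulator idea can be sketched in Go by doing the reassociation by hand. This is an illustration only, not code from the linked benchmarks: the function names are made up, and four accumulators is an arbitrary choice (the comment suggests two or three are enough).

```go
package main

import "fmt"

// SumNaive uses a single accumulator, so every addition depends on the
// result of the previous one and the additions cannot overlap.
func SumNaive(xs []float32) float32 {
	var s float32
	for _, x := range xs {
		s += x
	}
	return s
}

// SumUnrolled4 keeps four independent accumulators. The four additions in
// the loop body do not depend on each other, so the CPU is free to execute
// them in parallel. This changes the order of the floating-point additions,
// which is exactly the reassociation a compiler will not do on its own
// without something like -ffast-math, so the result may differ from
// SumNaive in the last bits for general inputs.
func SumUnrolled4(xs []float32) float32 {
	var s0, s1, s2, s3 float32
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		s0 += xs[i]
		s1 += xs[i+1]
		s2 += xs[i+2]
		s3 += xs[i+3]
	}
	// Handle the remaining 0 to 3 elements.
	for ; i < len(xs); i++ {
		s0 += xs[i]
	}
	return (s0 + s1) + (s2 + s3)
}

func main() {
	// Small multiples of 0.25 keep every partial sum exactly representable,
	// so both orderings agree for this particular input.
	xs := make([]float32, 1003)
	for i := range xs {
		xs[i] = float32(i%7) * 0.25
	}
	fmt.Println(SumNaive(xs), SumUnrolled4(xs))
}
```

    Whether this is actually faster depends on the CPU and compiler, so it is worth measuring with a benchmark before relying on it.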

  5. 1brc

    1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

    Even manual vectorization is a pain... writing ASM, really?

    Rust has unstable portable SIMD and a few third-party crates, C++ has that as well, C# has stable portable SIMD and a very small BLAS-like library on top of it (hell, it even exercises PackedSIMD when run in a browser), and Java is getting stable Panama vectors some time in the future (though the question of codegen quality remains open given planned changes to the unsafe API).

    Go is uniquely disadvantaged among these. And if that's not enough, you may want to visit the 1BRC challenge discussions and see that Go struggles to get anywhere close to the 2s mark while both C# and C++ are blazing past it:

    https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...

    https://github.com/gunnarmorling/1brc/discussions/67

  6. rust

    Empowering everyone to build reliable and efficient software.

    Ahh yes, thank you.

    Rust doesn't have a -ffast-math flag, though it is interesting that you passed it directly to llvm. I am kinda glad that escape hatch doesn't work, to be honest.

    There are currently unstable intrinsics that let you do this, and you seemingly get close to clang codegen with them: https://godbolt.org/z/EEW79Gbxv

    The thread tracking this discusses another attempt at a flag to enable this by turning on the CPU feature directly, but that doesn't seem to affect codegen in this case. https://github.com/rust-lang/rust/issues/21690

    It would be nice to get these intrinsics stabilized, at least.

  7. SimSIMD

    Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐

    For other languages (including nodejs/bun/rust/python etc.) you can have a look at SimSIMD [0], which I have contributed to this year (I made precompiled binaries for nodejs/bun part of the build process for x86_64 and arm64 on Mac and Linux, and x86 and x86_64 on Windows).

    [0] https://github.com/ashvardanian/SimSIMD

  8. vek

    SIMD Accelerated vector functions for Go

    I did a similar optimization via https://github.com/viterin/vek as the SIMD version. Some somewhat unscientific calculations showed a 10x improvement staying in float32: https://github.com/stillmatic/gollum/blob/07a9aa35d2517af8cf...

    TBH my takeaway was that it was more useful to use smaller vectors as a representation

  10. gollum

    Production grade LLM-ops in Golang (by stillmatic)

    I did a similar optimization via https://github.com/viterin/vek as the SIMD version. Some somewhat unscientific calculations showed a 10x improvement staying in float32: https://github.com/stillmatic/gollum/blob/07a9aa35d2517af8cf...

    TBH my takeaway was that it was more useful to use smaller vectors as a representation

  11. highway

    Performance-portable, length-agnostic SIMD with runtime dispatch

    C++ users can enjoy Highway [1].

    [1] https://github.com/google/highway/

  12. Halide

    a language for fast, portable data-parallel computation

    This is a task where Halide (https://halide-lang.org/) could really shine! It separates logic from scheduling (unrolling, vectorizing, tiling, caching intermediates, etc.), so every step the author describes in the article is a tunable in Halide. Halide doesn't appear to have bindings for Go, so calling C++ from Go might be the only viable option.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
