optimization-manual
sb_lower_bound
Metric | optimization-manual | sb_lower_bound
---|---|---
Mentions | 3 | 8
Stars | 738 | 14
Growth | 1.9% | -
Activity | 3.8 | 3.9
Last commit | 2 months ago | 10 months ago
Language | Assembly | C++
License | BSD Zero Clause License | -
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
optimization-manual
- Fastest Branchless Binary Search
There are two ways I vectorized linear and binary search (in practice you often want a combination; always benchmark on your real-world datasets!):
  - Do N binary searches simultaneously, so that each lane is essentially doing one bsearch. Obviously, this only works if you are doing multiple searches.
  - Use the VPCONFLICT instruction for the linear search parts. There's even code from the Intel SDM doing it: https://github.com/intel/optimization-manual/blob/main/chap18/ex20/avx512_vector_dp.asm
- Zen4's AVX512 Teardown
The Intel optimization manual has a fun example where they use vpconflict for vectorizing sparse dot products: https://github.com/intel/optimization-manual/blob/main/chap1...
I benchmarked it on Intel, and it was indeed quite fast/a good improvement over the scalar version. Will be interesting to try that on AMD.
- Intel Optimization Reference Manual Code Samples
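The first approach from the binary search comment above, running several searches in lockstep so that each lane performs one search, can be sketched in portable C++. This is not the commenter's code and it uses no SIMD intrinsics: the function name `batched_lower_bound` and the lane count `kLanes` are assumptions made for illustration. The inner loop over lanes marks where a real vectorized version would use a gather and a masked add.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical lane count, chosen for illustration only.
constexpr std::size_t kLanes = 8;

// Runs kLanes lower-bound searches over the same sorted array in lockstep.
// For each lane, returns the index of the first element >= keys[lane]
// (sorted.size() if no such element exists).
std::array<std::size_t, kLanes> batched_lower_bound(
    const std::vector<int>& sorted, const std::array<int, kLanes>& keys) {
    std::array<std::size_t, kLanes> first{};  // per-lane start of the remaining range
    std::size_t length = sorted.size();       // shared: every lane shrinks identically
    while (length > 0) {
        const std::size_t half = length / 2;
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            // A SIMD version would gather sorted[first + half] for all lanes,
            // compare against the keys, and add under the resulting mask.
            const bool less = sorted[first[lane] + half] < keys[lane];
            first[lane] += less ? (length - half) : 0;
        }
        length = half;
    }
    return first;
}
```

Because the remaining length depends only on the array size and not on the keys, all lanes finish after the same number of steps, which is what makes the lockstep layout workable. Whether a compiler vectorizes the inner loop by itself is not guaranteed, so benchmark before relying on it.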
sb_lower_bound
- Fastest Branchless Binary Search
Then you'll want to look at https://mhdm.dev/posts/sb_lower_bound/#prefetching
100 MB is large enough that the branchy version turns out to have a slight advantage, due more to quirks of x86 (speculative execution) than to being inherently better.
"very similar topic" is an understatement. Funnily enough, the "implementation to perform the best on Apple M1 after all micro-optimizations are applied" in the Conclusion is equivalent to sb_lower_bound in terms of how many actual comparisons are made. Out of curiosity I've benchmarked the two, and orlp lower_bound seems to perform slightly worse: ~39ns average (using gcc) vs ~33ns average for sb_lower_bound (using clang -cmov). I'm comparing best runs for both, with the usual disclaimer that this was tested on my machine.
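For readers unfamiliar with the technique being compared, here is a minimal sketch of a branchless lower bound in C++. It is written in the spirit of the approach discussed above and is not presented as the sb_lower_bound source. The name `branchless_lower_bound` and the restriction to `std::vector<int>` are assumptions made to keep the example short.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first element not less than `value`,
// or sorted.size() if every element is smaller.
std::size_t branchless_lower_bound(const std::vector<int>& sorted, int value) {
    const int* first = sorted.data();
    std::size_t length = sorted.size();
    while (length > 0) {
        const std::size_t half = length / 2;
        // Expressed as a select rather than an if, so the compiler may emit
        // a conditional move. That choice is up to the compiler, not guaranteed.
        first += (first[half] < value) ? (length - half) : 0;
        length = half;
    }
    return static_cast<std::size_t>(first - sorted.data());
}
```

The loop count depends only on the array size, so the only data-dependent operation is the select on `first`. Whether that compiles to a `cmov` or to a branch varies by compiler and flags, which fits the gcc versus clang differences mentioned in the comment above.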
What are some alternatives?
AvxMath
ThinkingInSimd - An essay comparing performance implications of ignoring AVX acceleration
tigerbeetle - The distributed financial transactions database designed for mission critical safety and performance.
zig - General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
amh-code - Complete implementations from "Algorithms for Modern Hardware"
Nim - Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
rust - Empowering everyone to build reliable and efficient software.
branchless-binary-search - Binary search implementation that avoids branch instructions