10~17x faster than what? A performance analysis of Intel x86-SIMD-sort (AVX-512)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • sort-research-rs

    Test and benchmark suite for sort implementations.

  • Ok so both the things you linked are not my ideas. The first one is AFAIK originally from the BlockQuicksort paper https://arxiv.org/pdf/1604.06697.pdf section 3.2. Which Orson Peters used in his pdqsort, which I know base ipnsort on. The second one is from Igor van den Hoven in his work on quadsort as I mention in the function description. Really a lot of this stuff is building on top of other peoples work and refining it. Seemingly I'm the first to write neat little ASCII graphics for them in the code making them easier to approach. There are some novel ideas by me in there too though, for example https://github.com/Voultapher/sort-research-rs/blob/42bf339d... the way I marry a limited number of sorting-networks, insertion sort and the bi-directional merge into one, very fast but binary size and compile time efficient package is novel. Generally I'd say my work has been more about figuring out novel ways to combine existing ideas into one good cohesive package fit for a standard library sort implementation.

  • fluxsort

    A fast branchless stable quicksort / mergesort hybrid that is highly adaptive.

  • Steps to build a fast, highly adaptive AVX-512 sorting algorithm:

    - Clone fluxsort (https://github.com/scandum/fluxsort)

    - Replace the partitioning code in flux_default_partition and flux_reverse_partition with the obvious AVX-512 version using a compare and two compress instructions

    - If you're feeling ambitious, swap out the small array sorting, or incorporate crumsort's fulcrum partition for larger arrays.

    I know why I haven't done this: my computer doesn't have AVX-512, and hardly anyone else's seems to. Maybe a couple Zen 4 owners. I'm less clear on why the tech giants are reinventing the wheel to make these sorting alrogithms that don't even handle pre-sorted data rather than working with some of the very high-quality open source stuff out there. Is adaptivity really considered that worthless?

    Fluxsort makes this particularly simple because it gets great performance out of a stable out-of-place partition. It's a bit newer; maybe the authors weren't aware of this work. But these algorithms both use (fairly difficult) in-place partitioning code; why not slot that into the well-known pdqsort?

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • Presentations

    Collection of personal presentations (by Voultapher)

  • If you look at the history of the author's own sorting implementation (ipnsort), it started as an attempt to port fluxsort to Rust:

    https://github.com/Voultapher/Presentations/blob/master/rust...

    It starts there, eventually the author abandons the linked PR because of a better approach found with ipnsort:

    https://github.com/rust-lang/rust/pull/100856#issuecomment-1...

  • rust

    Empowering everyone to build reliable and efficient software.

  • If you look at the history of the author's own sorting implementation (ipnsort), it started as an attempt to port fluxsort to Rust:

    https://github.com/Voultapher/Presentations/blob/master/rust...

    It starts there, eventually the author abandons the linked PR because of a better approach found with ipnsort:

    https://github.com/rust-lang/rust/pull/100856#issuecomment-1...

  • mountain-sort

    The best algorithm to sort mountains

  • It depends on the size of the structs. For struct pointers you're likely better off sorting keys and pointers simultaneously. It doesn't matter much until you get to large sizes (millions), but sorting indices and then selecting with them is random access. If the original ordering is messy, selecting can be slower than the sorting step. For structs a few words long, the unit you're moving is a larger portion of a cache line, and I'd expect the data movement to drown out any SIMD advantage. A radix sort might be all right because it moves less, but I'd probably go with sorting indices as the first thing to try unless I knew the arrays were huge. For very large structs there's an interesting effort called mountain sort[0], "probably the best sorting algorithm if you need to sort actual mountains by height". Given that it minimizes number of moves it's ignoring cache entirely. I haven't benchmarked so I can't say much about how practical it is.

    [0] https://github.com/Morwenn/mountain-sort

  • highway

    Performance-portable, length-agnostic SIMD with runtime dispatch

  • The vqsort README says it is a non-issue on that generation CPU based on their benchmarks: https://github.com/google/highway/tree/master/hwy/contrib/so...

  • quadsort

    Quadsort is a branchless stable adaptive mergesort faster than quicksort.

  • https://github.com/scandum/quadsort/blob/f171a0b26cf6bd6f6dc...

    As you can see, quadsort 1.1.4.1 used 2 instead of 4 writes in the bi-directional parity merges. This was in June 2021, and would have compiled as branchless with clang, but as branched with gcc.

    When I added a compile time check to use ternary operations for clang I was not adapting your work. I was well aware that clang compiled ternary operations as branchless, but I wasn't aware that rust did as well. I added the compile time check to use ternary operations for a fair performance comparison against glidesort.

    https://raw.githubusercontent.com/scandum/fluxsort/main/imag...

    As for ipnsort's small sort, it is very similar to quadsort's small sort, which uses stable sorting networks, instead of unstable sorting networks. From my perspective it's not exactly novel. I didn't go for unstable sorting networks in crumsort to increase code reuse, and to not reduce adaptivity.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • rotate

    A collection of array rotation algorithms.

  • quadsort/fluxsort/crumsort author here.

    For me there's a strong visual component, perhaps most obvious for my work on array rotation algorithms.

    https://github.com/scandum/rotate

    There's also the ability to notice strange/curious/discordant things, and either connect the dots through trying semi-random things, as well as sudden insights which seem to be partially subconscious.

    One of my (many) theories is that I have the ability to use long-term memory in a quasi-similar manner to short-term memory for problem solving. My IQ is in the 120-130 range, I suffer from hypervigilance, so it's generally on the lower end due to lack of sleep.

    I'd say there's a strong creative aspect. If I could redo life I might try my hand at music.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts