The One Billion Row Challenge in CUDA: from 17 minutes to 17 seconds

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • 1brc

    1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

  • 1brc had SMT disabled [0]. (Hyperthreading is Intel's marketing name and trademark for its brand of SMT, but the benchmark was run on an AMD CPU.)

    [0] https://github.com/gunnarmorling/1brc/issues/189#issuecommen...

  • xxHash

    Extremely fast non-cryptographic hash algorithm

  • > GPU Hash Table?

    How badly would performance have suffered if you had SHA-256'd the lines to build the map? I'm going to guess "badly"?

    Maybe something like this in CUDA: https://github.com/Cyan4973/xxHash ?
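The trade-off the comment is pointing at can be sketched with FNV-1a, another cheap non-cryptographic hash (xxHash is faster still, but has the same byte-mixing shape); the function name and constants below are the standard 64-bit FNV-1a parameters, not code from the thread:

```cpp
#include <cstdint>
#include <string_view>

// FNV-1a 64-bit: a xor and a multiply per byte, no cryptographic
// overhead. SHA-256 does dozens of rounds per 64-byte block, which is
// wasted work when you only need a bucket index.
constexpr uint64_t fnv1a(std::string_view s) {
    uint64_t h = 1469598103934665603ULL;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;             // FNV prime
    }
    return h;
}
```

A table lookup then becomes `fnv1a(station_name) % table_size`, which is a handful of instructions per key rather than a full cryptographic digest.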

  • 1brc

    C99 implementation of the 1 Billion Rows Challenge. 1️⃣🐝🏎️ Runs in ~1.6 seconds on my not-so-fast laptop CPU w/ 16GB RAM. (by dannyvankooten)

  • There are some good ideas for this type of problem here: https://github.com/dannyvankooten/1brc

    After you deal with parsing and hashing, you are basically I/O-limited, so mmap helps. A reasonable guess is that even an optimal CUDA implementation would be slower than this pure C code: there is no compute to speak of other than a hashmap, so kernel launches and host-to-GPU data transfer would likely add a noticeable bottleneck.
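A minimal sketch of the mmap point, assuming a POSIX system: mapping the file lets the kernel page data in on demand instead of copying it through `read()` buffers. `count_lines_mmap` is an illustrative helper, not code from the linked repo:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>

// Map a file read-only and scan it in place, with no userspace copy.
size_t count_lines_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st{};
    fstat(fd, &st);
    size_t n = 0;
    if (st.st_size > 0) {
        char* data = static_cast<char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
        if (data != MAP_FAILED) {
            n = static_cast<size_t>(
                std::count(data, data + st.st_size, '\n'));
            munmap(data, st.st_size);
        }
    }
    close(fd);
    return n;
}
```

The same mapped pointer can be handed to multiple threads, each scanning its own byte range, which is how several of the fast CPU implementations parallelize.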

  • parallel-hashmap

    A family of header-only, very fast and memory-friendly hashmap and btree containers.

  • Standard-library maps/unordered_maps are notoriously slow anyway. A flat_hash_map from Abseil or parallel-hashmap[1] would be better.

    [1] https://github.com/greg7mdp/parallel-hashmap
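The per-station aggregation these comments are optimizing can be sketched with `std::unordered_map`; `phmap::flat_hash_map` from parallel-hashmap mirrors the standard interface, so swapping it in is nearly a drop-in change. `Agg` and `StationMap` are illustrative names, not from the thread:

```cpp
#include <algorithm>
#include <limits>
#include <string>
#include <unordered_map>

// Running aggregate per weather station, as the challenge requires:
// min, max, and sum/count for the mean.
struct Agg {
    double min = std::numeric_limits<double>::max();
    double max = std::numeric_limits<double>::lowest();
    double sum = 0.0;
    long   count = 0;
    void add(double v) {
        min = std::min(min, v);
        max = std::max(max, v);
        sum += v;
        ++count;
    }
};

// std::unordered_map chases a pointer per bucket (node-based storage);
// an open-addressing table like phmap::flat_hash_map keeps entries
// contiguous and is typically much faster for this access pattern.
using StationMap = std::unordered_map<std::string, Agg>;
```

With one billion `operator[]` calls, the per-lookup cost of the map dominates once parsing is cheap, which is why the choice of hash table matters so much here.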

NOTE: The number of mentions counts appearances in common posts plus user-suggested alternatives, so a higher number indicates a more popular project.
