The One Billion Row Challenge in CUDA: from 17 minutes to 17 seconds

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

CodeRabbit: AI Code Reviews for Developers
Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.
coderabbit.ai
featured
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com
featured
  1. 1brc

    1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

    1brc had SMT disabled [0]. (hyperthreading is an intel marketing name and trademark for their brand of smt, but the benchmark was run on an amd cpu)

    [0] https://github.com/gunnarmorling/1brc/issues/189#issuecommen...

  2. CodeRabbit

    CodeRabbit: AI Code Reviews for Developers. Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.

    CodeRabbit logo
  3. xxHash

    Extremely fast non-cryptographic hash algorithm

    > GPU Hash Table?

    How bad would performance have suffered if you sha256'd the lines to build the map? I'm going to guess "badly"?

    Maybe something like this in CUDA: https://github.com/Cyan4973/xxHash ?

  4. 1brc

    C11 implementation of the 1 Billion Rows Challenge. 1️⃣🐝🏎️ Runs in ~1.6 seconds on my not-so-fast laptop CPU w/ 16GB RAM. (by dannyvankooten)

    There are some good ideas for this type of problem here: https://github.com/dannyvankooten/1brc

    After you deal with parsing and hashes, basically you are IO limited so mmap helps. A reasonable guess is that even for the optimal CUDA implementation, because there is no compute to speak of other than a hashmap, the starting of kernels and transfer of data to the GPU would likely add a noticeable bottleneck and make the optimal CUDA code slower than this pure C code.

  5. parallel-hashmap

    A family of header-only, very fast and memory-friendly hashmap and btree containers.

    Standard library maps/unordered_maps are themselves notoriously slow anyway. A sparse_hash_map from abseil or parallel-hashmaps[1] would be better.

    [1] https://github.com/greg7mdp/parallel-hashmap

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • No clue how other people are hitting <200ms on Day 23 (C++)

    5 projects | /r/adventofcode | 24 Dec 2021
  • Meow Hash

    13 projects | news.ycombinator.com | 29 Oct 2021
  • 76% Faster CPython

    9 projects | /r/Python | 30 Mar 2021
  • Parallel-hashmap: drop-in replacement for unordered_map, unordered_set

    2 projects | news.ycombinator.com | 5 Jan 2025
  • Breaking CityHash64, MurmurHash2/3, wyhash, and more

    3 projects | news.ycombinator.com | 3 Nov 2024