The One Billion Row Challenge

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • 1brc

    1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

  • As far as I can see, the currently best-performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?

    [0] https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...
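The concern raised in this comment can be illustrated with a minimal Python sketch. It is not taken from the linked solution. The deliberately weak hash function `weak_hash`, the `aggregate` helper, and the table capacity are all hypothetical, chosen only to force two distinct keys into the same slot and show what happens when the stored key is not compared.

```python
def weak_hash(key: bytes) -> int:
    # Hypothetical, deliberately poor hash: anagrams such as b"ab" and b"ba" collide.
    return sum(key)


def aggregate(rows, capacity=16, compare_keys=True):
    """Sum values per key in an open-addressing table.

    Assumes fewer distinct keys than `capacity`; otherwise probing never ends.
    """
    slots = [None] * capacity
    for key, value in rows:
        i = weak_hash(key) % capacity
        while True:
            slot = slots[i]
            if slot is None:
                # Empty slot: first time this key (or this hash) is seen.
                slots[i] = [key, value, 1]
                break
            if not compare_keys or slot[0] == key:
                # Without the key comparison, any colliding key lands here.
                slot[1] += value
                slot[2] += 1
                break
            # Occupied by a different key: try the next slot (linear probing).
            i = (i + 1) % capacity
    return {s[0]: (s[1], s[2]) for s in slots if s is not None}


rows = [(b"ab", 10), (b"ba", 20)]
merged_by_mistake = aggregate(rows, compare_keys=False)
kept_separate = aggregate(rows, compare_keys=True)
```

With the key comparison disabled, both rows are folded into a single entry under the first key. With it enabled, the second key probes forward to its own slot. Whether a collision actually occurs in practice depends on the hash function and the set of city names.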

  • 1brc

    C99 implementation of the 1 Billion Rows Challenge. 1️⃣🐝🏎️ Runs in ~1.6 seconds on my not-so-fast laptop CPU w/ 16GB RAM. (by dannyvankooten)

  • Very fun challenge that nerd-sniped me right away. Had to do a C version in standard C99 with POSIX threads. It[1] clocks in at just under 4 seconds on my AMD Ryzen 4800U laptop CPU.

    Should run about 10-20% faster than that on the mentioned Hetzner hardware.

    - Since the values have only one decimal digit, it uses integer math right from the get-go.

    - FNV-1a hash with linear probing and a load factor well under 0.5.

    - Data file is mmap’d into memory.

    - Data is processed in 8 totally separate chunks (no concurrent data structures) and then those aggregations are in turn aggregated when all threads have finished.

    1: https://github.com/dannyvankooten/1brc
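The approach listed in this comment can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of the linked C code. The helper names (`parse_tenths`, `upsert`, `split_chunks`, `process_chunk`, `aggregate`), the table capacity, and the use of a thread pool are hypothetical. The sketch also assumes every temperature has exactly one fractional digit and that the input is an in-memory `bytes` object rather than an mmap'd file.

```python
from concurrent.futures import ThreadPoolExecutor

FNV_OFFSET = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193   # 32-bit FNV prime
CAPACITY = 1 << 12       # hypothetical size: power of two, must exceed the key count


def fnv1a(key: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET
    for byte in key:
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFF
    return h


def parse_tenths(text: bytes) -> int:
    """Parse b"-12.3" into -123 (tenths of a degree) without floats."""
    negative = text.startswith(b"-")
    whole, _, frac = text.lstrip(b"-").partition(b".")
    value = int(whole) * 10 + int(frac)
    return -value if negative else value


def upsert(table, key, lo, hi, total, count):
    """Insert or merge one entry, resolving collisions by linear probing."""
    i = fnv1a(key) & (CAPACITY - 1)
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) & (CAPACITY - 1)
    if table[i] is None:
        table[i] = [key, lo, hi, total, count]
    else:
        entry = table[i]
        entry[1] = min(entry[1], lo)
        entry[2] = max(entry[2], hi)
        entry[3] += total
        entry[4] += count


def split_chunks(data: bytes, n: int):
    """Cut data into at most n pieces, each ending on a line boundary."""
    chunks, start = [], 0
    for i in range(1, n + 1):
        newline = data.find(b"\n", len(data) * i // n)
        end = len(data) if i == n or newline == -1 else newline + 1
        if end > start:
            chunks.append(data[start:end])
            start = end
    return chunks


def process_chunk(chunk: bytes):
    """Aggregate one chunk into its own private table (nothing is shared)."""
    table = [None] * CAPACITY
    for line in chunk.splitlines():
        name, _, temp = line.partition(b";")
        t = parse_tenths(temp)
        upsert(table, name, t, t, t, 1)
    return table


def aggregate(data: bytes, workers: int = 8):
    """Return {name: (min, max, sum, count)} with temperatures in tenths."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tables = list(pool.map(process_chunk, split_chunks(data, workers)))
    merged = [None] * CAPACITY
    for table in tables:  # merge the per-chunk results once all workers finish
        for entry in table:
            if entry is not None:
                upsert(merged, *entry)
    return {e[0].decode(): tuple(e[1:]) for e in merged if e is not None}


sample = b"Hamburg;12.0\nOslo;-3.5\nHamburg;8.9\nOslo;0.5\n"
result = aggregate(sample)
```

Each worker fills a private table, so no locking is needed until the final merge. In CPython the thread pool only illustrates the structure, since the interpreter lock prevents the parsing from running in parallel. The mean per station is `sum / count / 10`, computed once at output time.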

  • 1brc

  • Yeah, so I had a discussion on Twitter about this. It turns out 12GB is small enough to fit into memory, and the author evaluates each submission by running it 5 times in a row, so using direct IO actually hurts: having the kernel cache the file keeps it in memory for the 4 runs after the first. I have a direct IO solution with SIMD string search and double parsing, just in C++ (using libraries). It runs in 6 seconds on my 24-core Linux box (NVMe).

    Code: https://github.com/rockwotj/1brc

    Discussion on Filesystem cache: https://x.com/rockwotj/status/1742168024776430041?s=20

  • 1brc

    1BRC in .NET among fastest on Linux (by buybackoff)

  • JDK

    JDK main-line development https://openjdk.org/projects/jdk
