Building a faster hash table for high performance SQL joins

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • hashtable-benchmarks

    An Evaluation of Linear Probing Hashtable Algorithms

  • Since the blog post mentioned a PR to replace linear probing with Robin Hood, I just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks (a rough sketch of the BLP lookup follows the links below):

    https://github.com/senderista/hashtable-benchmarks/blob/mast...

    https://github.com/senderista/hashtable-benchmarks/wiki/64-b...
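
    For context, here is a rough Java sketch of the lookup side of bidirectional linear probing. It is not code from the linked repo: it assumes the usual BLP invariants (occupied slots kept sorted by hash code, each element staying inside the occupied run that covers its home slot), non-negative 64-bit hash codes with 0 reserved as the empty marker, and it ignores wrap-around at the table boundary. All names are illustrative.

        final class BlpLookupSketch {
            private final long[] hashes;    // stored hash codes; 0 marks an empty slot (by assumption)
            private final int log2Capacity;

            BlpLookupSketch(int log2Capacity) {
                this.log2Capacity = log2Capacity;
                this.hashes = new long[1 << log2Capacity];
            }

            // Home slot taken from the high bits of the hash, so that hash order
            // and home-slot order agree across the table.
            private int home(long h) {
                return (int) (h >>> (63 - log2Capacity));
            }

            boolean contains(long h) {
                int i = home(h);
                if (hashes[i] == 0) return false;             // empty home slot: key absent
                while (hashes[i] > h) i--;                    // stored hash too large: probe left
                while (hashes[i] != 0 && hashes[i] < h) i++;  // stored hash too small: probe right
                return hashes[i] == h;
            }
        }

    Insertion, which shifts a sorted run either left or right to make room (rather than only forward as in Robin Hood), is the fiddly part and is omitted here.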

  • CBQN

    a BQN implementation in C

  • Worth pointing out that this can depend a lot more on fiddly details than you might expect. In particular, you're dealing with a small fixed key width, which allows the hash to be stored in the table instead of the key. The article emphasizes variable-length keys, and I don't see any specialization on key sizes (if 4- and 8-byte keys aren't common then this makes sense; if they are then I'd expect dedicated table code for those sizes to be valuable). And set lookups are also just a bit different from value lookups. I think these cases are different enough that I have no idea if the results would carry over, although I can see how the bidirectional approach would reduce probing more than RH, which seems good.

    ...and since I've done a lot of work with Robin Hood on small-key lookups, I can point out some little tweaks that have made a big difference for me. I have 8-byte lookups at just over 3ns/lookup[0], albeit at a very low load factor, typically <50%. A key step was to use the maximum possible hash as a sentinel value, handling it specially in case it shows up in the data. This way, instead of probing until finding an empty bucket or greater hash, probing just finds the first slot that's greater than or equal to the requested key's hash. So the lookup code[1] is very simple (the rest, not so much). The while loop is only needed on a hash collision, so at a low load factor a lookup is effectively branchless. However, these choices are specialized for a batched search where the number of insertions never has to be higher than the number of searches, and all the insertions can be done first. They're also focused on small-ish (under a million entries) tables. (A rough Java sketch of this lookup appears after the footnote links below.)

    [0] https://mlochbaum.github.io/bencharray/pages/search.html

    [1] https://github.com/dzaima/CBQN/blob/5c7ab3f/src/singeli/src/...
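
    A rough Java rendering of that sentinel trick (the real implementation is the Singeli/C code linked in [1]): assume non-negative 63-bit hash codes stored directly in the table, a power-of-two capacity plus a few slack slots at the end so a forward scan never runs off the array, and the maximum long value reserved as the "empty" sentinel, which real code must handle specially if it shows up as an actual hash. The class name, the slack policy, and the insert helper are illustrative assumptions, not CBQN's code.

        final class SentinelRobinHoodSketch {
            private static final long EMPTY = Long.MAX_VALUE;  // sentinel: compares >= any real hash
            private final long[] hashes;
            private final int shift;                            // home slot = h >>> shift

            SentinelRobinHoodSketch(int log2Capacity, int slack) {
                hashes = new long[(1 << log2Capacity) + slack];
                java.util.Arrays.fill(hashes, EMPTY);
                shift = 63 - log2Capacity;
            }

            // Robin Hood order keeps stored hashes non-decreasing along a cluster, so
            // the first slot whose hash is >= h either holds h or proves it is absent.
            boolean contains(long h) {
                int i = (int) (h >>> shift);     // home slot
                while (hashes[i] < h) i++;       // loops only on a collision; the sentinel stops it
                return hashes[i] == h;
            }

            // Sorted insert: find the first slot >= h, then shift the rest of the
            // cluster right by one. Real code must track and replenish the slack.
            void insert(long h) {
                int i = (int) (h >>> shift);
                while (hashes[i] < h) i++;
                for (long carry = h; carry != EMPTY; i++) {
                    long displaced = hashes[i];
                    hashes[i] = carry;
                    carry = displaced;
                }
            }
        }

    As described above, the while loop in contains() only runs when two hashes collide in the same home slot, so at a low load factor the common path is a single comparison.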

  • QuestDB

    An open source time-series database for fast ingest and SQL queries

  • Looks like full keys are always compared if hash codes test equal, which is what I'd expect. For example: https://github.com/questdb/questdb/blob/master/core/src/main...
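
    For readers skimming, the pattern described (illustrative Java, not QuestDB's actual code, whose details are at the truncated link above) is the usual one: the stored hash code acts as a cheap filter, and the full variable-length key is compared only when the hash codes match.

        final class HashThenKeyCompare {
            // Linear-probing lookup: a hash match is necessary but not sufficient,
            // so a full key comparison confirms it before returning the slot.
            static int find(long[] slotHashes, byte[][] slotKeys, long hash, byte[] key) {
                int mask = slotHashes.length - 1;   // power-of-two capacity assumed
                int i = (int) (hash & mask);
                while (slotKeys[i] != null) {       // null marks an empty slot
                    if (slotHashes[i] == hash && java.util.Arrays.equals(slotKeys[i], key)) {
                        return i;                   // hashes equal AND full keys equal
                    }
                    i = (i + 1) & mask;             // probe the next slot
                }
                return -1;                          // reached an empty slot: key absent
            }
        }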

