The quick and practical “MSI” hash table

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • .NET Runtime

    .NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

  • I've been implementing data structures like hash tables for a project I've been working on so it's interesting to see alternative approaches for easy to implement hash tables.

    One particular note of agreement is regarding deletion. An "Insert only hash table" is a very useful data structure which can have performance improvements given the extra overhead needed to support deletion.

    This is very similar to the design that dotnet uses for HashSet[1]. For deletion they use the term "freelist" rather than "gravestones" which is definitely less emotive language. ( They also use linear probing because it's generally much faster than double-hashing because branch prediction and memory look ahead and other things I don't fully understand mean that it can just grab the next X items from memory all at once before starting to compare them rather than having to jump around the array. )

    One choice from OP's design which I find odd is the choice of power-of-2 for the table size. I was under the impression that choosing a table size of a prime number has significant advantages such as being able to more easily expand the table.

    On a related note, my favourite hash-table is a fixed-size zero-probing "probabilistic" hash table where hash collisions are dealt with by just overwriting the hash. That makes the choice of a good size for the hash table a fun problem in itself. It's extremely fast to both implement and run if you don't mind the occasional bit of data loss, although the probability of data loss can be well controlled by choice of table size.

    [1] https://github.com/dotnet/runtime/blob/4017327955f1d8ddc4398...

  • smhasher

    Hash function quality and speed tests (by rurban)

  • When I recently went shopping for fast hashes for short strings, I settled on wyhash, but ahash[0] seemed like it would have been better if I had bothered to port it from Rust.

    > In that time you can FNV-1a a "short" string.

    Not if you read it one byte at a time like in TFA!

    It looks like the best FNV for short strings in smhasher[1] is comparably fast to ahash[2] on short strings, but I proposed doing slightly less work than ahash.

    > From the top of my head, t1ha, falkhash, meowhash and metrohash are using AES-NI and none of them are particularly fast on short inputs, and at least two of them have severe issues, despite guarding against lots of vulnerabilities, which your construction does not.

    For issues like reading past the ends of buffers and discarding the extra values, it would be nice if programmers could arrange to have buffers that could be used this way. I posted a thing for hashing strings of a fixed length though, to compare with the thing for hashing strings of a fixed length in TFA.

    [0]: https://github.com/tkaitchuck/aHash/blob/master/src/aes_hash...

    [1]: https://github.com/rurban/smhasher/blob/master/doc/FNV1a_YT....

    [2]: https://github.com/rurban/smhasher/blob/master/doc/ahash64.t...

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • referencesource

    Source from the Microsoft .NET Reference Source that represent a subset of the .NET Framework

  • Testing this using a size of 101, it only appears to give results up to 50. In fact in general appears to only give numbers up to size / 2, should that have been >> 31?

    I was also going to ask you why it needed to be "decently uniform" but then implementing it saw that the first thousand numbers all hash to the same value.

    In fact, the first 21,262,215 numbers all hash to the same value with size = 101.

    Given that .net implements Int32.GetHashCode(value) as "return value"[1], I would avoid this trick in practice in .net.

    Please correct me if I've stuffed up my implementation!

    https://referencesource.microsoft.com/#mscorlib/system/int32...

  • aHash

    aHash is a non-cryptographic hashing algorithm that uses the AES hardware instruction

  • When I recently went shopping for fast hashes for short strings, I settled on wyhash, but ahash[0] seemed like it would have been better if I had bothered to port it from Rust.

    > In that time you can FNV-1a a "short" string.

    Not if you read it one byte at a time like in TFA!

    It looks like the best FNV for short strings in smhasher[1] is comparably fast to ahash[2] on short strings, but I proposed doing slightly less work than ahash.

    > From the top of my head, t1ha, falkhash, meowhash and metrohash are using AES-NI and none of them are particularly fast on short inputs, and at least two of them have severe issues, despite guarding against lots of vulnerabilities, which your construction does not.

    For issues like reading past the ends of buffers and discarding the extra values, it would be nice if programmers could arrange to have buffers that could be used this way. I posted a thing for hashing strings of a fixed length though, to compare with the thing for hashing strings of a fixed length in TFA.

    [0]: https://github.com/tkaitchuck/aHash/blob/master/src/aes_hash...

    [1]: https://github.com/rurban/smhasher/blob/master/doc/FNV1a_YT....

    [2]: https://github.com/rurban/smhasher/blob/master/doc/ahash64.t...

  • FastHash

  • It was made (by Sanmayce) to optimize for instruction-level pipelining, and use the fact that modern CPUs have multiple execution ports. But due to those changes, it is not compatible with FNV1a anymore.

    The trick of reading in stripes is employed by many of the fastest hashes. It is kinda funny to see how one author prefers a switch case over for loops, where others prefer while loops. The differences can sometimes have a big impact on what optimizations the compiler decides to use.

    [1] https://github.com/Genbox/FastHash/blob/master/src/FastHash....

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts