I've been implementing data structures like hash tables for a project I've been working on, so it's interesting to see alternative approaches for easy-to-implement hash tables.
One particular point of agreement is deletion. An "insert-only hash table" is a very useful data structure, since it can avoid the extra overhead needed to support deletion.
This is very similar to the design that dotnet uses for HashSet[1]. For deletion they use the term "freelist" rather than "gravestones", which is definitely less emotive language. (They also use linear probing, which is generally much faster than double hashing: branch prediction, memory prefetching, and other things I don't fully understand mean the CPU can grab the next several items from memory all at once before starting to compare them, rather than jumping around the array.)
One choice in OP's design that I find odd is the power-of-2 table size. I was under the impression that a prime table size has significant advantages, such as making it easier to expand the table.
On a related note, my favourite hash table is a fixed-size, zero-probing, "probabilistic" hash table where collisions are handled by simply overwriting the existing entry. That makes choosing a good table size a fun problem in itself. It's extremely fast to both implement and run if you don't mind the occasional bit of data loss, and the probability of loss can be well controlled by the choice of table size.
[1] https://github.com/dotnet/runtime/blob/4017327955f1d8ddc4398...
When I recently went shopping for fast hashes for short strings, I settled on wyhash, but ahash[0] seemed like it would have been better if I had bothered to port it from Rust.
> In that time you can FNV-1a a "short" string.
Not if you read it one byte at a time like in TFA!
It looks like the best FNV for short strings in smhasher[1] is comparably fast to ahash[2] on short strings, but I proposed doing slightly less work than ahash.
> From the top of my head, t1ha, falkhash, meowhash and metrohash are using AES-NI and none of them are particularly fast on short inputs, and at least two of them have severe issues, despite guarding against lots of vulnerabilities, which your construction does not.
For issues like reading past the end of a buffer and discarding the extra bytes, it would be nice if programmers could arrange for buffers that can safely be used that way. I posted a version for hashing fixed-length strings, though, to compare with the fixed-length version in TFA.
[0]: https://github.com/tkaitchuck/aHash/blob/master/src/aes_hash...
[1]: https://github.com/rurban/smhasher/blob/master/doc/FNV1a_YT....
[2]: https://github.com/rurban/smhasher/blob/master/doc/ahash64.t...
Testing this with a size of 101, it only appears to give results up to 50. In general it appears to only give numbers up to size / 2; should that have been >> 31?
I was also going to ask why it needed to be "decently uniform", but on implementing it I saw that the first thousand numbers all hash to the same value.
In fact, the first 21,262,215 numbers all hash to the same value with size = 101.
Given that .NET implements Int32.GetHashCode() as just "return value"[1], I would avoid this trick in practice in .NET.
Please correct me if I've stuffed up my implementation!
https://referencesource.microsoft.com/#mscorlib/system/int32...
It was made (by Sanmayce) to optimize for instruction-level parallelism and to exploit the fact that modern CPUs have multiple execution ports. But because of those changes, it is no longer compatible with FNV-1a.
The trick of reading in stripes is employed by many of the fastest hashes. It's kinda funny to see how one author prefers a switch statement over for loops, while others prefer while loops. These differences can sometimes have a big impact on which optimizations the compiler decides to apply.
[1] https://github.com/Genbox/FastHash/blob/master/src/FastHash....