-
I implemented this in Zig earlier: https://github.com/judofyr/minz
It’s quite a neat algorithm. I saw compression ratios in the 2-3x range. However, I remember that the algorithm for finding the dictionary was a bit unclear; I wasn’t convinced that what was explained in the paper found the “optimal” dictionary, and with some slight tweaks I got wildly different results. I wonder if this implementation improves on this.
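For context on where ratios in that range come from: FSST replaces frequent substrings of up to 8 bytes with single-byte codes from a table of at most 255 symbols, and escapes any byte the table doesn't cover (escape code plus literal, i.e. 2 output bytes). Here is a minimal sketch of that encoding loop against a fixed toy table; this is not minz or the paper's implementation, and the table, code assignment and greedy longest-match loop are simplified placeholders:

    // Minimal sketch of FSST-style encoding against a fixed toy symbol table.
    // Not the minz or fsst code; table, code assignment and escape handling
    // here are illustrative placeholders.
    fn encode(input: &[u8], symbols: &[&[u8]]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < input.len() {
            // Greedy longest match against the (up to 255-entry) symbol table.
            let hit = symbols
                .iter()
                .enumerate()
                .filter(|(_, s)| input[i..].starts_with(s))
                .max_by_key(|(_, s)| s.len());
            match hit {
                Some((code, s)) => {
                    out.push(code as u8); // one output byte covers s.len() input bytes
                    i += s.len();
                }
                None => {
                    out.push(255);      // escape code
                    out.push(input[i]); // an uncovered byte costs 2 output bytes
                    i += 1;
                }
            }
        }
        out
    }

    fn main() {
        let symbols = [b"http://".as_slice(), b"www.".as_slice()];
        let encoded = encode(b"http://www.example.org", &symbols);
        // "http://" and "www." collapse to one code byte each; the rest is escaped.
        assert_eq!(encoded.len(), 2 + 2 * "example.org".len());
    }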
-
The dictionary quality was definitely highly sensitive to some of the tricks that the original authors implemented in their C++ code; many were documented in the paper, but a few were not:
1. Always promoting single-byte symbols by boosting their scores by a factor of 8 in the candidate search
2. Boosting the calculated gains of single-byte candidates by the same factor of 8 so they don't fall off in later generations
3. Using an adaptive threshold for which symbols get included as the rounds go on
I didn't document these in the blog post to keep the content accessible, but it's definitely something you find once you start digging into compression ratios [1]! Perhaps they will end up in a part 2 at some point.
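To make the shape of those heuristics concrete, here is a rough sketch of how the single-byte boosting and the adaptive threshold could plug into per-round candidate selection. None of this is lifted from the actual builder; the Candidate struct, the constants, and the threshold schedule are made-up placeholders:

    // Illustrative per-round candidate selection; names and constants are invented.
    struct Candidate {
        symbol: Vec<u8>, // 1..=8 bytes
        count: u64,      // occurrences counted in this round's sample pass
    }

    // Gain of keeping a candidate: bytes covered across the sample, with
    // single-byte candidates boosted by 8x so they survive later rounds.
    fn gain(c: &Candidate) -> u64 {
        let base = c.count * c.symbol.len() as u64;
        if c.symbol.len() == 1 { base * 8 } else { base }
    }

    // Adaptive inclusion threshold: demand more evidence in later rounds.
    // The exact schedule here is purely illustrative.
    fn min_count_for_round(round: u64, sample_len: u64) -> u64 {
        (sample_len / 4096).max(1) * (round + 1)
    }

    // Keep the best candidates for the next generation of the symbol table.
    fn select(mut candidates: Vec<Candidate>, round: u64, sample_len: u64) -> Vec<Candidate> {
        let cutoff = min_count_for_round(round, sample_len);
        // Single-byte symbols always stay in the running (trick 1).
        candidates.retain(|c| c.symbol.len() == 1 || c.count >= cutoff);
        // Rank by boosted gain (trick 2) under the adaptive cutoff (trick 3).
        candidates.sort_by_key(|c| std::cmp::Reverse(gain(c)));
        candidates.truncate(255); // FSST tables hold at most 255 symbols
        candidates
    }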
[1]: https://github.com/spiraldb/fsst/blob/develop/src/builder.rs...