Random access string compression with FSST and Rust

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. Apache Arrow

    Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

  3. minz

    Minimal string compression

    I implemented this in Zig earlier: https://github.com/judofyr/minz

    It’s quite a neat algorithm. I saw compression ratios in the 2-3x range. However, I remember that the algorithm for finding the dictionary was a bit unclear, and I wasn’t convinced that what the paper explains finds the “optimal” dictionary. With some slight tweaks I got widely different results. I wonder if this implementation improves on this.

  4. fsst

    Pure-Rust implementation of Fast Static Symbol Tables string compression (by spiraldb)
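As a rough illustration of the idea behind a static symbol table, here is a minimal sketch that replaces byte sequences with one-byte codes and escapes any byte the table does not cover. It does not use the fsst crate's API. The function names `compress` and `decompress`, the `ESCAPE` value, and the hand-picked symbols in the usage note are assumptions made for this example; a real table would be trained from sample data.

```rust
/// Marker meaning "the next byte is a raw literal" (value chosen for this sketch).
const ESCAPE: u8 = 255;

/// Greedy longest-match encoding against a fixed symbol table.
/// The symbol at index `i` is written as the single byte `i`.
fn compress(symbols: &[&[u8]], input: &[u8]) -> Vec<u8> {
    assert!(symbols.len() <= ESCAPE as usize, "codes must fit below ESCAPE");
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        // Find the longest symbol that is a prefix of the remaining input.
        let mut best: Option<(usize, usize)> = None; // (code, symbol length)
        for (code, sym) in symbols.iter().enumerate() {
            let longer = best.map_or(true, |(_, len)| sym.len() > len);
            if !sym.is_empty() && longer && rest.starts_with(sym) {
                best = Some((code, sym.len()));
            }
        }
        match best {
            Some((code, len)) => {
                out.push(code as u8);
                pos += len;
            }
            None => {
                // No symbol matches: emit the byte as an escaped literal.
                out.push(ESCAPE);
                out.push(rest[0]);
                pos += 1;
            }
        }
    }
    out
}

/// Inverse of `compress`; expects well-formed input produced with the same table.
fn decompress(symbols: &[&[u8]], data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < data.len() {
        if data[i] == ESCAPE {
            out.push(data[i + 1]);
            i += 2;
        } else {
            out.extend_from_slice(symbols[data[i] as usize]);
            i += 1;
        }
    }
    out
}
```

With the hypothetical table `[b"http://", b"www.", b".com", b"example"]`, the 22-byte string `http://www.example.com` encodes to the four codes `[0, 1, 3, 2]`. Because every string is encoded on its own against the same table, any single string can be decoded without reading its neighbors, which is what makes random access possible.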

    The dictionary quality was definitely highly sensitive to some of the tricks that the original authors implemented in their C++ code. Many were documented in the paper, but a few were not:

    1. Always promoting single-bytes by boosting their scores by a factor of 8 in candidate search

    2. Boosting the calculated gains of single-byte candidates by a factor of 8 to prevent them from falling off in later generations

    3. Having an adaptive threshold for which symbols are included as the rounds go on

    I didn't document these in the blog post to keep the content accessible, but it's definitely something you find once you start digging into compression ratios! Perhaps they will end up in a part 2 at some point.
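The tricks listed above could be sketched roughly as follows. This is not the code from the fsst crate or the original C++ implementation. The `Candidate` type, the `gain` and `select` functions, and the `min_count` parameter are hypothetical names for this example, the two boosts are folded into a single gain function for brevity, and how the threshold changes from round to round is left to the caller because the comment does not specify it.

```rust
/// A candidate symbol and how often it was seen in the training sample.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    bytes: Vec<u8>,
    count: usize,
}

/// Boost applied to single-byte candidates (the factor of 8 mentioned above).
const SINGLE_BYTE_BOOST: usize = 8;

/// Estimated gain of a candidate: occurrences times length,
/// multiplied by the boost when the candidate is a single byte.
fn gain(c: &Candidate) -> usize {
    let base = c.count * c.bytes.len();
    if c.bytes.len() == 1 {
        base * SINGLE_BYTE_BOOST
    } else {
        base
    }
}

/// Keep at most `capacity` candidates, ranked by gain.
/// `min_count` stands in for the adaptive threshold (an assumption of this
/// sketch): candidates seen fewer times than that are dropped first.
fn select(mut candidates: Vec<Candidate>, min_count: usize, capacity: usize) -> Vec<Candidate> {
    candidates.retain(|c| c.count >= min_count);
    // Highest gain first; ties are broken by byte order to keep results stable.
    candidates.sort_by(|a, b| gain(b).cmp(&gain(a)).then_with(|| a.bytes.cmp(&b.bytes)));
    candidates.truncate(capacity);
    candidates
}
```

For example, a single byte `e` seen 10 times scores 80 after the boost, ahead of `the` seen 20 times at 60. Without the boost it would score 10 and could drop out of the table in a later round.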

    [1]: https://github.com/spiraldb/fsst/blob/develop/src/builder.rs...

