Rethinking string encoding: a 37.5% more space-efficient encoding than UTF-8 in Fury

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • fury

    A blazingly fast multi-language serialization framework powered by JIT and zero-copy.

    For implementation details, https://github.com/apache/incubator-fury/blob/main/java/fury... can be taken as an example.

  • zstd

    Zstandard - Fast real-time compression algorithm

    > In such cases, the serialized binary are mostly in 200~1000 bytes. Not big enough for zstd to work

    You're not referring to the same dictionary that I am. Look at --train in [1].

    If you have a training corpus of representative data, you can generate a dictionary that you preshare on both sides which will perform much better for very small binaries (including 200-1k bytes).

    If you want maximum flexibility (i.e. you don't know the universe of representative messages ahead of time, or you want maximum compression performance), you can gather this corpus transparently as messages are generated, then generate a dictionary and attach it as sideband metadata to a message. You'll probably need to defer decoding when a message references a dictionary that hasn't been received yet (i.e. if messages can be delivered out of order relative to generation). There are other techniques you can apply, but the general rule is that a custom encoding scheme is unlikely to outperform zstd plus a representative training corpus. If it does, you'd need to actually show this rather than argue from first principles.

    [1] https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md
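
To make the preshared-dictionary idea concrete, here is a minimal sketch in Python. It does not use zstd: it swaps in the standard library's `zlib` and its preset-dictionary (`zdict`) support, since that ships with Python, and the "dictionary" is just a naive concatenation of sample messages rather than anything trained the way `zstd --train` would. The message shape, `SAMPLES`, `SHARED_DICT`, and the helper names are all hypothetical. Both sides must hold the same dictionary bytes for decoding to work.

```python
import json
import zlib
from typing import Optional

# Hypothetical sample messages standing in for a "representative corpus".
SAMPLES = [
    {"user_id": 1001, "event": "login", "region": "us-east", "ok": True},
    {"user_id": 1002, "event": "logout", "region": "eu-west", "ok": True},
    {"user_id": 1003, "event": "login", "region": "ap-south", "ok": False},
]


def encode(msg: dict) -> bytes:
    """Serialize a message deterministically (sorted keys, compact separators)."""
    return json.dumps(msg, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Naive shared dictionary: the concatenated samples. A trained zstd dictionary
# would be built very differently; this only illustrates the presharing idea.
SHARED_DICT = b"".join(encode(m) for m in SAMPLES)


def compress(payload: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate-compress payload, optionally priming the compressor with zdict."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(payload) + comp.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inverse of compress; the receiver must supply the same zdict."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    message = encode(
        {"user_id": 1004, "event": "login", "region": "us-east", "ok": True}
    )
    plain = compress(message)
    primed = compress(message, SHARED_DICT)
    print("raw bytes:        ", len(message))
    print("no dictionary:    ", len(plain))
    print("shared dictionary:", len(primed))
```

For a message that resembles the samples, the dictionary-primed output should come out noticeably smaller than the unprimed one, which is the effect the comment describes for payloads in the 200 to 1000 byte range. Whether a trained zstd dictionary beats a custom string encoding on real data is something to measure, as the comment says, not something this sketch demonstrates.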

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • Apache Fury Serialization 0.5.1 released

    1 project | news.ycombinator.com | 29 May 2024
  • Apache Fury – fast serialization framework – 0.5.0 released

    1 project | news.ycombinator.com | 6 May 2024
  • Fast Cloud Native Java Serialization:Fury JIT and GraalVM Native Image AOT

    1 project | news.ycombinator.com | 1 Dec 2023
  • Fury Serialization Framework 0.3.1 Released: Support Python 3.11&3.12

    1 project | /r/Python | 23 Nov 2023
  • Fury Serialization 0.3.1 Released: support Python 3.11&12

    1 project | news.ycombinator.com | 21 Nov 2023