Text Classification by Data Compression

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • corpuscompression

    Achieve better compression for small objects with a predefined corpus

  • Was going to come here to say that. Played around with this a bit for compressing small fields using a learned dictionary:

    https://github.com/spullara/corpuscompression

  • zstd

    Zstandard - Fast real-time compression algorithm

  • Two points worth noting:

    1. Gzip is not a suitable compressor for this use case, because it's limited to a 32KB window. So the input can only be correlated with the last 32KB of the reference texts.

    2. You can save a great deal in computation by avoiding recompressing the reference texts over and over and over. Some compression algorithms support checkpointing the compression state so that it can be resumed from that point repeatedly ("dictionary-based compression", which is a distinct capability from just streaming compression, which generally can only be continued once).
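Both points can be illustrated with Python's standard `zlib` module, as a minimal sketch rather than a reference implementation. It assumes the usual compression-based classification scheme, where a query is assigned to the class whose reference text makes it cheapest to encode. The helper names (`prime`, `added_cost`, `classify`) and the toy texts are hypothetical. `Compress.copy()` plays the role of the checkpoint, so each reference is compressed only once, and the last three lines show the 32 KB window limit with random data.

```python
import os
import zlib


def prime(reference: bytes):
    """Compress the reference once and return the live compressor state."""
    comp = zlib.compressobj(level=9)
    comp.compress(reference)
    comp.flush(zlib.Z_SYNC_FLUSH)  # flush pending output, keep the history window
    return comp


def added_cost(primed, query: bytes) -> int:
    """Bytes needed to encode `query` on top of the already-seen reference."""
    fork = primed.copy()  # checkpoint: the primed state itself is not consumed
    return len(fork.compress(query) + fork.flush())


def classify(primed_by_label: dict, query: bytes) -> str:
    """Pick the label whose reference makes the query cheapest to encode."""
    return min(primed_by_label,
               key=lambda label: added_cost(primed_by_label[label], query))


# Toy reference texts (made up for illustration).
references = {
    "sports": b"The striker scored a late goal and the team won the match. "
              b"The goalkeeper saved a penalty in the second half. "
              b"The coach praised the defence after the final whistle.",
    "cooking": b"Whisk the eggs with sugar and fold in the flour. "
               b"Bake the cake in a preheated oven until golden. "
               b"Simmer the sauce with garlic and fresh basil.",
}
primed = {name: prime(text) for name, text in references.items()}
query = b"The goalkeeper saved a penalty and the striker scored a late goal."
label = classify(primed, query)

# Window limit: a repeated block is only cheap if it lies within the last 32 KB.
block = os.urandom(1000)
near = added_cost(prime(block + os.urandom(1_000)), block)   # repeat inside window
far = added_cost(prime(block + os.urandom(40_000)), block)   # repeat outside window
print(label, near, far)
```

With a reference longer than 32 KB, only its tail can influence `added_cost`, which is the limitation described in point 1.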

    I would personally shill for using Zstandard [0] instead for this purpose. Although I should disclose my bias: I'm a developer of Zstd. A few salient facts:

    1. Zstd supports very large windows (up to 128MB, or up to 2GB in long mode).

    2. Zstd is much faster than zlib.

    3. Zstd has well-developed support for dictionary-based compression.

    4. Additionally, it has a dictionary trainer that can reduce a corpus of reference documents to a compact summary document that aims to capture as much of the reference corpus's content as possible. [1]

    5. More than one Python binding is available. [2][3]

    [0] https://github.com/facebook/zstd

    [1] https://github.com/facebook/zstd/blob/dev/lib/zdict.h#L40

    [2] https://pypi.org/project/zstandard

    [3] https://pypi.org/project/zstd
