The Simple Beauty of XOR Floating Point Compression

dietgpu

4 294 4.3 Cuda

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

https://computing.llnl.gov/projects/floating-point-compressi...
but it tends to be very application-specific: it works where there is high correlation (small deltas) between neighboring values in a 2D/3D/4D/etc. floating point array. For example, if you are compressing neighboring temperature grid points in a PDE weather simulation model, temperatures in neighboring cells won't differ by much.
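To make the "small deltas between neighbors" point concrete, here is a minimal Python sketch of the XOR idea from the post title. The sample values and helper names (`float_to_bits`, `xor_neighbors`, `leading_zeros`) are made up for illustration and do not come from any particular library. When neighboring values are close, they share sign, exponent, and often the top significand bits, so the XOR starts with a long run of zeros that a compressor can skip.

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_neighbors(values):
    """XOR each value's bit pattern with its predecessor's bit pattern."""
    bits = [float_to_bits(v) for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]

def leading_zeros(x: int, width: int = 64) -> int:
    """Number of zero bits above the highest set bit."""
    return width - x.bit_length()

# Hypothetical data: smooth grid temperatures vs. unrelated magnitudes.
temps = [15.50, 15.52, 15.49, 15.55]
mixed = [15.50, -0.003, 4096.0, 1e-9]

smooth_lz = [leading_zeros(x) for x in xor_neighbors(temps)]
mixed_lz = [leading_zeros(x) for x in xor_neighbors(mixed)]

print("smooth:", smooth_lz)  # sign + exponent (12 bits) cancel in every pair
print("mixed: ", mixed_lz)   # sign or exponent differs, few leading zeros
```

In the smooth series every XOR result has at least 12 leading zero bits (the sign and the 11-bit exponent cancel), while in the mixed series almost nothing cancels.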
In a lot of other cases (e.g., machine learning), the floating point significand bits (and sometimes the sign bit) tend to be incompressible noise. The exponent is the only thing that is really compressible, and the XOR trick does not help as much because neighboring values can still vary in their exponents. An entropy encoder works well here instead: it encodes closer to the actual underlying data distribution/entropy, and it does not depend on neighboring floats having similar exponents.
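A rough way to see this is to measure the empirical entropy of each field. The sketch below assumes float32 values drawn from a standard normal distribution as a stand-in for ML-style data (that choice, the seed, and the helper names are assumptions for illustration), then compares the 8-bit exponent with the low 8 bits of the significand.

```python
import math
import random
import struct
from collections import Counter

def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def split_float32(x):
    """Return (sign, exponent, low significand byte) of x rounded to float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    low_byte = bits & 0xFF
    return sign, exponent, low_byte

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
fields = [split_float32(x) for x in samples]

exp_entropy = entropy_bits([f[1] for f in fields])
low_entropy = entropy_bits([f[2] for f in fields])

print(f"exponent:             {exp_entropy:.2f} bits of 8")
print(f"low significand byte: {low_entropy:.2f} bits of 8")
```

On data like this the exponent byte carries only a few bits of information, while the low significand byte is close to the full 8 bits, which is why an entropy coder applied to the exponents alone captures most of the available savings.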
In 2022, I created dietgpu, a library to losslessly compress/decompress floating point data at up to 400 GB/s on an A100. It uses a general-purpose asymmetric numeral system encoder/decoder on GPU (the first such implementation of general ANS on GPU, predating nvCOMP) for exponent compression.
We have used this to losslessly compress floating point data sent between GPUs (e.g., over InfiniBand/NVLink/Ethernet/etc.) when training massive ML models, speeding up overall wall clock training time across 100s/1000s of GPUs without changing anything about how the training works (the compression is lossless, so it computes the same thing it did before).
https://github.com/facebookresearch/dietgpu
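For readers unfamiliar with ANS, the following is a toy range-ANS (rANS) coder in Python applied to a short list of exponent bytes. It is only a sketch of the general technique and is not how dietgpu is implemented: it keeps the whole state in one arbitrary-precision integer, skips renormalization, runs on the CPU, and builds its frequency table from the data itself. All function names and the sample exponents are hypothetical.

```python
from collections import Counter

def build_tables(symbols):
    """Frequency and cumulative-frequency tables built from the data."""
    freq = Counter(symbols)
    cum, total = {}, 0
    for s in sorted(freq):
        cum[s] = total
        total += freq[s]
    return freq, cum, total

def rans_encode(symbols, freq, cum, total):
    """Fold all symbols into one big integer state."""
    state = 1
    for s in reversed(symbols):  # encode in reverse so decoding runs forward
        state = (state // freq[s]) * total + cum[s] + (state % freq[s])
    return state

def rans_decode(state, count, freq, cum, total):
    """Recover `count` symbols from the integer state."""
    out = []
    for _ in range(count):
        slot = state % total
        # Find the symbol whose cumulative range contains this slot.
        s = next(k for k in freq if cum[k] <= slot < cum[k] + freq[k])
        state = freq[s] * (state // total) + slot - cum[s]
        out.append(s)
    return out

# Hypothetical float32 exponent bytes clustered around 126 (values near 0.5..1).
exponents = [126] * 12 + [127] * 10 + [125] * 6 + [128] * 3 + [124]
freq, cum, total = build_tables(exponents)
state = rans_encode(exponents, freq, cum, total)
decoded = rans_decode(state, len(exponents), freq, cum, total)

print("raw bits:    ", 8 * len(exponents))
print("encoded bits:", state.bit_length())
print("round trip:  ", decoded == exponents)
```

The encoded state needs far fewer bits than the raw exponent bytes because frequent symbols grow the state by a smaller factor. A practical coder would also have to transmit or agree on the frequency table, and would stream the state out in fixed-size chunks rather than rely on big integers.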

Related posts

CUDA Checkpoint and Restore
1 project | news.ycombinator.com | 30 Apr 2024

Ask HN: Yo Nephew, in E. Africa, wants to train an LLM with on disk Wikipedia
1 project | news.ycombinator.com | 24 Apr 2024

Show HN: One Billion Rows in CUDA
1 project | news.ycombinator.com | 13 Apr 2024

Show HN: Faster sorting with register shuffling in CUDA
1 project | news.ycombinator.com | 15 Mar 2024

Raft: Fundamental widely-used algorithms and primitives for machine learning
1 project | news.ycombinator.com | 22 Feb 2024

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com. Post date: 11 Apr 2024
