| | dietgpu | gpuhd |
|---|---|---|
| Mentions | 4 | 1 |
| Stars | 294 | 39 |
| Growth | 3.4% | - |
| Activity | 4.3 | 10.0 |
| Latest commit | 22 days ago | over 5 years ago |
| Language | Cuda | C++ |
| License | MIT License | GNU Lesser General Public License v3.0 only |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
dietgpu
- The Simple Beauty of XOR Floating Point Compression
https://computing.llnl.gov/projects/floating-point-compressi...
But it tends to be very application-specific: it works where there is high correlation / small deltas between neighboring values in a 2d/3d/4d/etc. floating point array (e.g., you are compressing neighboring temperature grid points in a PDE weather simulation model, where temperatures in neighboring cells won't differ by much).
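To make the XOR trick concrete, here is a tiny Python sketch (my own illustration, not code from the linked project): XORing the IEEE-754 bit patterns of neighboring values from a smooth field leaves a long run of leading zeros, which is what a subsequent encoder exploits.

```python
import struct

def xor_floats(a: float, b: float) -> int:
    """XOR the IEEE-754 bit patterns of two float32 values."""
    ai = struct.unpack("<I", struct.pack("<f", a))[0]
    bi = struct.unpack("<I", struct.pack("<f", b))[0]
    return ai ^ bi

# Neighboring values from a smooth field (e.g. temperatures) share the
# sign, exponent, and high mantissa bits, so the XOR differs only in the
# low mantissa bits -- i.e., it has many leading zeros and compresses well.
smooth = [20.125, 20.127, 20.129, 20.131]
for prev, cur in zip(smooth, smooth[1:]):
    print(f"{xor_floats(prev, cur):032b}")
```

For a field that jumps around in magnitude, the exponents differ too and the XOR is no longer mostly zeros, which is the limitation discussed next.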
In a lot of other cases (e.g., machine learning), the floating point significand bits (and sometimes the sign bit) tend to be incompressible noise. The exponent is the only thing that is really compressible, and the xor trick does not help as much because neighboring values can still vary in their exponents. An entropy encoder works well there instead (it encodes closer to the actual underlying data distribution/entropy), and it does not depend upon neighboring floats having similar exponents.
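A quick way to see this split is to measure byte-wise entropy of the exponent and low mantissa fields separately. The sketch below is my own illustration with synthetic Gaussian "weights" (not dietgpu code): the exponents cluster into a handful of values while the low mantissa bits look like uniform noise.

```python
import math
import random
import struct
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic ML-style weights: small, zero-centered Gaussian values.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

exps, mantissas = bytearray(), bytearray()
for w in weights:
    bits = struct.unpack("<I", struct.pack("<f", w))[0]
    exps.append((bits >> 23) & 0xFF)   # the 8 exponent bits
    mantissas.append(bits & 0xFF)      # the low 8 mantissa bits

print(f"exponent entropy: {byte_entropy(exps):.2f} bits/byte")
print(f"mantissa entropy: {byte_entropy(mantissas):.2f} bits/byte")
```

The exponents come out far below 8 bits/byte (so an entropy coder can shrink them), while the low mantissa byte is essentially incompressible.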
In 2022, I created dietgpu, a library to losslessly compress/decompress floating point data at up to 400 GB/s on an A100. It uses a general-purpose asymmetric numeral system encoder/decoder on GPU (the first such implementation of general ANS on GPU, predating nvCOMP) for exponent compression.
We have used this to losslessly compress floating point data between GPUs (e.g., over Infiniband/NVLink/ethernet/etc) in training massive ML models to speed up overall wall clock time of training across 100s/1000s of GPUs without changing anything about how the training works (it's lossless compression, it computes the same thing that it did before).
https://github.com/facebookresearch/dietgpu
- Parallelising Huffman decoding and x86 disassembly by synchronising prefix codes
ANS is super fast and trivially parallelizable, faster than Huffman and especially arithmetic coding. It is fast because it can be machine-word oriented (you can read/write whole machine words at a time, not arbitrary variable-bit-length sequences), and as a result you can interleave any number of independent (parallel) encoders in the same stream, with just a prefix sum to figure out where each one writes its state normalization values. I for one got up to 400 GB/s throughput on A100 GPUs in my implementation (https://github.com/facebookresearch/dietgpu).
ANS can also self-synchronize.
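To illustrate the word-oriented property, here is a minimal single-state, byte-renormalizing rANS codec in Python. It is a sketch for exposition only, not dietgpu's CUDA implementation (which interleaves many such states across GPU threads and uses a prefix sum to lay out their renormalization bytes); the constants and helper names are my own.

```python
PROB_BITS = 12           # symbol frequencies sum to 2**PROB_BITS
PROB_SCALE = 1 << PROB_BITS
RANS_L = 1 << 16         # lower bound of the normalized state interval

def build_tables(freqs):
    """Cumulative frequencies plus a slot -> symbol decode table."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    assert cum[-1] == PROB_SCALE
    slot2sym = bytearray(PROB_SCALE)
    for s, f in enumerate(freqs):
        for slot in range(cum[s], cum[s + 1]):
            slot2sym[slot] = s
    return cum, slot2sym

def encode(symbols, freqs):
    cum, _ = build_tables(freqs)
    state = RANS_L
    out = bytearray()
    for s in reversed(symbols):          # rANS encodes in reverse order
        f = freqs[s]
        x_max = ((RANS_L >> PROB_BITS) << 8) * f
        while state >= x_max:            # renormalize: emit whole bytes,
            out.append(state & 0xFF)     # never variable-length bit strings
            state >>= 8
        state = (state // f) * PROB_SCALE + state % f + cum[s]
    out.extend(state.to_bytes(4, "little"))
    return bytes(out)

def decode(data, freqs, n):
    cum, slot2sym = build_tables(freqs)
    state = int.from_bytes(data[-4:], "little")
    pos = len(data) - 5                  # renorm bytes are read backward
    out = []
    for _ in range(n):
        slot = state & (PROB_SCALE - 1)
        s = slot2sym[slot]
        state = freqs[s] * (state >> PROB_BITS) + slot - cum[s]
        while state < RANS_L:            # renormalize: pull bytes back in
            state = (state << 8) | data[pos]
            pos -= 1
        out.append(s)
    return out
```

Because the hot loops only touch whole bytes (or words, in a wider variant), N independent states can share one output stream: each writes its renormalization bytes into a region whose offset is found with a prefix sum over per-state byte counts, which is the interleaving trick described above.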
- How to defend when patent office gave smb monopoly for your work (e.g. found on github)? Defend JPEG XL from granted ANS patent? (author here)
In the case of the rANS patent, besides individual donations, organizations blocked by the patent could also donate, e.g. JPEG, Google, Nvidia, Facebook here.
- DietGPU: Fast ANS Codec for Nvidia GPUs
gpuhd
- Parallelising Huffman decoding and x86 disassembly by synchronising prefix codes
https://github.com/weissenberger/gpuhd
The authors of this repo/paper use the self-synchronizing property of almost all Huffman codes to implement parallel Huffman decoding on the GPU at ~10 GB/s. In practice, I haven't found it useful to offload Huffman decoding from the CPU to the GPU, since the cost of the round-trip to the GPU outweighs the GPU's speed advantage. But if your data is already on the GPU, this is a really cool way to do Huffman decoding.
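The self-synchronizing property is easy to demonstrate: a prefix-code decoder started at the wrong bit offset usually falls back into lockstep with the true codeword boundaries after a few symbols, which is what lets independent decoders start at guessed chunk offsets. The sketch below uses a hypothetical 4-symbol code of my own, not gpuhd's tables.

```python
# Toy prefix (Huffman-style) code; purely illustrative.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode_bits(text):
    return "".join(CODE[ch] for ch in text)

def decode_bits(bits):
    """Greedy prefix-code decode; returns (symbols, codeword end offsets)."""
    rev = {v: k for k, v in CODE.items()}
    out, boundaries, cur = [], [], ""
    for i, bit in enumerate(bits):
        cur += bit
        if cur in rev:
            out.append(rev[cur])
            boundaries.append(i + 1)   # bit position just after this codeword
            cur = ""
    return out, boundaries

bits = encode_bits("abacadabacadabacad")
_, true_bounds = decode_bits(bits)

# Start decoding one bit late, as a misaligned parallel chunk would.
_, late_bounds = decode_bits(bits[1:])
late_bounds = [b + 1 for b in late_bounds]   # shift back to absolute offsets

# Once a late boundary coincides with a true one, the decoder is in sync
# and every subsequent codeword is decoded correctly.
sync = sorted(set(true_bounds) & set(late_bounds))
print("first resynchronized boundary (bit):", sync[0] if sync else None)
```

The gpuhd approach builds on exactly this effect: speculative decoders launched at arbitrary offsets converge to the true symbol stream, so only the short desynchronized prefix of each chunk needs fixing up.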
What are some alternatives?
nvcomp - Repository for nvCOMP docs and examples. nvCOMP is a library for fast lossless compression/decompression on the GPU that can be downloaded from https://developer.nvidia.com/nvcomp.