t-digest VS minisketch

Compare t-digest and minisketch to see how they differ.

t-digest

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means (by tdunning)

minisketch

Minisketch: an optimized library for BCH-based set reconciliation (by sipa)
               t-digest             minisketch
Mentions       9                    10
Stars          1,914                300
Growth         -                    -
Activity       3.3                  0.6
Last commit    3 months ago         4 months ago
Language       Java                 C++
License        Apache License 2.0   MIT License
Mentions - the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

t-digest

Posts with mentions or reviews of t-digest. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-07-21.

minisketch

Posts with mentions or reviews of minisketch. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-11-20.
  • Peer-to-Peer Encrypted Messaging
    11 projects | news.ycombinator.com | 20 Nov 2022
    Since the protocol appears to use ad hoc synchronization, the authors might be interested in https://github.com/sipa/minisketch/ which is a library that implements a data structure (pinsketch) that allows two parties to synchronize their sets of m b-bit elements which differ by c entries using only b*c bits. A naive protocol would use m*b bits instead, which is potentially much larger.

    I'd guess that under normal usage the message densities probably don't justify such efficient means (we developed this library for use in bitcoin, targeting rates on the order of a dozen new messages per second and where every participant has many peers with potentially differing sets), but it's still probably worth being aware of. The pinsketch is always at least as efficient as a naive approach, but may not be worth the complexity.

    The somewhat better known IBLT data structure has constant overheads that make it less efficient than even naive synchronization until the set differences are fairly large (particularly when the element hashes are small); so some applications that evaluated and eschewed IBLT might find pinsketch applicable.

  • Ask HN: What are some 'cool' but obscure data structures you know about?
    54 projects | news.ycombinator.com | 21 Jul 2022
    Here is one not on the list so far:

    Set Sketches. They allow you to compute the difference between two sets (for example to see if data has been replicated between two nodes) WITHOUT transmitting all the keys in one set to another.

    Say you have two sets of the numbers [1, ..., 1 million], all 32-bit integers, and you know one set is missing 2 random numbers. Set sketches allow you to send a "set checksum" that is only 64 BITS which allows the other party to compute those missing numbers. A naive algorithm would require about 4 MB of data (1 million 4-byte integers) to be transferred to calculate the same thing.

    *(in particular pin sketch https://github.com/sipa/minisketch).
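To make the "set checksum" idea concrete, here is a minimal toy sketch in Python. It is not minisketch's code or API: it assumes tiny 8-bit nonzero elements in GF(2^8), a capacity of two differences, and a brute-force decoder, and all function names are hypothetical. The sketch stores the sum of the elements and the sum of their cubes (two 8-bit values, 16 bits in total), which mirrors the post's 2 x 32 = 64 bits for two missing 32-bit numbers.

```python
# Toy capacity-2 set sketch over GF(2^8). Illustrative only: the real
# library works over larger fields and uses a proper BCH decoder.
MOD = 0x11B  # x^8 + x^4 + x^3 + x + 1, an irreducible polynomial


def gf_mul(a, b):
    """Multiply two elements of GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # reduce whenever the degree reaches 8
            a ^= MOD
        b >>= 1
    return result


def cube(x):
    return gf_mul(gf_mul(x, x), x)


def make_sketch(elements):
    """Return (s1, s3): the field sum of the elements and of their cubes."""
    s1 = s3 = 0
    for e in elements:     # elements must be in 1..255
        s1 ^= e            # addition in GF(2^8) is XOR
        s3 ^= cube(e)
    return s1, s3


def combine(a, b):
    """Sketch of the symmetric difference of the two underlying sets."""
    return a[0] ^ b[0], a[1] ^ b[1]


def decode(sk):
    """Recover at most two elements, or return None if that is impossible."""
    s1, s3 = sk
    if s1 == 0 and s3 == 0:
        return set()                       # the sets are identical
    if s1 != 0 and cube(s1) == s3:
        return {s1}                        # exactly one difference
    for x in range(1, 256):                # brute force is fine for 8 bits
        y = x ^ s1
        if y != 0 and y != x and cube(x) ^ cube(y) == s3:
            return {x, y}
    return None  # more than two differences (or a corrupted sketch)


alice = set(range(1, 201))
bob = alice - {37, 142}
missing = decode(combine(make_sketch(alice), make_sketch(bob)))
print(sorted(missing))  # [37, 142]
```

Only the two-value sketch crosses the wire; the receiver combines it with a sketch of its own set and decodes the difference. With more than two differences this toy decoder either fails or can return a wrong pair.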

    54 projects | news.ycombinator.com | 21 Jul 2022
    How about a pinsketch:

    A pinsketch is a set that takes a specified amount of memory and into which you can insert and remove set members or even add whole sets in time O(memory size). You can insert an unbounded number of entries, and at any time that it has equal or fewer entries than the size you can decode the list of members.

    For an example usage, say I have a list of ten million IP addresses of people who have DOS attacked my systems recently. I want to send my list to you over an expensive iridium connection, so I don't want to just send the 40MiB list. Fortunately you've been making your own observations (and maybe have stale data from me), and we don't expect our lists to differ by more than 1000 entries. So I make and maintain a pinsketch with size 1000 which takes 4000 bytes (1000 * 4bytes because IP addresses are 32-bits). Then to send you an update I just send it over. You maintain your own pinsketch of addresses, you subtract it from the one I sent and then you decode it. If the number of entries different between us is under 1000 you're guaranteed to learn the difference (otherwise the decode will fail, or give a false decode with odds ~= 1/2^(1000)).

    Bonus: We don't need to know in advance how different our sets are-- I can send the sketch in units as small as one word at a time (32-bits in this case) and stop sending once you've got enough to decode.
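The size arithmetic in the example above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper names are hypothetical, and it assumes the sketch costs exactly capacity x element-width bits, as the post describes.

```python
def naive_bytes(set_size, element_bits):
    """Bytes needed to send every element of the set."""
    return (set_size * element_bits + 7) // 8


def sketch_bytes(capacity, element_bits):
    """Bytes needed for a sketch that can decode `capacity` differences."""
    return (capacity * element_bits + 7) // 8


full_list = naive_bytes(10_000_000, 32)   # ten million IPv4 addresses
sketch_size = sketch_bytes(1000, 32)      # capacity of 1000 differences

print(full_list)                 # 40000000
print(sketch_size)               # 4000
print(full_list // sketch_size)  # 10000
```

Under these assumptions the sketch is 10,000 times smaller than the full list, and its size depends only on the expected number of differences, not on the size of the sets.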

    Here is an implementation I contributed to: https://github.com/sipa/minisketch/

    There is a related data structure called an invertible Bloom lookup table (IBLT) that accomplishes the same task. Its encoding and especially decoding is much faster, and it has asymptotically the same communications efficiency. However, the constant factors on the communications efficiency are poor, so for 'small' set differences (like the 1000 above) it has a rather high overhead factor, and it can't guarantee decoding. I think that makes it much less magical, though it may be the right tool for some applications.

    IBLT also has the benefit that the decoder is a fun bit of code golf to implement.

    54 projects | news.ycombinator.com | 21 Jul 2022
    I love the set reconciliation structures like the IBLT (Invertible Bloom Lookup Table) and BCH set digests like minisketch.

    https://github.com/sipa/minisketch

    Let's say you have a set of a billion items. Someone else has mostly the same set but they differ by 10 items. These let you exchange messages that would fit in one UDP packet to reconcile the sets.

  • Here is how Ethereum COULD scale without increasing centralisation and without depending on layer two's.
    2 projects | /r/CryptoTechnology | 27 Jan 2022
    Sipa has been working on a better version of that for a while. The technical term is a "set reconciliation protocol", but Bitcoin Core has been doing a more basic version of this for a while. Note that the "BCH" there isn't the same as Bcash.
  • ish: Sketches for Zig
    3 projects | /r/Zig | 18 Dec 2021
    I'd also have to say that Zig is a pretty neat library for this. In order to implement PBS I needed the MiniSketch-library (written in C/C++) and I'll have to say that integrating with it has been a breeze. Some fiddling in build.zig so that I can avoid Makefile, and after that everything has worked amazingly.
  • The Pinecone Overlay Network
    2 projects | news.ycombinator.com | 7 May 2021
    Networks that need to constrain themselves to limited topologies to avoid traffic magnification do so at the expense of robustness, especially against active attackers that grind their identifiers to gain privileged positions.

    Maybe this is a space where efficient reconciliation ( https://github.com/sipa/minisketch/ ) could help-- certainly if the goal were to flood messages to participants reconciliation can give almost optimal communication without compromising robustness.

What are some alternatives?

When comparing t-digest and minisketch you can also consider the following projects:

EvoTrees.jl - Boosted trees in Julia

wormhole-william-mobile - End-to-end encrypted file transfer for Android and iOS. A Magic Wormhole Mobile client.

timescale-analytics - Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈

tdigest - t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark

ctrie-java - Java implementation of a concurrent trie

tries-T9-Prediction - Its artificial intelligence algorithm of T9 mobile

PSI - Private Set Intersection Cardinality protocol based on ECDH and Bloom Filters

AspNetCoreDiagnosticScenarios - This repository has examples of broken patterns in ASP.NET Core applications

tdigest - PostgreSQL extension for estimating percentiles using t-digest

sdsl-lite - Succinct Data Structure Library 2.0

rolling-quantiles - Blazing fast, composable, Pythonic quantile filters.

ann-benchmarks - Benchmarks of approximate nearest neighbor libraries in Python