tdigest
t-digest
Metric | tdigest | t-digest |
---|---|---|
Mentions | 1 | 9 |
Stars | 82 | 1,924 |
Growth | - | - |
Activity | 3.7 | 3.3 |
Last commit | 7 months ago | 5 months ago |
Language | C | Java |
License | PostgreSQL License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
tdigest
t-digest
-
Ask HN: How do you deal with information and internet addiction?
> I get a lot of benefit from this information but somehow it feels shallow.
I take a longer view to this. For example, a few years ago I read about an algorithm to calculate percentiles in real time. [0]
It literally just came up at work today. I haven't used that information but maybe two times since I read it, but it was super relevant today and saved my team potential weeks of development.
So maybe it's not so shallow.
But to your actual question, I have a similar problem. The best I can say is that deadlines help. I usually put down the HN and Youtube when I have a deadline coming up. And not just at work. I make sure my hobbies have deadlines too.
I tell people when I think something will be done, so they start bugging me about it when it doesn't get done, so that I have a "deadline". Also one of my hobbies is pixel light shows for holidays, which come with excellent natural deadlines -- it has to be done by the holiday or it's useless.
So either find an "accountability buddy" who will hold you to your self imposed deadlines, or find a hobby that has natural deadlines, like certain calendar dates, or annual conventions or contests that you need to be done by.
[0] https://github.com/tdunning/t-digest
-
Ask HN: What are some 'cool' but obscure data structures you know about?
I am enamored by data structures in the sketch/summary/probabilistic family: t-digest[1], q-digest[2], count-min sketch[3], matrix-sketch[4], graph-sketch[5][6], Misra-Gries sketch[7], top-k/spacesaving sketch[8], &c.
What I like about them is that they give me a set of engineering tradeoffs that I typically don't have access to: accuracy-speed[9] or accuracy-space. There have been too many times that I've had to say, "I wish I could do this, but it would take too much time/space to compute." Most of these problems still work even if the accuracy is not 100%. Furthermore, many (if not all) of these can tune accuracy by parameter adjustment anyway. They tend to have favorable combinatorial properties, i.e., they form monoids or semigroups under merge operations. In short, these properties gave me the ability to solve problems I couldn't before.
I hope they are as useful or intriguing to you as they are to me.
1. https://github.com/tdunning/t-digest
2. https://pdsa.readthedocs.io/en/latest/rank/qdigest.html
3. https://florian.github.io/count-min-sketch/
4. https://www.cs.yale.edu/homes/el327/papers/simpleMatrixSketc...
5. https://www.juanlopes.net/poly18/poly18-juan-lopes.pdf
6. https://courses.engr.illinois.edu/cs498abd/fa2020/slides/20-...
7. https://people.csail.mit.edu/rrw/6.045-2017/encalgs-mg.pdf
8. https://www.sciencedirect.com/science/article/abs/pii/S00200...
9. It may better be described as error-speed and error-space, but I've avoided the term error because the term for programming audiences typically evokes the idea of logic errors and what I mean is statistical error.
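To make the accuracy-space tradeoff and the merge property concrete, here is a minimal count-min sketch in Python. It is an illustrative sketch only, not taken from any of the linked projects: the class name, the default `width` and `depth`, and the use of salted BLAKE2 hashes are all assumptions chosen for brevity.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: frequency estimates may overcount, never undercount."""

    def __init__(self, width=256, depth=4):
        # width and depth are the accuracy knobs (hypothetical defaults).
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting the item with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other):
        # Merging is element-wise addition, which is associative,
        # so sketches with equal shapes form a monoid under merge.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        out = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            out.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return out
```

Wider rows reduce overcounting from hash collisions, and more rows reduce the chance that an item collides in every row, so memory can be traded against error directly. Because `merge` is plain addition, two sketches built on separate machines combine into the same sketch that a single pass over all the data would have produced.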
-
Monarch: Google’s Planet-Scale In-Memory Time Series Database
Ah, I misunderstood what you meant. If you are reporting static buckets, I get how that is better than what folks typically do, but how do you know the buckets a priori? Others back their histograms with things like https://github.com/tdunning/t-digest. It is pretty powerful, as the buckets are dynamic based on the data and histograms can be added together.
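The idea of data-driven buckets that can be added together can be sketched with a toy centroid digest. This is a simplified illustration loosely modeled on the merging variant described in the t-digest paper, not the tdunning/t-digest implementation: the class name `ToyDigest`, the buffer size of 500, and the interpolation rule are assumptions, and details such as tracking the exact minimum and maximum are left out.

```python
import math


class ToyDigest:
    """Toy mergeable quantile sketch built from (mean, weight) centroids."""

    def __init__(self, compression=100.0):
        self.compression = compression
        self.centroids = []  # sorted list of (mean, weight)
        self.buffer = []     # points not yet folded into centroids

    def _k(self, q):
        # Arcsine scale function: steep near q=0 and q=1, so tail centroids stay small.
        q = min(1.0, max(0.0, q))
        return self.compression / (2 * math.pi) * math.asin(2 * q - 1)

    def _compress(self):
        points = sorted(self.centroids + self.buffer)
        self.buffer = []
        if not points:
            return
        total = sum(w for _, w in points)
        merged, done = [], 0.0          # done = weight already emitted
        mean, weight = points[0]
        k_left = self._k(0.0)
        for m, w in points[1:]:
            if self._k((done + weight + w) / total) - k_left <= 1.0:
                # Still within one unit of the scale function: absorb the point.
                mean = (mean * weight + m * w) / (weight + w)
                weight += w
            else:
                merged.append((mean, weight))
                done += weight
                k_left = self._k(done / total)
                mean, weight = m, w
        merged.append((mean, weight))
        self.centroids = merged

    def add(self, x):
        self.buffer.append((float(x), 1.0))
        if len(self.buffer) >= 500:     # hypothetical buffer size
            self._compress()

    def merge(self, other):
        # "Adding histograms together": pool all centroids and recompress.
        out = ToyDigest(self.compression)
        out.buffer = self.centroids + self.buffer + other.centroids + other.buffer
        out._compress()
        return out

    def quantile(self, q):
        self._compress()
        if not self.centroids:
            raise ValueError("empty digest")
        target = q * sum(w for _, w in self.centroids)
        cum, prev_mid, prev_mean = 0.0, None, None
        for mean, w in self.centroids:
            mid = cum + w / 2           # treat each mean as sitting mid-centroid
            if target <= mid:
                if prev_mid is None:
                    return mean
                frac = (target - prev_mid) / (mid - prev_mid)
                return prev_mean + frac * (mean - prev_mean)
            prev_mid, prev_mean, cum = mid, mean, cum + w
        return self.centroids[-1][0]
```

Two digests built on disjoint halves of a dataset can be combined with `merge` and then queried as if a single digest had seen everything, which is what makes this family of sketches convenient for distributed aggregation.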
-
[Q] Estimator for pop median
Yes, but if you need to estimate median on the fly (e.g., over a stream of data) or in parallel there are better ways.
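As one concrete example of computing a median on the fly, the classic two-heap approach keeps the lower and upper halves of the stream in separate heaps. This is a sketch of a well-known exact technique, not necessarily what the commenter had in mind: it still stores every value, so it addresses streaming updates but not bounded memory or parallel merging, which is where sketches such as t-digest come in. The class name `RunningMedian` is made up for illustration.

```python
import heapq


class RunningMedian:
    """Exact streaming median using two heaps (O(log n) per insert, O(n) memory)."""

    def __init__(self):
        self.low = []   # max-heap of the smaller half, stored as negated values
        self.high = []  # min-heap of the larger half

    def add(self, x):
        # Route the value to the half it belongs to.
        if self.low and x > -self.low[0]:
            heapq.heappush(self.high, x)
        else:
            heapq.heappush(self.low, -x)
        # Rebalance so that low has the same size as high, or one more.
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if not self.low:
            raise ValueError("no data")
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2
```

After each `add`, `median()` reflects all values seen so far without re-sorting the stream.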
-
How percentile approximation works (and why it's more useful than averages)
There are some newer data structures that take this to the next level, such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%).
[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
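The tail accuracy comes from letting centroids be large near the median and small near the extremes. The snippet below illustrates this with the arcsine scale function from the linked paper; the helper names, the default `delta=100`, and the "at most one unit of k per centroid" rule as coded here are a simplified sketch for intuition, not the library's code.

```python
import math


def k1(q, delta=100.0):
    """Arcsine scale function: maps a quantile q in [0, 1] to a 'k' coordinate."""
    return delta / (2 * math.pi) * math.asin(2 * q - 1)


def k1_inv(k, delta=100.0):
    """Inverse of k1, valid for k in [-delta/4, delta/4]."""
    return (math.sin(2 * math.pi * k / delta) + 1) / 2


def max_centroid_fraction(q, delta=100.0):
    """Largest share of the data a centroid starting at quantile q may cover
    if each centroid is limited to one unit of k (assumed rule)."""
    k_right = k1(q, delta) + 1.0
    if k_right >= delta / 4:
        return 1.0 - q  # the centroid may run to the end of the distribution
    return k1_inv(k_right, delta) - q


if __name__ == "__main__":
    for q in (0.5, 0.9, 0.99, 0.999):
        print(q, max_centroid_fraction(q))
```

With these assumptions, a centroid near the median may cover roughly 3% of the data, while one starting at the 99th percentile is limited to a much smaller share, so interpolation error shrinks where tail percentiles are read.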
-
Reducing fireflies in path tracing
[2] https://github.com/tdunning/t-digest
-
Reliable, Scalable, and Maintainable Applications
T-Digest
-
Show HN: Fast Rolling Quantiles for Python
This is pretty cool. The title would be a bit more descriptive if it were “Fast Rolling Quantile Filters for Python”, since the high-pass/low-pass filter functionality seems to be the focus.
The README mentions it uses binary heaps - if you’re willing to accept some (bounded) approximation, then it should be possible to reduce memory usage and somewhat reduce runtime by using a sketching data structure like Dunning’s t-digest: https://github.com/tdunning/t-digest/blob/main/docs/t-digest....
There is an open source Python implementation, although I haven’t used it and can’t vouch for its quality: https://github.com/CamDavidsonPilon/tdigest
What are some alternatives?
timescale-analytics - Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
EvoTrees.jl - Boosted trees in Julia
tdigest - t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
PSI - Private Set Intersection Cardinality protocol based on ECDH and Bloom Filters
AspNetCoreDiagnosticScenarios - This repository has examples of broken patterns in ASP.NET Core applications
minisketch - Minisketch: an optimized library for BCH-based set reconciliation
Caffeine - A high performance caching library for Java
swift - the multiparty transport protocol (aka "TCP with swarming" or "BitTorrent at the transport layer")
rolling-quantiles - Blazing fast, composable, Pythonic quantile filters.
pyroscope - Continuous Profiling Platform. Debug performance issues down to a single line of code [Moved to: https://github.com/grafana/pyroscope]
plurid-data-structures-typescript - Utility Data Structures Implemented in TypeScript