t-digest
plurid-data-structures-typescript
Metric | t-digest | plurid-data-structures-typescript
---|---|---
Mentions | 9 | 1
Stars | 1,918 | 1
Growth | - | -
Activity | 3.3 | 3.1
Last commit | 4 months ago | 12 months ago
Language | Java | TypeScript
License | Apache License 2.0 | GNU General Public License v3.0 or later
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
t-digest
-
Ask HN: How do you deal with information and internet addiction?
> I get a lot of benefit from this information but somehow it feels shallow.
I take a longer view to this. For example, a few years ago I read about an algorithm to calculate percentiles in real time. [0]
It literally just came up at work today. I haven't used that information but maybe two times since I read it, but it was super relevant today and saved my team potential weeks of development.
So maybe it's not so shallow.
But to your actual question, I have a similar problem. The best I can say is that deadlines help. I usually put down HN and YouTube when I have a deadline coming up. And not just at work. I make sure my hobbies have deadlines too.
I tell people when I think something will be done, so they start bugging me about it when it doesn't get done, so that I have a "deadline". Also one of my hobbies is pixel light shows for holidays, which come with excellent natural deadlines -- it has to be done by the holiday or it's useless.
So either find an "accountability buddy" who will hold you to your self-imposed deadlines, or find a hobby that has natural deadlines, like certain calendar dates, or annual conventions or contests that you need to be done by.
-
Ask HN: What are some 'cool' but obscure data structures you know about?
I am enamored by data structures in the sketch/summary/probabilistic family: t-digest[1], q-digest[2], count-min sketch[3], matrix-sketch[4], graph-sketch[5][6], Misra-Gries sketch[7], top-k/spacesaving sketch[8], &c.
What I like about them is that they give me a set of engineering tradeoffs that I typically don't have access to: accuracy-speed[9] or accuracy-space. There have been too many times that I've had to say, "I wish I could do this, but it would take too much time/space to compute." Most of these problems can still be solved even if the accuracy is not 100%. Furthermore, many (if not all) of these structures let you tune accuracy by adjusting a parameter anyway. They also tend to have favorable combinatorial properties, i.e., they form monoids or semigroups under merge operations. In short, these data structures gave me the ability to solve problems I couldn't before.
I hope they are as useful or intriguing to you as they are to me.
1. https://github.com/tdunning/t-digest
2. https://pdsa.readthedocs.io/en/latest/rank/qdigest.html
3. https://florian.github.io/count-min-sketch/
4. https://www.cs.yale.edu/homes/el327/papers/simpleMatrixSketc...
5. https://www.juanlopes.net/poly18/poly18-juan-lopes.pdf
6. https://courses.engr.illinois.edu/cs498abd/fa2020/slides/20-...
7. https://people.csail.mit.edu/rrw/6.045-2017/encalgs-mg.pdf
8. https://www.sciencedirect.com/science/article/abs/pii/S00200...
9. It may better be described as error-speed and error-space, but I've avoided the term error because the term for programming audiences typically evokes the idea of logic errors and what I mean is statistical error.
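To make the merge property from this comment concrete, here is a minimal count-min sketch in Python. It is an illustrative sketch, not code from any of the linked projects: the class name, the default `width` and `depth`, and the use of salted BLAKE2b hashes are all assumptions chosen for brevity.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch (hypothetical helper, not a library API).

    Estimates are never lower than the true count; they can be higher
    when different items collide in every row.
    """

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting with the row number.
        data = f"{row}:{item}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so take the smallest one.
        return min(
            self.rows[row][self._index(row, item)] for row in range(self.depth)
        )

    def merge(self, other):
        # Cell-wise addition is associative and the empty sketch is its
        # identity, which is what makes sketches of equal shape a monoid.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        merged.rows = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.rows, other.rows)
        ]
        return merged
```

Because `merge` is plain cell-wise addition, two workers can each sketch their own shard and combine the results afterwards, and the merged sketch is identical to one built from the whole stream. Larger `width` lowers the overcount and larger `depth` lowers the chance of a bad estimate, which is the accuracy-space tuning the comment refers to.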
-
Monarch: Google’s Planet-Scale In-Memory Time Series Database
Ah, I misunderstood what you meant. If you are reporting static buckets, I get how that is better than what folks typically do, but how do you know the buckets a priori? Others back their histograms with things like https://github.com/tdunning/t-digest. It is pretty powerful, as the buckets are dynamic based on the data and histograms can be added together.
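The idea of data-driven buckets that can be added together can be illustrated with a toy centroid summary. This is explicitly not the t-digest algorithm: t-digest sizes its clusters with a quantile-dependent scale function so that the tails stay fine-grained, while the sketch below caps every cluster at the same weight. The class name `TinyDigest` and the parameter `max_centroids` are made up for this example.

```python
class TinyDigest:
    """Toy mergeable quantile summary (NOT the real t-digest).

    Assumption: every cluster may hold at most total/max_centroids weight,
    a uniform cap instead of t-digest's scale function.
    """

    def __init__(self, max_centroids=50):
        self.max_centroids = max_centroids
        self.centroids = []  # (mean, weight) pairs

    def add(self, value, weight=1.0):
        self.centroids.append((float(value), float(weight)))
        # Buffer raw points and compress only occasionally.
        if len(self.centroids) > 4 * self.max_centroids:
            self._compress()

    def merge(self, other):
        # Merging is concatenating the centroids and compressing again.
        merged = TinyDigest(self.max_centroids)
        merged.centroids = self.centroids + other.centroids
        merged._compress()
        return merged

    def _compress(self):
        points = sorted(self.centroids)
        total = sum(weight for _, weight in points)
        limit = total / self.max_centroids  # largest allowed cluster weight
        out = []
        for mean, weight in points:
            if out and out[-1][1] + weight <= limit:
                # Fold the point into the previous cluster (weighted mean).
                m, w = out[-1]
                out[-1] = ((m * w + mean * weight) / (w + weight), w + weight)
            else:
                out.append((mean, weight))
        self.centroids = out

    def quantile(self, q):
        if not self.centroids:
            raise ValueError("digest is empty")
        self._compress()
        total = sum(weight for _, weight in self.centroids)
        target = q * total
        seen = 0.0
        for mean, weight in self.centroids:
            seen += weight
            if seen >= target:
                return mean  # mean of the cluster that contains rank q
        return self.centroids[-1][0]
```

A typical use would be to build one `TinyDigest` per host or per time window and call `merge` when a wider view is needed. The cluster boundaries follow the data rather than being fixed up front, which is the property the comment contrasts with static buckets.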
-
[Q] Estimator for pop median
Yes, but if you need to estimate median on the fly (e.g., over a stream of data) or in parallel there are better ways.
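As one concrete option for the streaming case, a running median can be maintained with two heaps. This is a standard exact technique and not necessarily what the commenter had in mind; the class name `RunningMedian` is hypothetical. Unlike a sketch such as t-digest, it keeps every value in memory, so it addresses the "on the fly" part but not bounded memory or parallel merging.

```python
import heapq


class RunningMedian:
    """Exact streaming median using two heaps (illustrative helper)."""

    def __init__(self):
        self._low = []   # max-heap of the smaller half, stored negated
        self._high = []  # min-heap of the larger half

    def add(self, x):
        # Push onto the low half, then hand its largest value to the high
        # half so that every element in low <= every element in high.
        heapq.heappush(self._low, -x)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so that low holds the extra element for odd counts.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no data yet")
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2
```

Each `add` costs O(log n) and `median` is O(1). When memory must stay bounded or partial results must be combined across workers, an approximate mergeable summary is the usual alternative.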
-
How percentile approximation works (and why it's more useful than averages)
There are some newer data structures that take this to the next level, such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%).
[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
-
Reducing fireflies in path tracing
[2] https://github.com/tdunning/t-digest
-
Reliable, Scalable, and Maintainable Applications
T-Digest
-
Show HN: Fast Rolling Quantiles for Python
This is pretty cool. The title would be a bit more descriptive if it were “Fast Rolling Quantile Filters for Python”, since the high-pass/low-pass filter functionality seems to be the focus.
The README mentions it uses binary heaps - if you’re willing to accept some (bounded) approximation, then it should be possible to reduce memory usage and somewhat reduce runtime by using a sketching data structure like Dunning’s t-digest: https://github.com/tdunning/t-digest/blob/main/docs/t-digest....
There is an open source Python implementation, although I haven’t used it and can’t vouch for its quality: https://github.com/CamDavidsonPilon/tdigest
plurid-data-structures-typescript
-
Ask HN: What are some 'cool' but obscure data structures you know about?
Somewhat along these lines, I have formed a concept awkwardly named "differentially composable string", or "deposed string", or more precisely "poor man's git".
The intended use case is to obtain a compact representation of all the historic text entered into an input field (notes, comments, maybe long-form): all the stages of the text, where a stage is a tuple [add/remove, start_index, text/end_index]. Once you get the stages from the deposed string as JSON, you could transform them however you want, then load them into a new deposed string.
You can read more on GitHub: https://github.com/plurid/plurid-data-structures-typescript#... or play around on my note-taking app implementing deposed strings and more: https://denote.plurid.com
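The stage tuples described above can be replayed to rebuild every version of the text. The following Python sketch is only an interpretation of that description, not the API of plurid-data-structures-typescript: the function names, the `"add"`/`"remove"` labels, and the choice of an exclusive end index for removals are all assumptions.

```python
import json


def apply_stage(text, stage):
    """Apply one [kind, start_index, payload] stage to a string."""
    kind, start, payload = stage
    if kind == "add":
        # payload is the text inserted at start_index
        return text[:start] + payload + text[start:]
    if kind == "remove":
        # payload is the end index (assumed exclusive) of the removed span
        return text[:start] + text[payload:]
    raise ValueError(f"unknown stage kind: {kind!r}")


def replay(stages, initial=""):
    """Return the text after every stage, oldest first."""
    versions = []
    text = initial
    for stage in stages:
        text = apply_stage(text, stage)
        versions.append(text)
    return versions


stages = [
    ["add", 0, "hello"],
    ["add", 5, " world"],
    ["remove", 0, 6],
]

# Stages are plain lists, so they survive a JSON round trip unchanged.
restored = json.loads(json.dumps(stages))
print(replay(restored))
```

Storing only the edits keeps the history compact, and because the stages are ordinary JSON values they can be filtered or rewritten before being replayed into a new string.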
What are some alternatives?
EvoTrees.jl - Boosted trees in Julia
sdsl-lite - Succinct Data Structure Library 2.0
timescale-analytics - Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
RVS_Generic_Swift_Toolbox - A Collection Of Various Swift Tools, Like Extensions and Utilities
tdigest - t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
multiversion-concurrency-control - Implementation of multiversion concurrency control, Raft, Left Right concurrency Hashmaps and a multi consumer multi producer Ringbuffer, concurrent and parallel load-balanced loops, parallel actors implementation in Main.java, Actor2.java and a parallel interpreter
PSI - Private Set Intersection Cardinality protocol based on ECDH and Bloom Filters
tdigest - PostgreSQL extension for estimating percentiles using t-digest
AspNetCoreDiagnosticScenarios - This repository has examples of broken patterns in ASP.NET Core applications
entt - Gaming meets modern C++ - a fast and reliable entity component system (ECS) and much more