tdigest
LiveStats
| | tdigest | LiveStats |
|---|---|---|
| Mentions | 1 | 1 |
| Stars | 82 | 76 |
| Growth | - | - |
| Activity | 3.7 | 0.0 |
| Last commit | 7 months ago | over 5 years ago |
| Language | C | Python |
| License | PostgreSQL License | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
How percentile approximation works (and why it's more useful than averages)
A while ago I wrote a Python library called LiveStats[1] that computes any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm called P^2 that I found in an old paper[2]. P^2 tracks a handful of marker points and fits a polynomial (a parabola) through neighboring markers to find good approximations.
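To make the marker idea concrete, here is a minimal sketch of a P^2-style estimator for a single quantile, following the five-marker scheme described in the paper[2]. It is an illustration only, not the code LiveStats ships: the class name `P2Quantile`, its methods, and the handling of the first few observations are assumptions made for this example.

```python
class P2Quantile:
    """Streaming estimate of one quantile p using five markers (P^2-style sketch)."""

    def __init__(self, p):
        if not 0.0 < p < 1.0:
            raise ValueError("p must be strictly between 0 and 1")
        self.p = p
        self.q = []                       # marker heights (sorted warm-up buffer at first)
        self.n = [1, 2, 3, 4, 5]          # actual marker positions (1-based ranks)
        self.want = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]  # desired positions
        self.step = [0, p / 2, p, (1 + p) / 2, 1]            # growth of desired positions

    def add(self, x):
        q, n = self.q, self.n
        if len(q) < 5:                    # warm-up: keep the first five values sorted
            q.append(x)
            q.sort()
            return
        # Locate the cell k that contains x, stretching the extremes if needed.
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = 0
            while x >= q[k + 1]:
                k += 1
        for i in range(k + 1, 5):         # markers above the cell shift up one rank
            n[i] += 1
        for i in range(5):
            self.want[i] += self.step[i]
        # Nudge each interior marker one rank toward its desired position.
        for i in (1, 2, 3):
            d = self.want[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = 1 if d > 0 else -1
                cand = self._parabolic(i, d)
                if not q[i - 1] < cand < q[i + 1]:
                    cand = self._linear(i, d)   # fall back if the parabola overshoots
                q[i] = cand
                n[i] += d

    def _parabolic(self, i, d):
        q, n = self.q, self.n
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def _linear(self, i, d):
        q, n = self.q, self.n
        return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])

    def value(self):
        if not self.q:
            raise ValueError("no observations yet")
        if len(self.q) < 5:               # assumption: nearest-rank over the warm-up buffer
            return self.q[min(len(self.q) - 1, int(self.p * len(self.q)))]
        return self.q[2]                  # the middle marker tracks the p-quantile
```

Memory stays constant no matter how many values are fed in: five heights and ten positions per tracked quantile. Feeding it values with `add(x)` and reading `value()` at any point gives the current estimate.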
The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to keep a fixed-size sorted buffer, randomly evict items from it, and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount toward each new data point, the estimate ends up pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
[1]: https://github.com/cxxr/LiveStats
[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
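The two interview answers described above can be sketched side by side. This is a rough illustration under stated assumptions, not the interviewer's or the candidate's actual code: the function names `reservoir_median` and `nudged_median`, the buffer size, and the step size are all hypothetical, and the buffer version uses reservoir sampling as one plausible reading of "randomly evict items" (sorting once at the end rather than keeping the buffer sorted).

```python
import random


def reservoir_median(stream, size=501, rng=None):
    """Median of a fixed-size uniform random sample of the stream."""
    rng = rng or random.Random()
    buf = []
    for i, x in enumerate(stream):
        if len(buf) < size:
            buf.append(x)
        else:
            j = rng.randrange(i + 1)      # x is kept with probability size / (i + 1)
            if j < size:
                buf[j] = x                # evict a uniformly chosen resident
    if not buf:
        raise ValueError("empty stream")
    buf.sort()
    return buf[len(buf) // 2]             # upper median when the buffer length is even


def nudged_median(stream, step=0.001):
    """Move a running estimate by a fixed step toward each new value."""
    it = iter(stream)
    try:
        est = next(it)                    # start from the first observation
    except StopIteration:
        raise ValueError("empty stream") from None
    for x in it:
        if x > est:
            est += step
        elif x < est:
            est -= step
    return est
```

The nudged estimate settles where values above and below it arrive equally often, which is the median. The fixed `step` has to be chosen relative to the scale of the data: too large and the estimate jitters, too small and it takes a long time to get there.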
What are some alternatives?
t-digest - A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means
timescale-analytics - Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
node-faststats - Quickly calculate statistics of a running stream of data
Folly - An open-source C++ library developed and used at Facebook.