NB: Post author here.
Thanks for sharing! I hadn't heard of that algorithm; I've seen a number of other ones out there, and we chose a couple that we knew about or that users requested. (We're open to more requests if folks want to use other ones: head over to https://github.com/timescale/timescaledb-toolkit and open an issue!)
A while ago I wrote a Python library called LiveStats[1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It's based on an algorithm I found in an old paper[2] called P^2, which uses piecewise-parabolic interpolation to maintain good approximations.
The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to keep a fixed-size sorted buffer, randomly evict items from it, and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it stays pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
[1]: https://github.com/cxxr/LiveStats
[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
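To make that insight concrete, here is a minimal sketch (not LiveStats or P^2 itself, just an illustration) of the "nudge the estimate a little for each new data point" idea; the fixed step size and the quantile-weighted nudges are assumptions of mine, and a real implementation would adapt the step:

    def streaming_quantile(stream, q=0.5, step=0.01):
        """Crude O(1)-memory quantile estimate: nudge the running
        estimate toward each new observation. q=0.5 gives the median."""
        estimate = None
        for x in stream:
            if estimate is None:
                estimate = float(x)           # seed with the first value
            elif x > estimate:
                estimate += step * q          # observation above: push up
            elif x < estimate:
                estimate -= step * (1.0 - q)  # observation below: push down
        return estimate

The up- and down-nudges balance exactly when a fraction q of the data falls below the estimate; P^2 sharpens this by tracking five markers and interpolating between them parabolically.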
There are some newer data structures that take this to the next level, such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%).
[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
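To get a feel for how t-digest is used in practice, here is a small usage sketch; it assumes the third-party tdigest package on PyPI (a Python port of Dunning's structure), so the method names are that port's, not the Java or folly APIs:

    import random
    from tdigest import TDigest  # pip install tdigest (third-party Python port)

    digest = TDigest()
    for _ in range(1_000_000):
        digest.update(random.expovariate(1.0))  # ingest one observation at a time

    # Tail percentiles stay accurate because centroids are kept small near the tails.
    print("p50    ~", digest.percentile(50))
    print("p99    ~", digest.percentile(99))
    print("p99.99 ~", digest.percentile(99.99))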
This is exactly the algorithm we developed at LogNormal (now part of Akamai) 10 years ago for doing fast, low-memory percentiles on large datasets.
It's implemented in this Node library: https://github.com/bluesmoon/node-faststats
Side note: I wish everyone would stop using the term "average" to refer to the arithmetic mean. "Average" just means some statistic used to summarize a dataset: it could be the arithmetic mean, median, mode(s), geometric mean, harmonic mean, or any of a bunch of other statistics. We're stuck with AVG because that's the function name used by early databases and Lotus 1-2-3.
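To make the distinction concrete, here is a small illustrative example (mine, not the commenter's) using Python's standard statistics module, which gives each of these summaries its own name:

    import statistics

    data = [1, 2, 2, 4, 8, 16]

    print("arithmetic mean:", statistics.mean(data))           # 5.5
    print("median:        ", statistics.median(data))          # 3.0
    print("mode:          ", statistics.mode(data))            # 2
    print("geometric mean:", statistics.geometric_mean(data))  # ~3.56
    print("harmonic mean: ", statistics.harmonic_mean(data))   # ~2.46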
Not sure about the parent post, but I also wrote a Python package inspired by a past interview question! It uses a pair of heaps to keep track of the items in a time-series window and compute exact quantiles over it.
https://github.com/marmarelis/rolling-quantiles
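The pair-of-heaps idea is the classic way to track a running median exactly; here is a minimal sketch of the streaming version (without the sliding-window eviction that rolling-quantiles handles), using Python's heapq:

    import heapq

    class RunningMedian:
        """Exact running median of everything seen so far."""

        def __init__(self):
            self.lower = []  # smaller half, stored negated (acts as a max-heap)
            self.upper = []  # larger half (min-heap)

        def add(self, x):
            if self.lower and x > -self.lower[0]:
                heapq.heappush(self.upper, x)
            else:
                heapq.heappush(self.lower, -x)
            # Rebalance so the heap sizes differ by at most one.
            if len(self.lower) > len(self.upper) + 1:
                heapq.heappush(self.upper, -heapq.heappop(self.lower))
            elif len(self.upper) > len(self.lower) + 1:
                heapq.heappush(self.lower, -heapq.heappop(self.upper))

        def median(self):
            if len(self.lower) > len(self.upper):
                return -self.lower[0]
            if len(self.upper) > len(self.lower):
                return self.upper[0]
            return (-self.lower[0] + self.upper[0]) / 2

Extending this to a sliding window means lazily evicting expired items from whichever heap they live in, which is the bookkeeping a windowed package has to add on top.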
Facebook seems to have an even faster implementation using sqrt; it might make sense to port that over to Rust. https://github.com/facebook/folly/blob/master/folly/stats/TD...