How percentile approximation works (and why it's more useful than averages)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • timescale-analytics

    Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈

  • NB: Post author here.

    Thanks for sharing! Hadn't heard of that algorithm, have seen a number of other ones out there, we chose a couple that we knew about / were requested by users. (And we are open to more user requests if folks want to use other ones! https://github.com/timescale/timescaledb-toolkit and open an issue!)

  • LiveStats

    Online Statistical Algorithms for Python

  • Awhile ago I wrote a Python library called LiveStats[1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm I found in an old paper[2] called P^2. It uses a polynomial to find good approximations.

    The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.

    [1]: https://github.com/cxxr/LiveStats

    [2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • t-digest

    A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means

  • There are some newer data structures that take this to the next level such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%)

    [1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest

  • node-faststats

    Quickly calculate statistics of a running stream of data

  • This is exactly the algorithm we developed at LogNormal (now part of Akamai) 10 years ago for doing fast, low-memory percentiles on large datasets.

    It's implemented in this Node library: https://github.com/bluesmoon/node-faststats

    Side note: I wish everyone would stop using the term Average to refer to the Arithmetic mean. "Average" just means some statistic used to summarize a dataset. It could be the Arithmetic Mean, Median, Mode(s), Geometric Mean, Harmonic Mean, or any of a bunch of other statistics. We're stuck with AVG because that's the function used by early databases and Lotus 123.

  • tdigest

    PostgreSQL extension for estimating percentiles using t-digest (by tvondra)

  • rolling-quantiles

    Blazing fast, composable, Pythonic quantile filters.

  • Not sure about the parent post, but I also wrote a Python package inspired by a past interview question! It uses a pair of heaps to keep track of items in a time-series window and find exact quantile estimates.

    https://github.com/marmarelis/rolling-quantiles

  • Folly

    An open-source C++ library developed and used at Facebook.

  • Facebook seems to have an even better performance implementation using sqrt. Might make sense to port that over to Rust. https://github.com/facebook/folly/blob/master/folly/stats/TD...

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts