Polars

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • polars

    Dataframes powered by a multithreaded, vectorized query engine, written in Rust

  • Handling of categoricals in polars seemed a little underbaked, though my main complaint, that categories cannot be pre-defined, seems to have been recently addressed: https://github.com/pola-rs/polars/issues/10705

  • prql

    PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

  • I am very curious to know how you feel about PRQL (prql-lang.org). IMHO it gives you the ergonomics and DX of Polars or Pandas with the power and universality of SQL, because you can still execute your queries on any SQL-compatible query execution engine of your choice, including Polars and Pandas but also DuckDB, ClickHouse, BigQuery, Redshift, Postgres, Trino/Presto, SQLite, ... to name just a few popular ones.

    The join syntax and semantics are among the trickiest parts and are under discussion again. Joins are a key part of any data transformation platform and are foundational to Relational Algebra, being right there in the "Relational" part and also the R in PRQL. Most of the PRQL built-in primitive transforms are just simple list manipulations like map, filter or reduce, but joins require care to preserve monadic composition (see for example the design of SelectMany in LINQ or flatmap in the List Monad). See this comment for some of my thoughts on this: https://github.com/PRQL/prql/issues/3782#issuecomment-181131...
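The flatmap remark above can be illustrated with plain Python lists. This is only a sketch of the general idea, not PRQL's implementation or syntax: `flat_map`, `inner_join`, and the sample rows are hypothetical names and data made up for the example.

```python
def flat_map(f, xs):
    """Apply f to each element and concatenate the resulting lists."""
    return [y for x in xs for y in f(x)]


def inner_join(left, right, on):
    """Inner join written as a flat_map over the left rows.

    Each left row maps to a list of merged rows (possibly empty),
    and flat_map flattens those lists into one result.
    """
    return flat_map(
        lambda l: [{**l, **r} for r in right if r[on] == l[on]],
        left,
    )


# Hypothetical sample data, not taken from the original post.
employees = [
    {"dept_id": 1, "name": "Ada"},
    {"dept_id": 2, "name": "Grace"},
    {"dept_id": 3, "name": "Edsger"},
]
departments = [
    {"dept_id": 1, "dept": "Research"},
    {"dept_id": 2, "dept": "Compilers"},
]

joined = inner_join(employees, departments, on="dept_id")
```

A left row with no match maps to an empty list and disappears from the result, which is why a plain map followed by filter cannot express the same thing without an intermediate nested structure.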

  • db-benchmark

    reproducible benchmark of database-like ops

  • Real-world performance is complicated since data science covers a lot of use cases.

    If you're just reading a small CSV to do analysis on it, then there will be no human-perceptible difference between Polars and Pandas. If you're reading a larger CSV with 100k rows, there still won't be much of a perceptible difference.

    Per this (old) benchmark, there are differences once you get into 500MB+ territory: https://h2oai.github.io/db-benchmark/

  • explorer

    Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir

  • The Explorer library [0] in Elixir uses Polars underneath it.

    [0] https://github.com/elixir-explorer/explorer

  • db-benchmark

    reproducible benchmark of database-like ops (by duckdblabs)

  • DuckDB maintains a benchmark of open source database-like tools, including Polars and Pandas

    https://duckdblabs.github.io/db-benchmark/

  • scikit-learn

    scikit-learn: machine learning in Python

  • sklearn is adding support through the dataframe interchange protocol (https://github.com/scikit-learn/scikit-learn/issues/25896). scipy, as far as I know, doesn't explicitly support dataframes (it just happens to work when you wrap a Series in `np.array` or `np.asarray`). I don't know about PyTorch but in general you can convert to numpy.

  • quivr

    Python library for working with Arrow data in tabular form (by B612-Asteroid-Institute)

  • Polars is cool, but man, I really have come to think that dataframes are disastrous for software. The mess of internal state and the confusion of writing functions that take “df” and manipulate it: it's all so hard to clean up once you’re deep in the mess.

    Quivr (https://github.com/spenczar/quivr) is an alternative approach that has been working for me. Maybe types are good!
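The "maybe types are good" point can be sketched with the standard library alone. The following is not Quivr's API; `Observations`, its columns, and `brightest` are hypothetical names chosen for illustration. The idea is that a function's signature states which columns it needs, instead of accepting an opaque "df".

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Observations:
    """A hypothetical typed table: columns are declared up front."""

    object_id: List[str]
    magnitude: List[float]

    def __post_init__(self):
        # Every column must have the same number of rows.
        lengths = {len(getattr(self, f.name)) for f in fields(self)}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")


def brightest(obs: Observations) -> str:
    """Return the id of the row with the smallest magnitude."""
    i = min(range(len(obs.magnitude)), key=obs.magnitude.__getitem__)
    return obs.object_id[i]


# Hypothetical sample data.
obs = Observations(object_id=["a", "b", "c"], magnitude=[12.5, 9.1, 14.0])
result = brightest(obs)
```

Because the class is frozen, its fields cannot be reassigned after construction, which removes one source of the hidden state the comment complains about; the lists themselves remain mutable in this simplified version.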

  • datafusion-ballista

    Apache Arrow Ballista Distributed Query Engine

  • Not super on topic, because this is all immature and not integrated with one another yet, but there is a scaled-out Rust dataframes-on-Arrow implementation called Ballista that could maybe form the backend of a Polars scale-out approach: https://github.com/apache/arrow-ballista

  • r-polars

    Bring polars to R

