This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
  • polars

    Dataframes powered by a multithreaded, vectorized query engine, written in Rust

  • - handling of categoricals in polars seemed a little underbaked, though my main complaint, that categories cannot be pre-defined, seems to have been recently addressed: https://github.com/pola-rs/polars/issues/10705

  • prql

    PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

  • I am very curious to know how you feel about PRQL (prql-lang.org) ? IMHO it gives you the ergonomics and DX of Polars or Pandas with the power and universality of SQL because you can still execute your queries on any SQL compatible query execution engine of your choice, including Polars and Pandas but also DuckDB, ClickHouse, BigQuery, Redshift, Postgres, Trino/Presto, SQLite, ... to name just a few popular ones.

    The join syntax and semantics is one of the trickiest parts and is under discussion again recently. It's actually one of the key parts of any data transformation platform and is foundational to Relational Algebra, being right there in the "Relational" part and also the R in PRQL. Most of the PRQL built-in primitive transforms are just simple list manipulations like map, filter or reduce but joins require care to preserve monadic composition (see for example the design of SelectMany in LINQ or flatmap in the List Monad). See this comment for some of my thoughts on this: https://github.com/PRQL/prql/issues/3782#issuecomment-181131...

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • db-benchmark

    reproducible benchmark of database-like ops

  • Real-world performance is complicated since data science covers a lot of use cases.

    If you're just reading a small CSV to do analysis on it, then there will be no human-perceptible difference between Polars and Pandas. If you're reading a larger CSV with 100k rows, there still won't be much of a perceptible difference.

    Per this (old) benchmark, there are differences once you get into 500MB+ territory: https://h2oai.github.io/db-benchmark/

  • explorer

    Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir

  • The Explorer library [0] in Elixir uses Polars underneath it.

    [0] https://github.com/elixir-explorer/explorer

  • db-benchmark

    reproducible benchmark of database-like ops (by duckdblabs)

  • DuckDB maintains a benchmark of open source database-like tools, including Polars and Pandas


  • scikit-learn

    scikit-learn: machine learning in Python

  • sklearn is adding support through the dataframe interchange protocol (https://github.com/scikit-learn/scikit-learn/issues/25896). scipy, as far as I know, doesn't explicitly support dataframes (it just happens to work when you wrap a Series in `np.array` or `np.asarray`). I don't know about PyTorch but in general you can convert to numpy.

  • quivr

    Python library for working with Arrow data in tabular form (by B612-Asteroid-Institute)

  • Polars is cool, but man, I really have come to think that dataframes are disastrous for software. The mess of internal state and confusion of writing functions that take “df” and manipulate it - its all so hard to clean up once you’re deep in the mess.

    Quivr (https://github.com/spenczar/quivr) is an alternative approach that has been working for me. Maybe types are good!

  • SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

    SaaSHub logo
  • datafusion-ballista

    Apache Arrow Ballista Distributed Query Engine

  • Not super on topic because this is all immature and not integrated with one another yet, but there is a scaled-out rust data-frames-on-arrow implementation called ballista that could maybe? form the backend of a polars scale out approach: https://github.com/apache/arrow-ballista

  • r-polars

    Bring polars to R

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • Pure Python Distributed SQL Engine

    9 projects | news.ycombinator.com | 30 Dec 2022
  • How moving from Pandas to Polars made me write better code without writing better code

    2 projects | dev.to | 5 Mar 2024
  • I used multiprocessing and multithreading at the same time to drop the execution time of my code from 155+ seconds to just over 2+ seconds

    1 project | /r/Python | 29 May 2023
  • Data Engineering with Rust

    5 projects | /r/rust | 9 May 2023
  • Any job processing framework like Spark but in Rust?

    4 projects | /r/dataengineering | 23 Mar 2023