Top 23 Dataframe Open-Source Projects

polars

144 26,043 10.0 Rust

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Project mention: Why Python's Integer Division Floors (2010) | news.ycombinator.com | 2024-02-28

This is because 0.1 is in actuality the floating point value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.
I found out about this recently because in Polars I defined a // b for floats to be (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded division is rather expensive, I chose to stick to this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
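
The discrepancy described in this comment can be reproduced with a minimal Python sketch that uses only the standard library. The variable names are illustrative, and `math.floor(a / b)` stands in for the `(a / b).floor()` definition quoted above; it is not Polars code.

```python
import math
from decimal import Decimal

# Exact value of the double nearest to 0.1 (slightly above one tenth).
exact_tenth = Decimal(0.1)
print(exact_tenth)

# True division rounds the quotient to the nearest double, which is 10.0.
rounded_quotient = 1 / 0.1
print(rounded_quotient)

# Python's // floors the exact quotient, which lies just below 10.
python_floordiv = 1 // 0.1
print(python_floordiv)

# Flooring the already-rounded quotient, as in (a / b).floor().
floor_of_rounded = math.floor(1 / 0.1)
print(floor_of_rounded)
```

The two floor results differ because `//` works from the exact quotient, while the second form floors a value that has already been rounded up to 10.0.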

pygwalker

22 9,759 9.6 Python

PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis

Project mention: Show HN: Use an "eraser" to clean data on flight without breaking your workflow | news.ycombinator.com | 2024-03-15

modin

11 9,465 9.6 Python

Modin: Scale your Pandas workflows by changing a single line of code

Project mention: The Distributed Tensor Algebra Compiler (2022) | news.ycombinator.com | 2023-06-15

vaex

7 8,173 6.0 Python

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

cudf

23 7,274 9.9 C++

cuDF - GPU DataFrame Library

Project mention: A Polars exploration into Kedro | dev.to | 2023-05-17

The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.

Smile

9 5,921 9.8 Java

Statistical Machine Intelligence & Learning Engine

Project mention: The Current State of Clojure's Machine Learning Ecosystem | news.ycombinator.com | 2024-04-07

> I don't think it's right to recommend that new users move away from the package because of licensing issues
I was going to chime in to agree but then I saw how this was done - a completely innocuous looking commit:
https://github.com/haifengl/smile/commit/6f22097b233a3436519...
And literally no mention in the release notes:
https://github.com/haifengl/smile/releases/tag/v3.0.0
I think if you are going to change license especially in a way that makes it less permissive you need to be super open and clear about both the fact you are doing it and your reasons for that. This is done so silently as to look like it is intentionally trying to mislead and trick people.
So maybe I wouldn't say to move away because of the specific license, but it's legitimate to avoid something when it's so clearly driven by a single entity and that entity acts in a way that isn't trustworthy.

arrow-datafusion

55 4,924 9.9 Rust

Apache DataFusion SQL Query Engine

Project mention: Velox: Meta's Unified Execution Engine [pdf] | news.ycombinator.com | 2024-03-25

Python's Substrait seems like the biggest/most-used competitor-ish out there. I'd love some compare & contrast; my sense is that Substrait has a smaller ambition, and more wants to be a language for talking about execution rather than a full on execution engine. https://github.com/substrait-io/substrait
We can also see from the DataFusion discussion that they too see themselves as a bit of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441

pandas-ta

17 4,732 0.0 Python

Technical Analysis Indicators - Pandas TA is an easy to use Python 3 Pandas Extension with 150+ Indicators

Project mention: Help recreating ta-lib python MACDFIX in pure python | /r/algotrading | 2023-05-03

I do not know what the difference between MACD and MACDFIX is, but maybe you can take a look at how MACD is implemented in the pandas_ta library and modify it a bit to achieve the behavior you want.
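
For orientation, the textbook MACD definition (fast EMA minus slow EMA, with an EMA of that difference as the signal line) can be sketched in pure Python. This is a minimal sketch and not the pandas_ta or TA-Lib implementation: the function names `ema` and `macd` are hypothetical, the EMA is seeded with the first value, and 12/26/9 are the conventional default periods. Real libraries may seed and smooth differently, so outputs will not necessarily match theirs.

```python
def ema(values, span):
    """Exponential moving average seeded with the first value (an assumption)."""
    alpha = 2.0 / (span + 1)
    out = []
    prev = None
    for v in values:
        # First element seeds the average; later ones blend in with weight alpha.
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out


def macd(values, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) as lists of floats."""
    fast_ema = ema(values, fast)
    slow_ema = ema(values, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Hypothetical price series: a steady upward ramp.
prices = [float(p) for p in range(1, 61)]
macd_line, signal_line, histogram = macd(prices)
print(macd_line[-1], signal_line[-1], histogram[-1])
```

Adapting this toward a fixed-period variant would mostly mean changing how `alpha` is chosen and how the averages are seeded.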

danfojs

2 4,649 0.6 TypeScript

Danfo.js is an open source, JavaScript library providing high performance, intuitive, and easy to use data structures for manipulating and processing structured data.

Mimesis

3 4,304 9.1 Python

Mimesis is a powerful Python library that empowers developers to generate massive amounts of synthetic data efficiently.

Tablesaw

4 3,441 4.8 Java

Java dataframe and visualization library

koalas

2 3,319 4.6 Python

Koalas: pandas API on Apache Spark

PandasGUI

8 3,129 4.3 Python

A GUI for Pandas DataFrames

Project mention: PandasGUI: A GUI for Pandas DataFrames | news.ycombinator.com | 2023-08-19

mars

0 2,675 5.7 Python

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.

DataFrame

109 2,258 9.2 C++

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage

Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13

sketch

20 2,194 4.4 Python

AI code-writing assistant that understands data content

Project mention: Ask HN: What have you built with LLMs? | news.ycombinator.com | 2024-02-05

We've made a lot of data tooling things based on LLMs, and are in the process of rebranding and launching our main product.
1. sketch (in notebook, ai for pandas) https://github.com/approximatelabs/sketch
2. datadm (open source, "chat with data", with support for open source LLMs) https://github.com/approximatelabs/datadm
3. Our main product: julyp. https://julyp.com/ (currently under very active rebrand and cleanup) -- but a "chat with data" style app, with a lot of specialized features. I'm also streaming myself using it (and sometimes building it) every weekday on twitch to solve misc data problems (https://www.twitch.tv/bluecoconut)
For your next question, about the stack and deploy:

tidy-viewer

28 2,020 4.3 Rust

📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.

Project mention: Csvlens: Command line CSV file viewer. Like less but made for CSV | news.ycombinator.com | 2024-01-06

connector-x

11 1,769 7.9 Rust

Fastest library to load data from DB to DataFrames in Rust and Python

Project mention: How moving from Pandas to Polars made me write better code without writing better code | dev.to | 2024-03-05

This was originally a blocker; however, we managed to set up a multi-stage Docker build to build from source. Here is the GitHub issue where we, along with community members, managed to solve it.

Daft

7 1,666 9.8 Rust

Distributed DataFrame for Python designed for the cloud, powered by Rust

Project mention: Daft: Distributed DataFrame for Python | news.ycombinator.com | 2024-02-29

There are benchmarks here - https://github.com/Eventual-Inc/Daft?tab=readme-ov-file#benc.... Seems to outperform Dask by a fair bit.

hamilton

19 1,312 9.8 Jupyter Notebook

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage and metadata. Runs and scales everywhere python does.

Project mention: Using IPython Jupyter Magic commands to improve the notebook experience | dev.to | 2024-03-03

In this post, we’ll show how your team can turn any utility function(s) into reusable IPython Jupyter magics for a better notebook experience. As an example, we’ll use Hamilton, my open source library, to motivate the creation of a magic that facilitates better development ergonomics for using it. You needn’t know what Hamilton is to understand this post.

pyjanitor

4 1,279 8.2 Python

Clean APIs for data cleaning. Python implementation of R package Janitor

Project mention: Sub library with useful code | /r/learnpython | 2023-05-19

datafusion-ballista

12 1,275 8.4 Rust

Apache Arrow Ballista Distributed Query Engine

Project mention: Polars | news.ycombinator.com | 2024-01-08

Not super on topic because this is all immature and not integrated with one another yet, but there is a scaled-out rust data-frames-on-arrow implementation called ballista that could maybe? form the backend of a polars scale out approach: https://github.com/apache/arrow-ballista

arquero

8 1,186 5.1 JavaScript

Query processing and transformation of array-backed data tables.

Project mention: Show HN: Matrices – explore, visualize, and share large datasets | news.ycombinator.com | 2023-12-07

Hey HN, I'm excited to share a new side project I've been working on.
The product is called Matrices. You can check it out here: https://matrices.com/.
With Matrices, you can *explore*, *visualize*, and *share* large (100k rows) datasets–all without code. Filter data down to just what you want, visualize it with built-in charts, and share your results with one click.
You can use it today (no login or waitlist or anything). Just copy and paste your data from a google sheet or CSV file.
It's hard to describe the feeling of "gliding over data" you get with Matrices, so I'd rather *show* you how it works instead. This 75s video will give you a sense of how it works: https://www.youtube.com/watch?v=Rrh9_I3Ux8E.
Data is stored locally in your browser until you publish it, though a small sample does go to the OpenAI APIs for AI-assisted features.
I started building Matrices because I wanted a tool that made it easy to explore new datasets. When I'm first trying to dig into data, I'll have one question... that leads to another... that will invariably lead to five more questions. It's sort of a fractal process, and I couldn't find many good options that were fast, responsive, and visual.
I figured this crowd would be interested in the tech stack as well: it's using arquero [1] bindings over apache arrow for in-memory analytics, and visx [2] for visualizations. I'd like to add duckdb-wasm support at some point to open up a wider set of databases. Data is serialized as parquet to save a bit on bandwidth + storage.
Give it a spin, and let me know what you think. This is my first 'serious frontend project' so I appreciate any and all feedback and bug reports. Feel free to comment here (I'll be around most of the day), or shoot me a note: [email protected]
[1]: https://uwdata.github.io/arquero/

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Dataframe related posts

Plotting Financial Data in Kotlin with Kandy
3 projects | dev.to | 9 Apr 2024
Velox: Meta's Unified Execution Engine [pdf]
2 projects | news.ycombinator.com | 25 Mar 2024
Why Python's Integer Division Floors (2010)
1 project | news.ycombinator.com | 28 Feb 2024
New multithreaded version of C++ DataFrame was released
1 project | news.ycombinator.com | 13 Feb 2024
Polars
11 projects | news.ycombinator.com | 8 Jan 2024
Polars 0.20 Released
1 project | news.ycombinator.com | 16 Dec 2023
Polars: Dataframes powered by a multithreaded query engine, written in Rust
1 project | news.ycombinator.com | 7 Dec 2023

Index

What are some of the best open-source Dataframe projects? This list will help you:

#	Project	Stars
1	polars	26,043
2	pygwalker	9,759
3	modin	9,465
4	vaex	8,173
5	cudf	7,274
6	Smile	5,921
7	arrow-datafusion	4,924
8	pandas-ta	4,732
9	danfojs	4,649
10	Mimesis	4,304
11	Tablesaw	3,441
12	koalas	3,319
13	PandasGUI	3,129
14	mars	2,675
15	DataFrame	2,258
16	sketch	2,194
17	tidy-viewer	2,020
18	connector-x	1,769
19	Daft	1,666
20	hamilton	1,312
21	pyjanitor	1,279
22	datafusion-ballista	1,275
23	arquero	1,186