Top 23 Dataframe Open-Source Projects
-
vaex
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
-
pandas-ta
Technical Analysis Indicators - Pandas TA is an easy to use Python 3 Pandas Extension with 150+ Indicators
-
danfojs
Danfo.js is an open source, JavaScript library providing high performance, intuitive, and easy to use data structures for manipulating and processing structured data.
-
Mimesis
Mimesis is a powerful Python library that empowers developers to generate massive amounts of synthetic data efficiently.
-
mars
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
-
DataFrame
C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
-
tidy-viewer
📺 (tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.
-
hamilton
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage and metadata. Runs and scales everywhere python does.
This is because 0.1 is actually the floating-point value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.
I found out about this recently because in Polars I defined a // b for floats to be (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded division is rather expensive, I chose to stick to this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
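The behavior described in this comment can be reproduced in plain Python. The snippet below is a minimal sketch, not the Polars implementation: the variable names are illustrative, and it only contrasts Python's built-in float `//` with flooring the already-rounded quotient, which is the approach the comment says Polars takes.

```python
import math
from decimal import Decimal

# The double closest to 0.1 is slightly larger than one tenth;
# Decimal(float) shows its exact stored value.
exact_tenth = Decimal(0.1)

# True division rounds the quotient to the nearest double, which is 10.0.
true_div = 1 / 0.1

# Python's float floor division works from the stored operands, whose
# quotient is just below 10, so the result is 9.0.
floor_div = 1 // 0.1

# Flooring after the rounded division, as in (a / b).floor(), gives 10.
floor_of_div = math.floor(1 / 0.1)

print(exact_tenth)
print(true_div, floor_div, floor_of_div)
```

The two strategies disagree only when the exact quotient sits just below an integer and rounds up to it, as it does here.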
Project mention: Show HN: Use an "eraser" to clean data on flight without breaking your workflow | news.ycombinator.com | 2024-03-15
The interesting thing about Polars is that it does not try to be a drop-in replacement to pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.
Project mention: The Current State of Clojure's Machine Learning Ecosystem | news.ycombinator.com | 2024-04-07
> I don't think it's right to recommend that new users move away from the package because of licensing issues
I was going to chime in to agree but then I saw how this was done - a completely innocuous looking commit:
https://github.com/haifengl/smile/commit/6f22097b233a3436519...
And literally no mention in the release notes:
https://github.com/haifengl/smile/releases/tag/v3.0.0
I think if you are going to change the license, especially in a way that makes it less permissive, you need to be super open and clear about both the fact that you are doing it and your reasons for it. This was done so silently that it looks like an intentional attempt to mislead and trick people.
So maybe I wouldn't say to move away because of the specific license, but it's legitimate to avoid something when it's so clearly driven by a single entity and that entity acts in a way that isn't trustworthy.
Python's Substrait seems like the biggest/most-used competitor-ish out there. I'd love some compare & contrast; my sense is that Substrait has a smaller ambition, and more wants to be a language for talking about execution rather than a full on execution engine. https://github.com/substrait-io/substrait
We can also see from the DataFusion discussion that they too see themselves as a bit of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441
I do not know what the difference between MACD and MACDFIX is, but maybe you can look at how MACD is implemented in the pandas_ta library and modify it a bit to achieve the behavior you want.
Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13
We've made a lot of data tooling things based on LLMs, and are in the process of rebranding and launching our main product.
1. sketch (in notebook, ai for pandas) https://github.com/approximatelabs/sketch
2. datadm (open source, "chat with data", with support for open source LLMs) https://github.com/approximatelabs/datadm
3. Our main product: julyp, https://julyp.com/ (currently under very active rebrand and cleanup), a "chat with data" style app with a lot of specialized features. I also stream myself using it (and sometimes building it) every weekday on Twitch to solve misc data problems (https://www.twitch.tv/bluecoconut)
For your next question, about the stack and deploy:
Project mention: Csvlens: Command line CSV file viewer. Like less but made for CSV | news.ycombinator.com | 2024-01-06
Project mention: How moving from Pandas to Polars made me write better code without writing better code | dev.to | 2024-03-05
This was originally a blocker, however, we managed to set up a multi-stage Docker build to build from source. Here is the Github issue where we, along with community members, managed to solve it.
There are benchmarks here - https://github.com/Eventual-Inc/Daft?tab=readme-ov-file#benc.... Seems to outperform Dask by a fair bit.
Project mention: Using IPython Jupyter Magic commands to improve the notebook experience | dev.to | 2024-03-03
In this post, we’ll show how your team can turn any utility function(s) into reusable IPython Jupyter magics for a better notebook experience. As an example, we’ll use Hamilton, my open source library, to motivate the creation of a magic that facilitates better development ergonomics for using it. You needn’t know what Hamilton is to understand this post.
Not super on topic because this is all immature and not integrated with one another yet, but there is a scaled-out rust data-frames-on-arrow implementation called ballista that could maybe? form the backend of a polars scale out approach: https://github.com/apache/arrow-ballista
Project mention: Show HN: Matrices – explore, visualize, and share large datasets | news.ycombinator.com | 2023-12-07
Hey HN, I'm excited to share a new side project I've been working on.
The product is called Matrices. You can check it out here: https://matrices.com/.
With Matrices, you can *explore*, *visualize*, and *share* large (100k rows) datasets–all without code. Filter data down to just what you want, visualize it with built-in charts, and share your results with one click.
You can use it today (no login or waitlist or anything). Just copy and paste your data from a google sheet or CSV file.
It's hard to describe the feeling of "gliding over data" you get with Matrices, so I'd rather *show* you how it works instead. This 75s video will give you a sense of how it works: https://www.youtube.com/watch?v=Rrh9_I3Ux8E.
Data is stored locally in your browser until you publish it, though a small sample does go to the OpenAI APIs for AI-assisted features.
I started building Matrices because I wanted a tool that made it easy to explore new datasets. When I'm first trying to dig into data, I'll have one question... that leads to another... that will invariably lead to five more questions. It's sort of a fractal process, and I couldn't find many good options that were fast, responsive, and visual.
I figured this crowd would be interested in the tech stack as well: it uses arquero [1] bindings over Apache Arrow for in-memory analytics, and visx [2] for visualizations. I'd like to add duckdb-wasm support at some point to open up a wider set of databases. Data is serialized as Parquet to save a bit on bandwidth and storage.
Give it a spin, and let me know what you think. This is my first 'serious frontend project' so I appreciate any and all feedback and bug reports. Feel free to comment here (I'll be around most of the day), or shoot me a note: [email protected]
[1]: https://uwdata.github.io/arquero/
Dataframe related posts
- Plotting Financial Data in Kotlin with Kandy
- Velox: Meta's Unified Execution Engine [pdf]
- Why Python's Integer Division Floors (2010)
- New multithreaded version of C++ DataFrame was released
- Polars
- Polars 0.20 Released
- Polars: Dataframes powered by a multithreaded query engine, written in Rust
Index
What are some of the best open-source Dataframe projects? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | polars | 26,043 |
2 | pygwalker | 9,759 |
3 | modin | 9,465 |
4 | vaex | 8,173 |
5 | cudf | 7,274 |
6 | Smile | 5,921 |
7 | arrow-datafusion | 4,924 |
8 | pandas-ta | 4,732 |
9 | danfojs | 4,649 |
10 | Mimesis | 4,304 |
11 | Tablesaw | 3,441 |
12 | koalas | 3,319 |
13 | PandasGUI | 3,129 |
14 | mars | 2,675 |
15 | DataFrame | 2,258 |
16 | sketch | 2,194 |
17 | tidy-viewer | 2,020 |
18 | connector-x | 1,769 |
19 | Daft | 1,666 |
20 | hamilton | 1,312 |
21 | pyjanitor | 1,279 |
22 | datafusion-ballista | 1,275 |
23 | arquero | 1,186 |