Dataframe

Top 23 Dataframe Open-Source Projects

  • polars

    Dataframes powered by a multithreaded, vectorized query engine, written in Rust

  • Project mention: Why Python's Integer Division Floors (2010) | news.ycombinator.com | 2024-02-28

    This is because 0.1 is in actuality the floating point value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.

    I found out about this recently because in Polars I defined a // b for floats to be (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded division is rather expensive, I chose to stick to this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
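
    The behavior described above can be reproduced with the Python standard library alone. This is a minimal sketch; the variable names are illustrative, and it shows only CPython's float semantics, not Polars' implementation.

```python
import math
from decimal import Decimal

# Decimal(float) shows the exact binary value stored for the literal 0.1.
exact = Decimal(0.1)
print(exact)  # slightly larger than one tenth

# True division rounds the quotient to the nearest float, which is 10.0.
true_div = 1 / 0.1

# Python's // floors the exact quotient, which is slightly below 10.
floor_div = 1 // 0.1

# Flooring the already-rounded quotient, the (a / b).floor() approach.
floored_quotient = math.floor(1 / 0.1)

print(true_div, floor_div, floored_quotient)
```

    The two floor-division strategies disagree only when the rounded quotient lands exactly on an integer that the exact quotient does not reach.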

  • pygwalker

    PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis

  • Project mention: Show HN: Use an "eraser" to clean data on flight without breaking your workflow | news.ycombinator.com | 2024-03-15
  • modin

    Modin: Scale your Pandas workflows by changing a single line of code

  • Project mention: The Distributed Tensor Algebra Compiler (2022) | news.ycombinator.com | 2023-06-15
  • vaex

    Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

  • cudf

    cuDF - GPU DataFrame Library

  • Project mention: A Polars exploration into Kedro | dev.to | 2023-05-17

    The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.

  • Smile

    Statistical Machine Intelligence & Learning Engine

  • Project mention: The Current State of Clojure's Machine Learning Ecosystem | news.ycombinator.com | 2024-04-07

    > I don't think it's right to recommend that new users move away from the package because of licensing issues

    I was going to chime in to agree but then I saw how this was done - a completely innocuous looking commit:

    https://github.com/haifengl/smile/commit/6f22097b233a3436519...

    And literally no mention in the release notes:

    https://github.com/haifengl/smile/releases/tag/v3.0.0

    I think if you are going to change a license, especially in a way that makes it less permissive, you need to be super open and clear about both the fact you are doing it and your reasons for that. This was done so silently that it looks like an intentional attempt to mislead and trick people.

    So maybe I wouldn't say to move away because of the specific license, but it's legitimate to avoid something when it's so clearly driven by a single entity and that entity acts in a way that isn't trustworthy.

  • arrow-datafusion

    Apache DataFusion SQL Query Engine

  • Project mention: Velox: Meta's Unified Execution Engine [pdf] | news.ycombinator.com | 2024-03-25

    Substrait seems like the biggest/most-used competitor-ish out there. I'd love some compare & contrast; my sense is that Substrait has a smaller ambition, and more wants to be a language for talking about execution rather than a full on execution engine. https://github.com/substrait-io/substrait

    We can also see from the DataFusion discussion that they too see themselves as a bit of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441

  • pandas-ta

    Technical Analysis Indicators - Pandas TA is an easy to use Python 3 Pandas Extension with 150+ Indicators

  • Project mention: Help recreating ta-lib python MACDFIX in pure python | /r/algotrading | 2023-05-03

    I do not know what the difference between MACD and MACDFIX is, but maybe you can take a look at how MACD is implemented in the pandas_ta library and modify it a bit to achieve the behavior you want.
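
    As a starting point for that kind of port, here is a minimal pure-Python sketch of the textbook MACD calculation. The function names `ema` and `macd` are hypothetical, and seeding each EMA with the first value is an assumption. It is not confirmed to match the output of pandas_ta or TA-Lib's MACDFIX, since libraries differ in how they seed and smooth the averages.

```python
def ema(values, period):
    """Exponential moving average with alpha = 2 / (period + 1).

    Assumption: the average is seeded with the first input value.
    """
    alpha = 2.0 / (period + 1)
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out


def macd(close, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) as equal-length lists."""
    fast_ema = ema(close, fast)
    slow_ema = ema(close, slow)
    # MACD line: fast EMA minus slow EMA.
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    # Signal line: EMA of the MACD line itself.
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Usage on a steadily rising price series.
prices = [100.0 + i for i in range(60)]
macd_line, signal_line, histogram = macd(prices)
print(macd_line[-1], signal_line[-1], histogram[-1])
```

    Comparing the first few dozen outputs against the reference library is the quickest way to see whether the seeding convention needs to change.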

  • danfojs

    Danfo.js is an open source, JavaScript library providing high performance, intuitive, and easy to use data structures for manipulating and processing structured data.

  • Mimesis

    Mimesis is a powerful Python library that empowers developers to generate massive amounts of synthetic data efficiently.

  • Tablesaw

    Java dataframe and visualization library

  • koalas

    Koalas: pandas API on Apache Spark

  • PandasGUI

    A GUI for Pandas DataFrames

  • Project mention: PandasGUI: A GUI for Pandas DataFrames | news.ycombinator.com | 2023-08-19
  • mars

    Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.

  • DataFrame

    C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage

  • Project mention: New multithreaded version of C++ DataFrame was released | news.ycombinator.com | 2024-02-13
  • sketch

    AI code-writing assistant that understands data content

  • Project mention: Ask HN: What have you built with LLMs? | news.ycombinator.com | 2024-02-05

    We've made a lot of data tooling things based on LLMs, and are in the process of rebranding and launching our main product.

    1. sketch (in notebook, ai for pandas) https://github.com/approximatelabs/sketch

    2. datadm (open source, "chat with data", with support for the open source LLMs) (https://github.com/approximatelabs/datadm)

    3. Our main product: julyp. https://julyp.com/ (currently under very active rebrand and cleanup) -- but a "chat with data" style app, with a lot of specialized features. I'm also streaming me using it (and sometimes building it) every weekday on twitch to solve misc data problems (https://www.twitch.tv/bluecoconut)

    For your next question, about the stack and deploy:

  • tidy-viewer

    📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.

  • Project mention: Csvlens: Command line CSV file viewer. Like less but made for CSV | news.ycombinator.com | 2024-01-06
  • connector-x

    Fastest library to load data from DB to DataFrames in Rust and Python

  • Project mention: How moving from Pandas to Polars made me write better code without writing better code | dev.to | 2024-03-05

    This was originally a blocker; however, we managed to set up a multi-stage Docker build to build from source. Here is the GitHub issue where we, along with community members, managed to solve it.

  • Daft

    Distributed DataFrame for Python designed for the cloud, powered by Rust

  • Project mention: Daft: Distributed DataFrame for Python | news.ycombinator.com | 2024-02-29

    There are benchmarks here - https://github.com/Eventual-Inc/Daft?tab=readme-ov-file#benc.... Seems to outperform Dask by a fair bit.

  • hamilton

    Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage and metadata. Runs and scales everywhere python does.

  • Project mention: Using IPython Jupyter Magic commands to improve the notebook experience | dev.to | 2024-03-03

    In this post, we’ll show how your team can turn any utility function(s) into reusable IPython Jupyter magics for a better notebook experience. As an example, we’ll use Hamilton, my open source library, to motivate the creation of a magic that facilitates better development ergonomics for using it. You needn’t know what Hamilton is to understand this post.

  • pyjanitor

    Clean APIs for data cleaning. Python implementation of R package Janitor

  • Project mention: Sub library with useful code | /r/learnpython | 2023-05-19
  • datafusion-ballista

    Apache Arrow Ballista Distributed Query Engine

  • Project mention: Polars | news.ycombinator.com | 2024-01-08

    Not super on topic because this is all immature and not integrated with one another yet, but there is a scaled-out rust data-frames-on-arrow implementation called ballista that could maybe? form the backend of a polars scale out approach: https://github.com/apache/arrow-ballista

  • arquero

    Query processing and transformation of array-backed data tables.

  • Project mention: Show HN: Matrices – explore, visualize, and share large datasets | news.ycombinator.com | 2023-12-07

    Hey HN, I'm excited to share a new side project I've been working on.

    The product is called Matrices. You can check it out here: https://matrices.com/.

    With Matrices, you can *explore*, *visualize*, and *share* large (100k rows) datasets–all without code. Filter data down to just what you want, visualize it with built-in charts, and share your results with one click.

    You can use it today (no login or waitlist or anything). Just copy and paste your data from a google sheet or CSV file.

    It's hard to describe the feeling of "gliding over data" you get with Matrices, so I'd rather *show* you how it works instead. This 75s video will give you a sense of how it works: https://www.youtube.com/watch?v=Rrh9_I3Ux8E.

    Data is stored locally in your browser until you publish it, though a small sample does go to the OpenAI APIs for AI-assisted features.

    I started building Matrices because I wanted a tool that made it easy to explore new datasets. When I'm first trying to dig into data, I'll have one question... that leads to another... that will invariably lead to five more questions. It's sort of a fractal process, and I couldn't find many good options that were fast, responsive, and visual.

    I figured this crowd would be interested in tech stack as well, it's using arquero [1] bindings over apache arrow for in-memory analytics, and visx [2] for visualizations. I'd like to add duckdb-wasm support at some point to open up a wider set of databases. Data is serialized as parquet to save a bit on bandwidth + storage.

    Give it a spin, and let me know what you think. This is my first 'serious frontend project' so I appreciate any and all feedback and bug reports. Feel free to comment here (I'll be around most of the day), or shoot me a note: [email protected]

    [1]: https://uwdata.github.io/arquero/

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Dataframe projects? This list will help you:

Rank  Project              Stars
1     polars               26,043
2     pygwalker             9,759
3     modin                 9,465
4     vaex                  8,173
5     cudf                  7,274
6     Smile                 5,921
7     arrow-datafusion      4,924
8     pandas-ta             4,732
9     danfojs               4,649
10    Mimesis               4,304
11    Tablesaw              3,441
12    koalas                3,319
13    PandasGUI             3,129
14    mars                  2,675
15    DataFrame             2,258
16    sketch                2,194
17    tidy-viewer           2,020
18    connector-x           1,769
19    Daft                  1,666
20    hamilton              1,312
21    pyjanitor             1,279
22    datafusion-ballista   1,275
23    arquero               1,186
