Top 23 Python Dask Projects

Dask

32 11,982 9.7 Python

Parallel computing with task scheduling

Project mention: The Distributed Tensor Algebra Compiler (2022) | news.ycombinator.com | 2023-06-15

ibis

23 4,074 10.0 Python

the portable Python dataframe library

Project mention: Show HN: Hashquery, a Python library for defining reusable analysis | news.ycombinator.com | 2024-04-23

I really don't understand the appeal of dbt vs a proper programming language. The templating approach leads to massive spaghetti. I look forward to trying out something like Ibis [0]
0: https://ibis-project.org/

xarray

7 3,404 9.7 Python

N-D labeled arrays and datasets in Python
stumpy

6 2,984 7.9 Python

STUMPY is a powerful and scalable Python library for modern time series analysis

Project mention: Stumpy: Matrix profile time series analysis | news.ycombinator.com | 2024-03-03

mars

0 2,677 5.7 Python

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
swifter

3 2,459 5.5 Python

A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner (by jmcarpenter2)
fugue

11 1,876 6.7 Python

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

Project mention: FLaNK Stack Weekly 22 January 2024 | dev.to | 2024-01-22

distributed

3 1,541 9.6 Python

A distributed task scheduler for Dask
Optimus

0 1,446 1.9 Python

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
Eliot

1 1,083 7.2 Python

Eliot: the logging system that tells you *why* it happened
mlforecast

11 713 8.8 Python

Scalable machine 🤖 learning for time series forecasting.

Project mention: Sales forecast for next two years | /r/datascience | 2023-06-25

MLForecast

pystore

8 539 0.0 Python

Fast data store for Pandas time-series data
dask-sql

1 363 8.4 Python

Distributed SQL Engine in Python using Dask

Project mention: FLaNK Stack Weekly for 20 June 2023 | dev.to | 2023-06-20

nebari

1 256 9.3 Python

🪴 Nebari - your open source data science platform (by nebari-dev)
amazon-sagemaker-local-mode

1 228 7.9 Python

Amazon SageMaker Local Mode Examples

Project mention: Debugging Python Code in Amazon SageMaker Locally Using Visual Studio Code and PyCharm: A Step-by-Step Guide | dev.to | 2023-11-15

git clone https://github.com/aws-samples/amazon-sagemaker-local-mode/
cd amazon-sagemaker-local-mode/general_pipeline_local_debug
python3 -m venv .venv
source .venv/bin/activate
pip install jupyter
jupyter lab

stackstac

1 222 5.6 Python

Turn a STAC catalog into a dask-based xarray
aicsimageio

1 192 6.8 Python

Image Reading, Metadata Conversion, and Image Writing for Microscopy Images in Python
xgboost_ray

1 131 5.8 Python

Distributed XGBoost on Ray
bytehub

3 57 0.0 Python

ByteHub: making feature stores simple
dask-awkward

1 56 9.3 Python

Native Dask collection for awkward arrays, and the library to use it.
dask-memusage

0 24 0.0 Python

A low-impact profiler to figure out how much memory each task in Dask is using
steam-data-engineering

1 20 10.0 Python

A data engineering project with Airflow, dbt, Terraform, GCP and much more!
pangeo-binder

1 18 0.0 Python

Pangeo + Binder (dev repo for a binder/pangeo fusion concept)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Dask related posts

Stumpy: Matrix profile time series analysis
1 project | news.ycombinator.com | 3 Mar 2024
Shuffling large data at constant memory in Dask
1 project | /r/Python | 17 Apr 2023
Fugue: A unified interface for distributed computing
1 project | news.ycombinator.com | 26 Mar 2023
[Discussion] Open Source beats Google's AutoML for Time series
1 project | /r/MachineLearning | 28 Feb 2023
File format for large data with many columns
2 projects | /r/Python | 15 May 2022
Time Series Analysis for air pollution data not aligned [R] [P]
1 project | /r/MachineLearning | 23 Apr 2022
What is the best way to save a csv.file in number only ? PC hangs when my file is more than 2GB
2 projects | /r/learnpython | 4 Apr 2022

Index

What are some of the best open-source Dask projects in Python? This list will help you:

#	Project	Stars
1	Dask	11,982
2	ibis	4,074
3	xarray	3,404
4	stumpy	2,984
5	mars	2,677
6	swifter	2,459
7	fugue	1,876
8	distributed	1,541
9	Optimus	1,446
10	Eliot	1,083
11	mlforecast	713
12	pystore	539
13	dask-sql	363
14	nebari	256
15	amazon-sagemaker-local-mode	228
16	stackstac	222
17	aicsimageio	192
18	xgboost_ray	131
19	bytehub	57
20	dask-awkward	56
21	dask-memusage	24
22	steam-data-engineering	20
23	pangeo-binder	18