If you want built-in GPU support (and distributed), you should check out cuNumeric (released by NVIDIA in the last week or so). Also avoids needing to manually specify chunk sizes, like it says in a sibling comment.
https://github.com/nv-legate/cunumeric
Having used Ray, Dask, and hand-written custom threads, my personal view is that while there are advantages, I wouldn't want to use any of these unless absolutely necessary.
My personal approach for most of these tasks is to break the problem down to be as asynchronous as possible. Then you can create threads.
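As a minimal sketch of that approach, using only the standard library (the task function and inputs here are hypothetical placeholders): once the work is decomposed into independent pieces, a thread pool covers a lot of cases without any framework.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical independent unit of work; in practice this would be
# an I/O-bound call (HTTP request, file read, etc.).
def work(text):
    return len(text)

items = ["alpha", "beta", "gamma"]

# Each item is processed independently, so the problem maps cleanly
# onto a pool of threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    lengths = list(pool.map(work, items))

print(lengths)  # [5, 4, 5]
```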
The nice thing about Dask is really that you can effectively use it as a pandas dataframe.
Having said that, we opted to write our own parallelization for this library:
https://github.com/capitalone/DataProfiler
rather than use the Dask frame. Effectively, the Dask frame carries high overhead, and it was easier to maintain the threading ourselves given the particular approaches we took.
That said, if I were working with large pandas dataframes, I'd likely use Dask. For large datasets which couldn't be stored in memory, I'd use ray.io.
I see they also have a pandas replacement: https://github.com/nv-legate/legate.pandas. How is it different from cuDF?
We have been using dask to support our computational pathology workflows [1], where the images are so big that they cannot be loaded in memory, let alone analyzed (standard pathology whole slide images are ~1GB; some microscopy techniques generate images >1TB). We divide each image into a bunch of smaller tiles and process each tile independently. The dask.distributed scheduler lets us scale up by distributing the tile processing across a cluster.
Benefits of dask.distributed: easy to get up and running, and has support for spinning up clusters on lots of different computing platforms (local machines, HPC cluster, k8s, etc.)
One difficulty is optimizing performance - there are so many configuration details (job size, number of workers, worker resources, etc. etc.) that it's been hard to know what is best.
[1] https://github.com/Dana-Farber-AIOS/pathml
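The tiling pattern described above can be sketched with `dask.array`, where chunks play the role of the tiles (the image here is a small random stand-in, and the per-tile threshold is a hypothetical placeholder for real pathology processing):

```python
import dask.array as da

# Stand-in for a whole-slide image too large for memory: a dask array
# whose 128x128 chunks act as independent tiles.
image = da.random.random((512, 512), chunks=(128, 128))

# Hypothetical per-tile operation (a simple threshold). map_blocks
# applies it to each chunk independently, so dask.distributed can
# farm the tiles out across workers on a cluster.
def process_tile(tile):
    return tile > 0.5

mask = image.map_blocks(process_tile).compute()
print(mask.shape)  # (512, 512)
```

With `dask.distributed`, pointing a `Client` at a cluster is enough for the same graph to run across many machines instead of local threads.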
Shout out to an alternative to Dask: MPIRE https://github.com/Slimmer-AI/mpire
I would not recommend Dask. We use it just for simple job scheduling (that is, none of its fancy data structures) and run into issues just getting the work done efficiently. This issue, for instance, keeps the cluster from actually being utilized fully: https://github.com/dask/distributed/issues/4501. I feel like I'm on crazy pills, because it seems pretty serious yet it's gotten no attention.
You can probably use https://github.com/rapidsai/cudf/tree/main/python/dask_cudf, a Dask wrapper around cuDF.