Dask – a flexible library for parallel computing in Python

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • Dask

    Parallel computing with task scheduling

  • cunumeric

    An Aspiring Drop-In Replacement for NumPy at Scale

  • If you want built-in GPU support (and distributed), you should check out cuNumeric (released by NVIDIA in the last week or so). Also avoids needing to manually specify chunk sizes, like it says in a sibling comment.

    https://github.com/nv-legate/cunumeric

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

  • Having used both ray, dask, and writing custom threads, my personal view is that while there are advantages I wouldn’t want to use any of these unless absolutely necessary.

    My personal approach for most of these tasks are to try to break down the problem to be as asynchronous as possible. Then you can create threads.

    The nice thing about dask is really the way you can effectively use it as a pandas dataframe.

    Having said that, we opted to write our own parallelization for this library:

    https://github.com/capitalone/DataProfiler

    As opposed to using the dask frame. Effectively, it’s a high overhead and easier to maintain the threading ourselves given the particular approaches taken.

    That said, if I was working with large pandas dataframes, id likely use dask. For large datasets which couldn’t be stored in memory of use ray.io

  • legate.pandas

    An Aspiring Drop-In Replacement for Pandas at Scale

  • I see they also have have pandas replacement: https://github.com/nv-legate/legate.pandas. How is it different from cuDF?

  • pathml

    Tools for computational pathology

  • We have been using dask to support our computational pathology workflows [1], where the images are so big that they cannot be loaded in memory, let alone analyzed (standard pathology whole slide images are ~1GB; some microscopy techniques generate images >1TB). We divide each image into a bunch of smaller tiles and process each tile independently. The dask.distributed scheduler lets us scale up by distributing the tile processing across a cluster.

    Benefits of dask.distributed: easy to get up and running, and has support for spinning up clusters on lots of different computing platforms (local machines, HPC cluster, k8s, etc.)

    One difficulty is optimizing performance - there are so many configuration details (job size, number of workers, worker resources, etc. etc.) that it's been hard to know what is best.

    [1] https://github.com/Dana-Farber-AIOS/pathml

  • mpire

    A Python package for easy multiprocessing, but faster than multiprocessing

  • Shout out to an alternative to Dask: MPIRE https://github.com/Slimmer-AI/mpire

  • distributed

    A distributed task scheduler for Dask

  • I would not recommend Dask. We use it just for simple job scheduling (that is, none of its fancy data structures) and run into issues just getting the work done efficiently. This issue, for instance, keeps the cluster from actually being utilized fully: https://github.com/dask/distributed/issues/4501. I feel like I'm on crazy pills, because it seems pretty serious yet it's gotten no attention.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • cudf

    cuDF - GPU DataFrame Library

  • You can probably use https://github.com/rapidsai/cudf/tree/main/python/dask_cudf a dask wrapper around cuDF.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts