If you want built-in GPU support (and distributed), you should check out cuNumeric (released by NVIDIA in the last week or so). Also avoids needing to manually specify chunk sizes, like it says in a sibling comment.
https://github.com/nv-legate/cunumeric
Having used Ray, Dask, and hand-written custom threads, my personal view is that while there are advantages, I wouldn't want to use any of these unless absolutely necessary.
My personal approach for most of these tasks is to break the problem down to be as asynchronous as possible. Then you can create threads.
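As a minimal sketch of that approach, using only the standard library (the task function and inputs here are hypothetical placeholders): once the work is decomposed into independent pieces, a thread pool covers a lot of cases without any framework.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical independent unit of work; in practice this would be
# an I/O-bound call (HTTP request, file read, etc.).
def work(text):
    return len(text)

items = ["alpha", "beta", "gamma"]

# Each item is processed independently, so the problem maps cleanly
# onto a pool of threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    lengths = list(pool.map(work, items))

print(lengths)  # [5, 4, 5]
```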
The nice thing about Dask is really that you can effectively use it as a pandas dataframe.
Having said that, we opted to write our own parallelization for this library:
https://github.com/capitalone/DataProfiler
rather than use the Dask frame. Effectively, the Dask frame carries high overhead, and it was easier to maintain the threading ourselves given the particular approaches we took.
That said, if I were working with large pandas dataframes, I'd likely use Dask. For large datasets which couldn't be stored in memory, I'd use ray.io.
I see they also have a pandas replacement: https://github.com/nv-legate/legate.pandas. How is it different from cuDF?
We have been using dask to support our computational pathology workflows [1], where the images are so big that they cannot be loaded in memory, let alone analyzed (standard pathology whole slide images are ~1GB; some microscopy techniques generate images >1TB). We divide each image into a bunch of smaller tiles and process each tile independently. The dask.distributed scheduler lets us scale up by distributing the tile processing across a cluster.
Benefits of dask.distributed: easy to get up and running, and has support for spinning up clusters on lots of different computing platforms (local machines, HPC cluster, k8s, etc.)
One difficulty is optimizing performance - there are so many configuration details (job size, number of workers, worker resources, etc. etc.) that it's been hard to know what is best.
[1] https://github.com/Dana-Farber-AIOS/pathml
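The tiling pattern described above can be sketched with `dask.array`, where chunks play the role of the tiles (the image here is a small random stand-in, and the per-tile threshold is a hypothetical placeholder for real pathology processing):

```python
import dask.array as da

# Stand-in for a whole-slide image too large for memory: a dask array
# whose 128x128 chunks act as independent tiles.
image = da.random.random((512, 512), chunks=(128, 128))

# Hypothetical per-tile operation (a simple threshold). map_blocks
# applies it to each chunk independently, so dask.distributed can
# farm the tiles out across workers on a cluster.
def process_tile(tile):
    return tile > 0.5

mask = image.map_blocks(process_tile).compute()
print(mask.shape)  # (512, 512)
```

With `dask.distributed`, pointing a `Client` at a cluster is enough for the same graph to run across many machines instead of local threads.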
Shout out to an alternative to Dask: MPIRE https://github.com/Slimmer-AI/mpire
I would not recommend Dask. We use it just for simple job scheduling (that is, none of its fancy data structures) and run into issues just getting the work done efficiently. This issue, for instance, keeps the cluster from actually being utilized fully: https://github.com/dask/distributed/issues/4501. I feel like I'm on crazy pills, because it seems pretty serious yet it's gotten no attention.
You can probably use https://github.com/rapidsai/cudf/tree/main/python/dask_cudf, a Dask wrapper around cuDF.