How do you deal with parallelising parts of an ML pipeline especially on Python?

This page summarizes the projects mentioned and recommended in the original post on /r/mlops

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

  • Multiprocessing works well but you probably need an abstraction on top to make it work reliably. For starters, it's best to use a pool of processes because creating new ones is expensive, you also need to ensure that errors in the sub-processes are correctly displayed in the main process, otherwise, it becomes frustrating. Also, sometimes sub-processings might get stuck so you have to monitor them. I implemented something that takes care of all of that for a project I'm working on, it'll give you an idea of what it looks like (of course, you can use the framework as well, which lets you parallelize functions and notebooks).

  • debuglater

    Store Python traceback for later debugging. 🐛

  • Finally, debugging. If you're running code in sub-processes; debugging becomes a real pain because out of the box, you won't be able to start a debugger in the sub-processes. Furthermore, there's a chance that more than one fails. One solution is to dump the traceback when any sub-process fails, so you can start a debugging sesstion afterward; look at this project for an example.

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • mpire

    A Python package for easy multiprocessing, but faster than multiprocessing

  • https://github.com/Slimmer-AI/mpire is a nice lib, with better performance than multiprocessing.

  • orchest

    Build data pipelines, the easy way 🛠️

  • We automatically provide container level parallelism in Orchest: https://github.com/orchest/orchest

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts