How do you deal with parallelising parts of an ML pipeline, especially in Python?

This page summarizes the projects mentioned and recommended in the original post on /r/mlops

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

    Multiprocessing works well, but you probably need an abstraction on top to make it work reliably. For starters, use a pool of processes, because creating new ones is expensive. You also need to ensure that errors in the sub-processes are displayed correctly in the main process; otherwise, it becomes frustrating. Sub-processes sometimes get stuck, so you have to monitor them too. I implemented something that takes care of all of that for a project I'm working on; it'll give you an idea of what it looks like (of course, you can use the framework as well, which lets you parallelize functions and notebooks).

  • debuglater

    Store Python traceback for later debugging. 🐛

    Finally, debugging. If you're running code in sub-processes, debugging becomes a real pain because, out of the box, you won't be able to start a debugger in them. Furthermore, more than one sub-process may fail. One solution is to dump the traceback when any sub-process fails, so you can start a debugging session afterward; look at this project for an example.

  • mpire

    A Python package for easy multiprocessing, but faster than multiprocessing

    https://github.com/Slimmer-AI/mpire is a nice lib, with better performance than multiprocessing.

  • orchest

    Build data pipelines, the easy way 🛠️

    We automatically provide container level parallelism in Orchest: https://github.com/orchest/orchest
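
Two pieces of advice from the comments above (reuse a pool of processes and surface sub-process errors in the main process, then dump the traceback of each failed task for later inspection) can be sketched with the standard library alone. This is a minimal illustration, not how ploomber or debuglater implement it: `square`, `_guarded`, `run_parallel`, and the `task-N.traceback.txt` file naming are hypothetical names chosen for the example, and the dump is plain text rather than a serialized traceback you can step through. Monitoring for stuck workers is not covered.

```python
import concurrent.futures as cf
import tempfile
import traceback
from pathlib import Path


def square(x):
    """Hypothetical pipeline step; fails on one input to exercise error handling."""
    if x == 3:
        raise ValueError(f"bad input: {x}")
    return x * x


def _guarded(func, index, arg, dump_dir):
    """Runs in the worker: call func(arg); on failure, write the formatted
    traceback to a text file, then re-raise so the parent sees the error."""
    try:
        return func(arg)
    except Exception:
        dump = Path(dump_dir) / f"task-{index}.traceback.txt"
        dump.write_text(traceback.format_exc())
        raise


def run_parallel(func, args, dump_dir, max_workers=2, mp_context=None):
    """Map func over args in a reusable pool of worker processes.

    Returns (results, errors), both keyed by the position of the input.
    """
    Path(dump_dir).mkdir(parents=True, exist_ok=True)
    results, errors = {}, {}
    with cf.ProcessPoolExecutor(max_workers=max_workers, mp_context=mp_context) as pool:
        futures = {
            pool.submit(_guarded, func, i, arg, dump_dir): i
            for i, arg in enumerate(args)
        }
        for future in cf.as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as exc:  # raised in the worker, re-raised here
                errors[i] = exc
    return results, errors


if __name__ == "__main__":
    dump_dir = tempfile.mkdtemp(prefix="tracebacks-")
    results, errors = run_parallel(square, range(5), dump_dir)
    print("results:", sorted(results.items()))
    for i, exc in sorted(errors.items()):
        print(f"task {i} failed: {exc!r} (traceback saved under {dump_dir})")
```

Collecting errors per task, instead of stopping at the first one, matters because several sub-processes can fail in the same run. The `if __name__ == "__main__":` guard is needed on platforms where worker processes are started by re-importing the main module.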

NOTE: The number of mentions on this list counts mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
