How do you deal with parallelising parts of an ML pipeline especially on Python?

This page summarizes the projects mentioned and recommended in the original post on /r/mlops

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

    Multiprocessing works well, but you probably need an abstraction on top to make it work reliably. For starters, it's best to use a pool of processes, because creating new ones is expensive. You also need to ensure that errors in the sub-processes are correctly surfaced in the main process; otherwise, debugging becomes frustrating. Also, sub-processes can sometimes get stuck, so you have to monitor them. I implemented something that takes care of all of that for a project I'm working on; it'll give you an idea of what it looks like (of course, you can use the framework as well, which lets you parallelize functions and notebooks).
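    The three concerns above (a reusable pool, error propagation, and stuck-worker monitoring) can be sketched with the standard library alone; the function names here (`task`, `run_pool`) are illustrative, not Ploomber's API:

    ```python
    from concurrent.futures import ProcessPoolExecutor, as_completed

    def task(x):
        # Simulated unit of work; raises for one input to show error propagation.
        if x == 3:
            raise ValueError(f"bad input: {x}")
        return x * x

    def run_pool(inputs, timeout=30):
        """Run tasks in a reusable pool, surfacing worker errors in the parent."""
        results, errors = {}, {}
        with ProcessPoolExecutor(max_workers=4) as pool:
            futures = {pool.submit(task, x): x for x in inputs}
            # as_completed(timeout=...) raises TimeoutError if workers get stuck,
            # which is a crude but effective way to monitor them.
            for fut in as_completed(futures, timeout=timeout):
                x = futures[fut]
                try:
                    results[x] = fut.result()
                except Exception as exc:  # the worker's traceback is re-raised here
                    errors[x] = exc
        return results, errors

    if __name__ == "__main__":
        results, errors = run_pool(range(5))
        print(results, errors)
    ```

    `ProcessPoolExecutor` keeps workers alive across submissions, so you pay the process-startup cost once rather than per task.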

  • debuglater

    Store Python traceback for later debugging. 🐛

    Finally, debugging. If you're running code in sub-processes, debugging becomes a real pain because, out of the box, you won't be able to start a debugger in the sub-processes. Furthermore, there's a chance that more than one fails. One solution is to dump the traceback when any sub-process fails, so you can start a debugging session afterward; look at this project for an example.
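    The dump-the-traceback idea can be sketched with just the standard library (debuglater itself goes further and serializes the traceback object so you can `pdb.post_mortem()` it later); `run_and_dump` and the dump path are illustrative names, not debuglater's API:

    ```python
    import traceback

    def run_and_dump(fn, *args, dump_path="failure.tb"):
        """Run fn; on failure, write the full traceback to disk, then re-raise.

        In a sub-process you can't attach a debugger interactively, so
        persisting the traceback lets you inspect the failure afterward.
        """
        try:
            return fn(*args)
        except Exception:
            with open(dump_path, "w") as f:
                f.write(traceback.format_exc())
            raise
    ```

    Each worker writes to its own dump file, so even if several sub-processes fail you keep one traceback per failure.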

  • mpire

    A Python package for easy multiprocessing, but faster than multiprocessing

    A nice lib, with better performance than the standard library's multiprocessing.

  • orchest

    Build data pipelines, the easy way 🛠️

    We automatically provide container-level parallelism in Orchest.

