Top 8 Python distributed-training Projects

pytorch-image-models

Mentions: 35 | Stars: 29,659 | Activity: 9.4 | Language: Python

PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Project mention: FLaNK AI Weekly 18 March 2024 | dev.to | 2024-03-18

skypilot

Mentions: 33 | Stars: 5,602 | Activity: 9.8 | Language: Python

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

Project mention: Ask HN: Most efficient way to fine-tune an LLM in 2024? | news.ycombinator.com | 2024-04-04

FedML

Mentions: 6 | Stars: 4,052 | Activity: 9.9 | Language: Python

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://fedml.ai) is your generative AI platform at scale.

Project mention: [Experiment] The future of AI is open-source, and here is the plan | /r/samkoesnadi | 2023-06-05

FedML https://github.com/FedML-AI/FedML might already provide a lot of tools to do the job

alpa

Mentions: 4 | Stars: 2,983 | Activity: 5.1 | Language: Python

Training and serving large-scale neural networks with auto parallelization.

hivemind

Mentions: 40 | Stars: 1,833 | Activity: 5.9 | Language: Python

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

Project mention: You can now train a 70B language model at home | news.ycombinator.com | 2024-03-07

https://github.com/learning-at-home/hivemind is also relevant

adaptdl

Mentions: 4 | Stars: 395 | Activity: 0.0 | Language: Python

Resource-adaptive cluster scheduler for deep learning training.

HandyRL

Mentions: 1 | Stars: 282 | Activity: 3.9 | Language: Python

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

Fast-Kubeflow

Mentions: 7 | Stars: 69 | Activity: 3.6 | Language: Python

This repo covers Kubeflow Environment with LABs: Kubeflow GUI, Jupyter Notebooks on pods, Kubeflow Pipelines, Experiments, KALE, KATIB (AutoML: Hyperparameter Tuning), KFServe (Model Serving), Training Operators (Distributed Training), Projects, etc.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).