| | spotty | sagemaker-training-toolkit |
|---|---|---|
| Mentions | 3 | 1 |
| Stars | 491 | 468 |
| Growth | 0.2% | 2.4% |
| Activity | 0.0 | 6.3 |
| Latest commit | 7 months ago | about 1 month ago |
| Language | Python | Python |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub.
Growth - month-over-month growth in stars.
Activity - a relative number indicating how actively a project is being developed; recent commits have higher weight than older ones. For example, an activity of 9.0 indicates that a project is among the top 10% of the most actively developed projects that we are tracking.
spotty
[D] Interactive Compute Platform Recommendations for ML Research
Use spotty (https://github.com/spotty-cloud/spotty) to launch AWS Spot instances, then use SSH port forwarding into them to run GPU experiments from a Jupyter notebook.
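A rough sketch of that workflow, assuming a spotty.yaml in the project root (the instance parameters, key path, and port below are illustrative placeholders, not spotty's verbatim schema; check the spotty docs for exact config keys):

```sh
# spotty.yaml (illustrative excerpt):
#   instances:
#     - name: gpu-1
#       provider: aws
#       parameters:
#         instanceType: p3.2xlarge
#         spotInstance: true

spotty start          # launch the spot instance defined in spotty.yaml
spotty sh             # open a shell inside the project container

# On the instance: start Jupyter without a browser
jupyter notebook --no-browser --port 8888

# On your laptop: forward the notebook port over SSH
# (hypothetical key path and address; substitute your own)
ssh -i ~/.ssh/my-key.pem -L 8888:localhost:8888 ubuntu@<instance-address>
# then browse to http://localhost:8888
```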
[P] Spotml.io: Seamless ML training on AWS Spot instances, with Docker (training 3x cheaper)
SpotML is a command-line tool that automatically manages ML training on AWS spot instances. It handles spot interruptions by resuming training from the latest checkpoint. See the documentation link to try it out; feedback from early testers is welcome. You would be an ideal candidate if you have a side project that you're spending your own money to train. Acknowledgement: SpotML is built on top of the existing open-source library Spotty.
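The resume-from-latest-checkpoint mechanism that SpotML automates looks roughly like this in plain PyTorch (a generic sketch of the pattern, not SpotML's actual code; the checkpoint path and model are placeholders):

```python
import os
import torch
import torch.nn as nn

CKPT = "checkpoints/latest.pt"  # hypothetical path; a tool like SpotML would persist this externally

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# On (re)start, resume from the latest checkpoint if one exists --
# this is what makes a spot interruption recoverable.
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... run one epoch of training here ...
    # Save a checkpoint every epoch so at most one epoch of work is lost.
    os.makedirs("checkpoints", exist_ok=True)
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT,
    )
```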
Show HN: Seamless ML training on AWS Spot instances
SpotML is built on top of the existing open-source library Spotty: https://github.com/spotty-cloud/spotty
sagemaker-training-toolkit
Distributed training with Horovod/MPI
I'm using sagemaker-training-toolkit to attempt hyperparameter optimization and trying to take advantage of all the cores on each machine using its MPI options (which, to my understanding, use Horovod over MPI). I'm fairly new to this space and can't find anything that describes in lay terms how training works in this distributed model. With AllReduce, how often does the reduce happen? I'm trying to figure out whether all training workers share one model, such that every worker trains on the "latest" version of it.
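In Horovod's data-parallel design, the allreduce averages gradients across workers on every optimizer step, so all workers apply identical updates and never diverge from a shared model. A minimal PyTorch + Horovod sketch of that loop (generic Horovod usage, not sagemaker-training-toolkit's internals; the model and data are placeholders):

```python
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()  # one process per core/GPU, launched via mpirun

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the optimizer: gradients are allreduced (averaged) across all
# workers once per training step, so the weights stay in sync.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start every worker from identical weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 10)   # each worker reads its own shard of the data
    y = torch.randn(32, 1)
    optimizer.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()           # allreduce is launched via gradient hooks here
    optimizer.step()          # and completes before the weights are updated
```

So with pure AllReduce training there is no parameter server and no stale replicas: every worker computes gradients on its own minibatch, the gradients are averaged once per step, and every worker applies the same averaged update to the same "latest" model.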
What are some alternatives?
ethereum-etl - Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
image-super-resolution - 🔎 Super-scale your images and run experiments with Residual Dense and Adversarial Networks.
sagemaker-run-notebook - Tools to run Jupyter notebooks as jobs in Amazon SageMaker - ad hoc, on a schedule, or in response to events
jina - ☁️ Build multimodal AI applications with cloud-native stack
sagemaker-tensorflow-training-toolkit - Toolkit for running TensorFlow training scripts on SageMaker. Dockerfiles used for building SageMaker TensorFlow Containers are at https://github.com/aws/deep-learning-containers.
Activeloop Hub - Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data real-time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake]
micro-service-email - Deploy a self-hosted gmail microservice in minutes
torchlambda - Lightweight tool to deploy PyTorch models to AWS Lambda
data-science-ipython-notebooks - Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
sagemaker-distribution - A set of Docker images that include popular frameworks for machine learning, data science and visualization.