-
Log everything, literally everything: hyperparameters, command-line arguments, environment variables, outputs, checkpoints, resource usage, etc. Decent high-level ML frameworks provide this out of the box. Configure a callback in your trainer to send a notification through Slack. To track and compare your experiments, use tools beyond plain TensorBoard; Aim is a fantastic tool for getting insights from hundreds of experiments.
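As a concrete illustration of the Slack-notification idea, here is a minimal sketch assuming a PyTorch Lightning trainer; the `SLACK_WEBHOOK_URL` environment variable and the message format are placeholders, not part of any library.

```python
import os
import requests
import pytorch_lightning as pl


class SlackNotifier(pl.Callback):
    """Post a message to a Slack incoming webhook when training finishes.

    SLACK_WEBHOOK_URL is a hypothetical environment variable; point it at the
    incoming-webhook URL of your Slack workspace.
    """

    def __init__(self):
        self.webhook_url = os.environ.get("SLACK_WEBHOOK_URL")

    def on_fit_end(self, trainer, pl_module):
        if not self.webhook_url:
            return  # no webhook configured, stay silent
        # Collect whatever metrics the trainer has logged so far.
        metrics = {k: float(v) for k, v in trainer.callback_metrics.items()}
        requests.post(self.webhook_url, json={"text": f"Training finished: {metrics}"})


# Usage: trainer = pl.Trainer(callbacks=[SlackNotifier()], max_epochs=10)
```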
-
Yes, for deciding the order of experiments I also like a Kanban board, as the other commenter suggested. There is a VS Code plugin that displays the contents of a TODO.md as a Kanban board: https://github.com/coddx-hq/coddx-alpha
-
There are a number of experiment tracking systems out there: mlflow, wandb, Guild AI, etc. (disclaimer: I developed Guild). I would look at adopting one of those. While you can roll your own experiment tracking tool, there's just no point, IMO.
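For reference, here is a minimal sketch of what adopting one of these looks like, using mlflow; the experiment name, parameter values, and the loss computation are made up for illustration.

```python
import mlflow

mlflow.set_experiment("baseline-vs-augmented")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline"):
    # Log hyperparameters once at the start of the run.
    mlflow.log_param("lr", 1e-3)
    mlflow.log_param("batch_size", 32)

    for epoch in range(10):
        train_loss = 1.0 / (epoch + 1)  # stand-in for a real training loop
        mlflow.log_metric("train_loss", train_loss, step=epoch)

    # Attach arbitrary files (checkpoints, plots, configs) to the run, e.g.:
    # mlflow.log_artifact("config.yaml")
```

Runs then show up side by side in the mlflow UI, so comparing hundreds of experiments stops being a manual spreadsheet exercise.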
-
detectron2
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
See http://karpathy.github.io/2019/04/25/recipe/. I sense that your experiments are not very organised. I would recommend a configuration-driven approach, where each experiment is described by a config such as https://github.com/facebookresearch/detectron2/blob/master/detectron2/config/config.py; see https://github.com/facebookresearch/detectron2/tree/master/configs for examples of usage. Most experiments should only require changing parameters in the main config. For experiments that require code changes, use git branches to try them out and, if they are successful, implement them as config keys.
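A sketch of that config-driven pattern, assuming detectron2-style yacs configs; the YAML file path and the override values are placeholders.

```python
from detectron2.config import get_cfg


def setup(config_file, overrides):
    """Build one immutable config per experiment: library defaults first,
    then a per-experiment YAML file, then command-line style overrides."""
    cfg = get_cfg()                    # detectron2's default config
    cfg.merge_from_file(config_file)   # per-experiment YAML, e.g. "configs/my_exp.yaml"
    cfg.merge_from_list(overrides)     # ad-hoc overrides, e.g. ["SOLVER.BASE_LR", 0.00025]
    cfg.freeze()                       # make the config read-only for reproducibility
    return cfg


# Usage (the path and override are placeholders for your own experiment):
# cfg = setup("configs/my_exp.yaml", ["SOLVER.BASE_LR", 0.00025])
# print(cfg.SOLVER.BASE_LR)
```

Because the frozen config fully describes the run, diffing two experiments reduces to diffing two YAML files.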
-
Sacred
Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA.
For 1, set up an experiment tracking framework. I found Sacred to be helpful: https://github.com/IDSIA/sacred.
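A minimal Sacred sketch; the experiment name, config values, and the runs/ directory are illustrative, not prescribed by the library.

```python
from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment("my_experiment")                  # placeholder name
ex.observers.append(FileStorageObserver("runs"))  # store config, logs, results under runs/


@ex.config
def config():
    lr = 1e-3      # every variable defined here becomes a tracked hyperparameter
    epochs = 10


@ex.automain
def main(lr, epochs):
    # Sacred injects the config values; the "training loop" here is a stand-in.
    loss = None
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)
    return loss    # the return value is recorded as the run's result
```

Running `python my_experiment.py with lr=0.01` overrides the config from the command line, and every run (config, stdout, result) is captured under runs/.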
-
For building experiments as a DAG, I suggest Metaflow from Netflix. I like the ability to resume if I make a mistake. Make sure you tag your runs so you can always filter runs that had a flaw in them.
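A small sketch of a Metaflow DAG; the step bodies and artifact values are placeholders.

```python
from metaflow import FlowSpec, step


class TrainFlow(FlowSpec):
    """A two-step training DAG: set up hyperparameters, then train."""

    @step
    def start(self):
        self.lr = 1e-3           # artifacts assigned to self are persisted per step
        self.next(self.train)

    @step
    def train(self):
        self.loss = 0.1          # stand-in for the real training loop
        self.next(self.end)

    @step
    def end(self):
        print(f"final loss: {self.loss}")


if __name__ == "__main__":
    TrainFlow()
```

`python train_flow.py run --tag baseline` runs the DAG with a tag you can filter on later, and `python train_flow.py resume` restarts from the step that failed instead of rerunning everything.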
-
Sure. No, a simple bash script is not enough. In my case, we have several machines shared in the department, some with GPUs, some without. What I have is a Python script that takes a list of jobs and schedules each one on the first available machine (according to memory/CPU/GPU availability). Unfortunately, what I have is really entangled with our computing platform (Docker-based with a shared filesystem) and not easy to turn into a standalone project (that's why I said "know your infrastructure"). The most similar thing that I could find online is this project. I believe there are also some HPC tools that could be useful (e.g. Slurm), but that's way too much for what we need.
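A rough, hypothetical sketch of the idea (not the commenter's actual system): poll a list of hosts over SSH, pick the first one with enough free GPU memory, and launch the job there. The host names, memory threshold, job commands, and log naming are all made up for illustration.

```python
import subprocess
import time

HOSTS = ["gpu01", "gpu02", "cpu01"]   # hypothetical machine names
MIN_FREE_MB = 8000                    # arbitrary free-GPU-memory threshold


def free_gpu_memory_mb(host):
    """Return the largest free memory (MB) across the GPUs of `host`, or 0."""
    try:
        out = subprocess.run(
            ["ssh", host, "nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=30, check=True,
        ).stdout
        return max(int(line) for line in out.splitlines() if line.strip())
    except Exception:
        return 0  # host unreachable or has no GPU


def dispatch(jobs):
    """Launch each job command on the first host with enough free GPU memory."""
    for i, cmd in enumerate(jobs):
        while True:
            host = next((h for h in HOSTS if free_gpu_memory_mb(h) >= MIN_FREE_MB), None)
            if host is not None:
                # nohup keeps the job alive after the ssh session ends
                subprocess.Popen(["ssh", host, f"nohup {cmd} > job_{i}.log 2>&1 &"])
                break
            time.sleep(60)  # nothing free yet, poll again later


# dispatch(["python train.py --lr 1e-3", "python train.py --lr 1e-4"])
```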
-
Try to avoid Jupyter notebooks; use them only for very preliminary experiments to save time. For the long run, decent IDEs (VS Code, PyCharm) can easily help you stay away from stupid bugs. PyCharm has stunning Python language support, while the open-source VS Code (Insiders channel) makes it very easy to code, run, and debug remotely. Use Mosh or Eternal Terminal to prevent disconnection even if your computer is asleep or offline, and use tmux to keep tasks running while you're away. You can use your smartphone to stay connected to the same tmux session and monitor training.
-
pytorch-lightning
Discontinued. Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems). [Moved to: https://github.com/Lightning-AI/lightning] (by PyTorchLightning)
Here are some suggestions (though more NLP-focused) that I feel improved my research coding experience a lot. First of all, use high-level ML frameworks (AllenNLP, PyTorch Lightning); there is no need to write boilerplate code and implement standard ML approaches from scratch.
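To illustrate how little boilerplate is left with such a framework, here is a minimal PyTorch Lightning sketch trained on random tensors; the model and data are placeholders.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(10, 1)   # placeholder model

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.model(x), y)
        self.log("train_loss", loss)          # sent to whatever logger the Trainer uses
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Random data stands in for a real dataset; Lightning handles the training loop,
# checkpointing, logging, and device placement.
data = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=16)
trainer = pl.Trainer(max_epochs=2)
trainer.fit(LitRegressor(), data)
```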