[D] How to be more productive while doing Deep Learning experiments?

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • aim

    Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.

  • Log everything, literally everything: hyperparameters, command-line arguments, environment variables, outputs, checkpoints, resource usage, etc. Decent high-level ML frameworks provide this out of the box. Configure a callback on your trainer to send a notification through Slack. To track and compare your experiments, use tools beyond a plain TensorBoard; Aim is a fantastic tool for getting insights from hundreds of experiments.
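    The "log everything" idea can be sketched with the standard library alone; a real tracker like Aim records most of this automatically, and the `start_run` helper and hyperparameter values below are made up for illustration:

    ```python
    # Minimal stdlib sketch of "log everything": snapshot hyperparameters,
    # command-line arguments, and (GPU-related) environment variables into
    # a per-run directory. A tracker like Aim does this for you.
    import json
    import os
    import sys
    import time
    from pathlib import Path

    def start_run(hparams, log_dir="runs"):
        """Create a run directory and persist the run's metadata."""
        run_id = time.strftime("%Y%m%d-%H%M%S")
        run_dir = Path(log_dir) / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        record = {
            "hparams": hparams,
            "argv": sys.argv,
            "env": {k: v for k, v in os.environ.items() if k.startswith("CUDA")},
        }
        (run_dir / "meta.json").write_text(json.dumps(record, indent=2))
        return run_dir

    run_dir = start_run({"lr": 3e-4, "batch_size": 32})
    print(run_dir / "meta.json")
    ```

    Checkpoints, metrics, and resource usage would be appended to the same directory, so every run stays self-describing.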

  • coddx-alpha

    Todo Kanban Board manages tasks and saves them as TODO.md — a simple plain-text file.

  • Yes, for deciding the order of experiments I also like a Kanban board, as the other commenter suggested. There is a VSCode plugin that displays the contents of a TODO.md as a Kanban board: https://github.com/coddx-hq/coddx-alpha
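    Because the board is just a plain-text TODO.md, it is easy to script against. A hypothetical sketch (the exact file layout is an assumption — here columns are `### Heading` sections with `- item` lines):

    ```python
    # Parse a plain-text TODO.md Kanban file into named columns.
    # The "### Column" / "- task" layout is assumed for illustration.
    def parse_board(text):
        columns, current = {}, None
        for line in text.splitlines():
            if line.startswith("### "):
                current = line[4:].strip()
                columns[current] = []
            elif line.startswith("- ") and current is not None:
                columns[current].append(line[2:].strip())
        return columns

    board = parse_board("""### Todo
    - run lr sweep
    ### Done
    - baseline model
    """)
    print(board["Todo"])  # ['run lr sweep']
    ```

    Keeping the queue of planned experiments in version control alongside the code means the board's history doubles as a lab notebook.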

  • guildai

    Experiment tracking, ML developer tools

  • There are a number of experiment tracking systems out there: mlflow, wandb, Guild AI, etc. (disclaimer: I developed Guild). I would look at adopting one of those. While you could roll your own experiment tracking tool, there's just no point, IMO.

  • detectron2

    Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.

  • http://karpathy.github.io/2019/04/25/recipe/ I sense that your experiments are not very organised. I would recommend a configuration approach, where each experiment is described by a config such as https://github.com/facebookresearch/detectron2/blob/master/detectron2/config/config.py (see https://github.com/facebookresearch/detectron2/tree/master/configs for examples of usage). Most experiments should only require changing parameters in the main config. For experiments that require code changes, try them on a git branch, and if they are successful, implement them as config keys.
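    The config-first workflow can be sketched with the standard library, in the spirit of detectron2's config system (the `BASE` dictionary and override keys below are invented for illustration):

    ```python
    # Sketch of a config-first workflow: every experiment is the base
    # config plus a small dictionary of overrides, so most experiments
    # require no code changes at all.
    from copy import deepcopy

    BASE = {
        "model": {"depth": 50},
        "solver": {"lr": 0.01, "max_iter": 90_000},
    }

    def merge(base, overrides):
        """Return base with nested overrides applied (experiment = base + diff)."""
        out = deepcopy(base)
        for key, value in overrides.items():
            if isinstance(value, dict) and isinstance(out.get(key), dict):
                out[key] = merge(out[key], value)
            else:
                out[key] = value
        return out

    # One experiment is just a tiny, reviewable diff against the base.
    exp = merge(BASE, {"solver": {"lr": 0.02}})
    print(exp["solver"])  # {'lr': 0.02, 'max_iter': 90000}
    ```

    Because each experiment is a small diff against a shared base, comparing two runs reduces to diffing two config files.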

  • Sacred

    Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA.

  • For 1, set up an experiment tracking framework. I found Sacred to be helpful: https://github.com/IDSIA/sacred.

  • metaflow

    :rocket: Build and manage real-life ML, AI, and data science projects with ease!

  • For building experiments as a DAG, I suggest Metaflow from Netflix. I like the ability to resume a run if I make a mistake. Make sure you tag your runs so you can always filter out runs that had a flaw in them.
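    The resume idea can be sketched without Metaflow: each step persists its output, and a rerun skips any step whose artifact already exists. The step names and artifact directory below are illustrative, not Metaflow's API:

    ```python
    # Stdlib sketch of resumable DAG steps: a completed step's output is
    # cached on disk, so rerunning the pipeline skips finished work.
    import json
    from pathlib import Path

    def run_step(name, fn, artifact_dir=Path("artifacts")):
        artifact_dir.mkdir(exist_ok=True)
        path = artifact_dir / f"{name}.json"
        if path.exists():              # resume: reuse the cached result
            return json.loads(path.read_text())
        result = fn()                  # first run: compute and persist
        path.write_text(json.dumps(result))
        return result

    data = run_step("prepare", lambda: {"rows": 1000})
    model = run_step("train", lambda: {"acc": 0.91, "rows": data["rows"]})
    print(model["acc"])  # 0.91
    ```

    A real workflow engine adds versioned runs, tags, and per-step isolation on top of this caching idea.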

  • nvidia-gpu-scheduler

    NVIDIA GPU compute task scheduling utility

  • Sure. No, a simple bash script is not enough. In my case, we have several machines shared in the department, some with GPUs, some without. What I have is a Python script that takes a list of jobs and schedules each one on the first available machine (according to memory/CPU/GPU availability). Unfortunately, what I have is tightly entangled with our computing platform (Docker-based, with a shared filesystem) and not easy to release as a standalone project (that's why I said "know your infrastructure"). The most similar thing I could find online is this project. There are also HPC tools that could be useful (e.g. Slurm), but that's way too much for what we need.
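    The core of such a scheduler can be sketched in a few lines; the machine specs and job list below are made up, and a real version would query actual resource availability (e.g. via nvidia-smi over SSH) instead of a static table:

    ```python
    # Toy greedy scheduler: assign each queued job to the first machine
    # with enough free GPUs; jobs that don't fit wait in a pending list.
    def schedule(jobs, machines):
        """jobs: [(name, gpus_needed)]; machines: {host: free_gpu_count}."""
        assignments = {}
        free = dict(machines)
        pending = []
        for name, gpus in jobs:
            host = next((m for m, g in free.items() if g >= gpus), None)
            if host is None:
                pending.append(name)   # no slot yet; retry when one frees up
            else:
                free[host] -= gpus
                assignments[name] = host
        return assignments, pending

    assignments, pending = schedule(
        [("sweep-a", 1), ("sweep-b", 2)],
        {"node1": 2, "node2": 2},
    )
    print(assignments)  # {'sweep-a': 'node1', 'sweep-b': 'node2'}
    ```

    The hard part in practice is not this loop but the infrastructure around it: discovering live resource usage, launching jobs remotely, and handling failures — which is why the commenter's version is entangled with their platform.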

  • tmux

    tmux source code

  • Try to avoid Jupyter notebooks; use them only for very preliminary experiments to save time. For the long run, decent IDEs (VSCode, PyCharm) can easily keep you away from stupid bugs. PyCharm has stunning Python language support, while the open-source VSCode (Insiders channel) makes it very easy to code, run, and debug remotely. Use Mosh or Eternal Terminal to survive disconnections even when your computer is asleep or offline, and use tmux to keep tasks running while you're away. You can even use your smartphone to reattach to the same tmux session and monitor the training.

  • pytorch-lightning

    Discontinued Build high-performance AI models with PyTorch Lightning (organized PyTorch). Deploy models with Lightning Apps (organized Python to build end-to-end ML systems). [Moved to: https://github.com/Lightning-AI/lightning] (by PyTorchLightning)

  • First of all, use high-level ML frameworks (AllenNLP, PyTorch Lightning). There is no need to write boilerplate code and implement standard ML approaches from scratch. Here are some suggestions (though more NLP-focused) that I feel improved my research coding experience a lot.

NOTE: The number of mentions on this list counts mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
