Improve Jupyter Notebook Reruns by Caching Cells

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • mandala

    A powerful and easy to use Python framework for experiment tracking and incremental computing

  • This is neat and self-contained! But as someone running experiments with a high degree of interactivity, I often have an orthogonal requirement: add more computations to the same cell without recomputing previous computations done in the cell (or in other cells).

    For a concrete example, often in an ML project you want to study how several quantities vary across several parameters. A straightforward workflow for this is: write some nested loops, collect results in python dictionaries, finally put everything together in a dataframe and compare (by plotting or otherwise).

    However, after looking at the results, maybe you spot some trend and wonder if it will continue if you tweak one of the parameters by using a new value for it; of course, you also want to look at the previous values and bring everything together in the same plot(s). You now have a problem: either re-run the cell (thus losing previous work, which is annoying even if you have to wait 1 minute - you know it's a wasted minute!), or write the new computation in a new cell, possibly with a lot of redundancy (which over time makes the notebook hard to navigate and keep consistent).

    So, this and other considerations eventually convinced me that the function is more natural than the cell as an interface/boundary at which caching should be implemented, at least for my use cases (coming from ML research). I wrote a framework based on this idea, with lots of other features (some quite experimental/unusual) to turn this into a feasible experiment management tool - check it out at https://github.com/amakelov/mandala

    P.S.: I notice you use `pickle` for the hashing - `joblib.dump` is faster with objects containing numpy arrays, which covers a lot of useful ML things

  • xetcache

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • common

    Location for shared common files in github.com/containers repos.

  • Dockerfile and Containerfile also cache outputs as layers.

    `docker build --layers` is the default: https://docs.podman.io/en/latest/markdown/podman-build.1.htm...

    container/common//docs/Containerfile.5.md: https://github.com/containers/common/blob/main/docs/Containe...

  • clerk

    ⚡️ Moldable Live Programming for Clojure

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts