Data Science Workflows — Notebook to Production

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  • dvc

    🦉 ML Experiments and Data Management with Git

  • At DagsHub, we’re integrated with DVC, which I love using. First and foremost, it’s open source. It provides pipeline capabilities and supports many cloud providers for remote storage. DVC also acts as an extension to Git, so you can keep using the standard Git flow in your work. If you don’t want to juggle both tools, I recommend FDS, an open-source tool that makes version control for machine learning fast and easy. It combines Git and DVC under one roof and takes care of code, data, and model versioning. (Bias alert: DagsHub developed FDS.)
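To make the "extension to Git" idea concrete, here is a minimal sketch of the general pattern: the large file goes into a content-addressed cache, and only a small pointer file is committed to Git. This is an illustration, not DVC's actual implementation. The names `track_file` and `restore_file`, the `.ptr.json` suffix, the cache layout, and the choice of SHA-256 are all assumptions made for this sketch.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track_file(data_path: Path, cache_dir: Path) -> Path:
    """Copy a data file into a content-addressed cache and write a pointer file.

    Hypothetical helper: the pointer is small enough to commit to Git,
    while the data itself lives in the cache (or a remote copy of it).
    """
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_path, cached)
    pointer = data_path.with_name(data_path.name + ".ptr.json")
    meta = {
        "sha256": digest,
        "size": data_path.stat().st_size,
        "path": data_path.name,
    }
    pointer.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return pointer


def restore_file(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file next to its pointer from the cache."""
    meta = json.loads(pointer.read_text(encoding="utf-8"))
    digest = meta["sha256"]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / digest[:2] / digest[2:], target)
    return target
```

Checking out an older commit then restores the older pointer, and `restore_file` can bring back the matching data. The real tools add remote storage, pipelines, and much more on top of this basic idea.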

  • MLflow

    Open source platform for the machine learning lifecycle

  • But as you can imagine, tracking each experiment with Git can become a hassle, so we’d like to automate the logging of each run. As with large-file versioning, many tools for experiment logging have emerged in recent years, such as W&B, MLflow, and TensorBoard. In this case, I believe it doesn’t matter which hammer you choose, as long as it drives the nail through.
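As a rough picture of what "automating the logging of each run" means, the sketch below appends one JSON record per run to a log file and looks up the best run afterwards. It uses only the standard library and does not reproduce the API of W&B, MLflow, or TensorBoard. The function names `log_run` and `best_run` and the record layout are hypothetical.

```python
import json
import time
import uuid
from pathlib import Path


def log_run(log_path: Path, params: dict, metrics: dict) -> str:
    """Append one experiment record (hypothetical layout) as a JSON line."""
    run_id = uuid.uuid4().hex
    record = {
        "run_id": run_id,
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return run_id


def best_run(log_path: Path, metric: str) -> dict:
    """Return the logged run with the highest value for the given metric."""
    lines = log_path.read_text(encoding="utf-8").splitlines()
    runs = [json.loads(line) for line in lines if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])
```

Calling `log_run` at the end of every training script gives a searchable history without a manual commit per experiment. Dedicated tracking tools add a UI, artifact storage, and run comparison on top of the same idea.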

  • fds

    Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

  • lakeFS

    lakeFS - Data version control for your data lake | Git for data

  • Git was designed for managing software development projects and for versioning text and code files, so it doesn’t handle large files well. Git LFS (Large File Storage) was released to address large-file versioning; it handles large files better than plain Git but struggles at scale. In addition, neither Git nor Git LFS is optimized for data science workflows. To overcome this challenge, many powerful tools have emerged in recent years, such as DVC, Delta Lake, and lakeFS.
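For context on how Git LFS keeps large files out of the repository: Git stores a small text pointer in place of the file, and the content itself lives on an LFS server. The sketch below builds such a pointer in the version 1 format (a `version` line, a SHA-256 `oid`, and a `size` in bytes). The helper name `lfs_pointer` is hypothetical, and the sketch only formats the pointer; it does not talk to Git or to an LFS server.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS v1 pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )
```

The pointer stays a few lines long however large the file is, which keeps clones small. Every version of the data still has to be uploaded to and fetched from the LFS store, which is part of why the approach can strain on very large datasets.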

  • git-lfs

    Git extension for versioning large files

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
