Data Science Workflows — Notebook to Production

dvc

Mentions: 109 | Stars: 13,139 | Activity: 9.6 | Language: Python

🦉 ML Experiments and Data Management with Git

At DagsHub, we’re integrated with DVC, which I love using. First and foremost, it’s open source. It provides pipeline capabilities and supports many cloud providers for remote storage. DVC also acts as an extension to Git, so you can keep using the standard Git flow in your work. If you’d rather not juggle two tools, I recommend FDS, an open-source tool that makes version control for machine learning fast and easy. It combines Git and DVC under one roof and takes care of code, data, and model versioning. (Bias alert: DagsHub developed FDS.)

MLflow

Mentions: 56 | Stars: 17,335 | Activity: 9.9 | Language: Python

Open source platform for the machine learning lifecycle

But as you can imagine, tracking each experiment with Git can become a hassle, so we’d like to automate the logging of each run. As with large-file versioning, many tools for experiment logging have emerged in recent years, such as W&B, MLflow, TensorBoard, and the list goes on. In this case, I believe it doesn’t matter which hammer you choose, as long as you drive the nail through.
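
The automated run logging described above can be sketched with the Python standard library alone. This is a minimal illustration of the idea, not the API of W&B, MLflow, or TensorBoard; the helper names `log_run` and `best_run`, the file name `runs.jsonl`, and the parameter and metric names are all hypothetical.

```python
import json
import time
import uuid
from pathlib import Path


def log_run(params, metrics, log_path="runs.jsonl"):
    """Append one experiment run (parameters and metrics) to a JSON Lines file.

    Hypothetical helper for illustration; real trackers offer far more.
    """
    record = {
        "run_id": uuid.uuid4().hex,  # unique identifier for this run
        "timestamp": time.time(),    # when the run was logged
        "params": dict(params),
        "metrics": dict(metrics),
    }
    # Append mode keeps earlier runs, so the file becomes a run history.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record["run_id"]


def best_run(metric, log_path="runs.jsonl"):
    """Return the logged run with the highest value of `metric`."""
    with Path(log_path).open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda r: r["metrics"][metric])
```

Calling `log_run({"lr": 0.01}, {"accuracy": 0.9})` at the end of each training script records the run without a manual Git commit per experiment, and `best_run("accuracy")` picks out the strongest run later. A dedicated tracker adds a UI, artifact storage, and comparison views on top of the same basic record.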

fds

Mentions: 3 | Stars: 382 | Activity: 3.7 | Language: Python

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

lakeFS

Mentions: 48 | Stars: 4,081 | Activity: 9.8 | Language: Go

lakeFS - Data version control for your data lake | Git for data

Git was designed for managing software development projects and for versioning text and code files, so it doesn’t handle large files well. Git LFS (Large File Storage), a Git extension, was released to address large-file versioning; it copes better than plain Git but struggles at scale. Neither Git nor Git LFS is optimized for data science workflows. To overcome this challenge, many powerful tools have emerged in recent years, such as DVC, Delta Lake, lakeFS, and more.
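
The idea shared by Git LFS and DVC, keeping a small pointer under Git while the large file itself lives elsewhere, can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual pointer format or storage layout of either tool; `store_large_file`, `restore_large_file`, the `.ptr` suffix, and the cache directory are hypothetical names.

```python
import hashlib
import json
import shutil
from pathlib import Path


def store_large_file(data_path, cache_dir):
    """Copy a file into a content-addressed cache and write a small pointer file.

    Only the pointer (hash and size) would be committed to Git.
    """
    data_path = Path(data_path)
    cache_dir = Path(cache_dir)
    # Reads the whole file into memory, which is acceptable for a sketch.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_path, cache_dir / digest)  # cache entry is named by its hash
    pointer = {"sha256": digest, "size": data_path.stat().st_size}
    pointer_path = data_path.with_name(data_path.name + ".ptr")
    pointer_path.write_text(json.dumps(pointer), encoding="utf-8")
    return pointer_path


def restore_large_file(pointer_path, cache_dir, out_path):
    """Recreate the original file from its pointer and the cache."""
    pointer = json.loads(Path(pointer_path).read_text(encoding="utf-8"))
    shutil.copyfile(Path(cache_dir) / pointer["sha256"], out_path)
    return Path(out_path)
```

Because the pointer is a few dozen bytes of text, Git diffs and history stay small no matter how large the dataset grows; changing the data only changes the hash recorded in the pointer.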

git-lfs

Mentions: 159 | Stars: 12,492 | Activity: 9.0 | Language: Go

Git extension for versioning large files

delta

Mentions: 69 | Stars: 6,919 | Activity: 9.8 | Language: Scala

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Alternative for git with big file
2 projects | /r/datascience | 13 Jul 2022

GitHub for code but where/how do you organize your datafiles?
2 projects | /r/github | 29 Mar 2022

Ask HN: Most efficient way to fine-tune an LLM in 2024?
6 projects | news.ycombinator.com | 4 Apr 2024

Git Version Controlled Datasets in S3
1 project | news.ycombinator.com | 25 Oct 2023

Frouros: A Python library for drift detection in ML systems
1 project | news.ycombinator.com | 8 Jul 2023

This page summarizes the projects mentioned and recommended in the original post on dev.to
Git, Machine Learning, Go, Data Science, AI
Post date: 8 Feb 2022