Launch HN: Ploomber (YC W22) – Quickly Deploy Data Pipelines from Jupyter/VSCode

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

  • Launch HN: Ploomber (YC W22) – Quickly Deploy Data Pipelines From Jupyter/VSCode

    https://github.com/ploomber/ploomber

    Hi HN, we’re Eduardo & Ido, the founders of Ploomber (https://ploomber.io). We’re building an open-source framework (https://github.com/ploomber/ploomber) that helps data scientists quickly deploy the code they develop in interactive environments (Jupyter/VSCode/PyCharm), eliminating the need for time-consuming manual porting to production platforms.

    Jupyter and other interactive environments are the go-to tools for most data scientists. However, many production data pipeline platforms (e.g. Airflow, Kubernetes) drag them into non-interactive development paradigms. Hence, when moving to production, the data scientist’s code must be ported from the interactive environment into a more traditional software environment (e.g. declaring workflows as Python classes). This creates friction since the code has to cross that gap every time the data scientist deploys their work. Data scientists often pair with software engineers on the conversion, but this is time-consuming and costly. It’s also frustrating because it’s just busy work.

    We encountered this problem while working in the data space. Eduardo was a data scientist at Fidelity for a few years; he deployed ML models and always found it annoying and wasteful to port code from his notebooks into a production framework like Airflow or Kubernetes. Ido worked as a consultant at AWS and consistently saw data science projects spend about 30% of their time converting a notebook prototype into a production pipeline.

    Interactive environments have historically been used for prototyping and are considered unsuitable for production; this is reasonable because, in our experience, most of the code developed interactively exists in a single file with little to no structure (e.g., a gigantic notebook). However, we believe it’s possible to bring software engineering best practices and apply them to the interactive development world so data scientists can produce maintainable projects to streamline deployment.

    Ploomber allows data scientists to quickly develop their code in modular pipelines rather than a giant single file. When developed this way, their code is suitable for deployment to production platforms; we currently support exporting to Kubernetes, AWS Batch, Airflow, Kubeflow, and SLURM with no code changes. Our integration with Jupyter/VSCode/PyCharm allows them to iteratively build these modular pipelines without moving away from the interactive environment. In addition, modularizing the work enables them to create more maintainable and testable projects. Our goal is ease of use, with minimal disturbance to the data scientist’s existing workflows.
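
    Ploomber pipelines are declared in a pipeline.yaml file that lists each task and its products. As a hedged sketch of what such a modular spec might look like (the task and file names below are hypothetical, not from the post):

    ```yaml
    # pipeline.yaml -- hypothetical three-task pipeline.
    # Each source is a script/notebook edited interactively in
    # Jupyter/VSCode/PyCharm; products are the task's outputs.
    tasks:
      - source: scripts/get_data.py
        product:
          nb: output/get_data.ipynb   # executed copy, useful for inspection
          data: output/raw.csv
      - source: scripts/clean.py
        product:
          nb: output/clean.ipynb
          data: output/clean.csv
      - source: scripts/train.py
        product:
          nb: output/train.ipynb
          model: output/model.pickle
    ```

    Ploomber builds the dependency graph from the upstream references each script declares, so the same spec can run locally or be exported to one of the supported production backends.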

    Users can install Ploomber with pip, open Jupyter/VSCode/PyCharm, and start building in minutes. We’ve made a significant effort to create a simple tool so people can get started quickly and learn the advanced features when they need them. Ploomber is available at https://github.com/ploomber/ploomber under the Apache 2.0 license. In addition, we are working on a cloud version to help enterprises operationalize models. We’re still working on the pricing details, but if you’d like us to let you know when we open the private beta, you can sign up here: https://ploomber.io/cloud. However, the core of our offering is the open-source framework, and it will remain free.
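
    As a sketch of the getting-started flow described above (commands based on Ploomber’s CLI; assuming a project directory that already contains a pipeline.yaml):

    ```shell
    # install the open-source framework
    pip install ploomber

    # list and download the bundled example pipelines
    ploomber examples

    # inside a project with a pipeline.yaml:
    ploomber build   # run the DAG; only outdated tasks re-execute
    ploomber plot    # render the dependency graph
    ```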

    We’re thrilled to share Ploomber with you! If you’re a data scientist who has experienced these endless cycles of porting your code for deployment, an ML engineer who helps data scientists deploy their work, or you have any feedback, please share your thoughts! We love chatting about this domain since exchanging ideas always sheds light on aspects we haven’t considered before! You may also reach out to me at eduardo@ploomber.io.

  • orchest

    Build data pipelines, the easy way 🛠️

  • Congrats on the launch! It’s great to see validation of the usefulness of notebooks in data workflows even when moving beyond the proof of concept/exploration stage into production type workloads and deployments. Once deployed, iteration is often still necessary or desirable and that’s where having notebooks available for continued iteration is a big advantage.

    For those who’d like to compare and contrast solutions that support the use of notebooks in the (batch) deployment context, you can also check out Orchest (https://github.com/orchest/orchest). A meaningful point of difference between Ploomber and Orchest is that we are more container-oriented; we’ve found that containers give you robust units to deploy in production, with isolated and well-defined dependencies.

    Disclaimer: I’m one of the Orchest creators.

  • projects

    Sample projects using Ploomber. (by ploomber)

    I'm not a DVC user, so I'll speak to what I've seen in the documentation and the couple of examples I ran a while ago. DVC's core is data versioning, and its pipeline features are an extension of that. The main difference is that DVC's pipeline feature is agnostic: you define the command, inputs, and outputs, and DVC executes the pipeline. Ploomber, on the other hand, integrates more deeply with your code. For example, our SQL integration lets you tell Ploomber how to connect to a database and then list a bunch of SQL files as stages in your pipeline (example: https://github.com/ploomber/projects/blob/master/templates/s...). This cuts boilerplate a lot since you only write SQL; to do the same thing with DVC, you'd have to manage the connections yourself and create bash scripts to submit the queries.
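
    As a rough sketch of that SQL integration (shape based on Ploomber's pipeline.yaml convention; the connection module, file, and table names are hypothetical):

    ```yaml
    # pipeline.yaml -- SQL-only pipeline; Ploomber manages the connection
    clients:
      SQLScript: config.get_client   # hypothetical function returning a DB client
    tasks:
      - source: sql/create-table.sql
        product: [my_schema, clean_data, table]
      - source: sql/aggregate.sql
        product: [my_schema, daily_totals, table]
    ```

    Each .sql file contains only the query; Ploomber submits it through the configured client, so there are no bash glue scripts or connection handling in the task code.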

    The other important difference is that, AFAIK, DVC can only run your pipelines locally, while Ploomber can export these pipelines to other environments (Kubernetes, Airflow, AWS Batch, SLURM, Kubeflow). This lets you run experiments locally and easily move to a distributed environment when you need to train models at a larger scale or want to deploy an ML pipeline.
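
    The export step is handled by Ploomber's companion CLI, soopervisor; as a hedged sketch (the environment name is illustrative, and backend availability should be checked against the docs):

    ```shell
    pip install soopervisor

    # register a target environment for the pipeline; supported backends
    # include argo-workflows (Kubernetes), airflow, aws-batch, kubeflow, slurm
    soopervisor add training --backend argo-workflows

    # package the project and generate the backend-specific spec
    soopervisor export training
    ```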



Related posts

  • MLFlow users, what would you want from an integration with GitLab?

    6 projects | /r/mlops | 22 Apr 2022
  • [D] What MLOps platform do you use, and how helpful are they?

    3 projects | /r/MachineLearning | 24 Mar 2022
  • How do I number my .py file names?

    2 projects | /r/learnpython | 7 Feb 2022
  • Show HN: JupySQL – a SQL client for Jupyter (ipython-SQL successor)

    2 projects | news.ycombinator.com | 6 Dec 2023
  • Decent low code options for orchestration and building data flows?

    1 project | /r/dataengineering | 23 Dec 2022