Top 23 Pipeline Open-Source Projects

jina

126 19,884 9.2 Python

☁️ Build multimodal AI applications with cloud-native stack

Project mention: Jina.ai: Self-host Multimodal models | news.ycombinator.com | 2024-01-26
vector

95 16,366 9.9 Rust

A high-performance observability data pipeline.

Project mention: FLaNK AI Weekly 18 March 2024 | dev.to | 2024-03-18
argo-cd

72 16,024 9.9 Go

Declarative Continuous Deployment for Kubernetes

Project mention: ArgoCD Deployment on RKE2 with Cilium Gateway API | dev.to | 2024-02-19

The code above will create the argocd Kubernetes namespace and deploy the latest stable manifest. If you would like to install a specific manifest, have a look here.
Prefect

19 14,512 9.9 Python

The easiest way to build, run, and monitor data pipelines at scale.

Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13
airbyte

139 13,821 10.0 Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.
great_expectations

15 9,418 9.9 Python

Always know what to expect from your data.

Project mention: Data Quality at Scale with Great Expectations, Spark, and Airflow on EMR | dev.to | 2023-04-24

Great Expectations (GE) is an open-source data validation tool that helps ensure data quality.
prql

106 9,414 9.9 Rust

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

Project mention: Prolog language for PostgreSQL proof of concept | news.ycombinator.com | 2024-03-30
Kedro

29 9,341 9.7 Python

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

Project mention: Nextflow: Data-Driven Computational Pipelines | news.ycombinator.com | 2023-08-10

Interesting, thanks for sharing. I'll definitely take a look, although at this point I am so comfortable with Snakemake that it is a bit hard to imagine what would convince me to move to another tool. But I like the idea of composable pipelines: I am building a tool (too early to share) that would allow layering Snakemake pipelines on top of each other using semi-automatic data annotations, similar to how it is done in kedro (https://github.com/kedro-org/kedro).
pipeline

51 8,270 9.7 Go

A cloud-native Pipeline resource.

Project mention: 14 DevOps and SRE Tools for 2024: Your Ultimate Guide to Stay Ahead | dev.to | 2023-12-04

Tekton
Taipy

15 8,257 9.9 Python

Turns Data and AI algorithms into production-ready web applications in no time.

Project mention: +10 Resources to Empower Women in Technology | dev.to | 2024-03-06

I’ve been working in tech for more than five years. I started as a Data Scientist, and now I’m exploring and loving the DevRel 🥑 role for Taipy. Needless to say, evolving in the tech scene has been a ride full of ups, downs, and everything in between.
proposal-pipeline-operator

102 7,359 2.7 HTML

A proposal for adding a useful pipe operator to JavaScript.

Project mention: Pipeline Operator great again! | dev.to | 2023-09-29

Current Status: You'd have to check the TC39 proposals repository or the official proposal text for the most recent status. As of my last update, it had not yet reached Stage 4 (final stage) of the TC39 process, which means it wasn't part of the ECMAScript specification yet.
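
The idea behind a pipe operator can be illustrated without any new syntax. The following is a minimal sketch in Python rather than JavaScript, and it is not the syntax or semantics of the TC39 proposal; the names `pipe`, `double`, and `increment` are hypothetical helpers introduced only for this example. It shows how threading a value through a sequence of functions reads left to right, in execution order, where nested calls read inside out.

```python
from functools import reduce
from typing import Any, Callable


def pipe(value: Any, *steps: Callable[[Any], Any]) -> Any:
    """Thread `value` through `steps` from left to right (hypothetical helper)."""
    return reduce(lambda acc, step: step(acc), steps, value)


def double(x: int) -> int:
    return x * 2


def increment(x: int) -> int:
    return x + 1


# Nested calls read inside out: double first, then increment, then str.
nested = str(increment(double(5)))

# The piped form lists the same steps in the order they run.
piped = pipe(5, double, increment, str)

print(nested, piped)
```

Both forms compute the same result; the piped version only changes how the sequence of steps is written down.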
Mage

76 6,953 9.9 Python

🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12

In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.
httpx

3 6,778 9.5 Go

httpx is a fast and multi-purpose HTTP toolkit that allows running multiple probes using the retryablehttp library. (by projectdiscovery)

Project mention: HTTP toolkit that allows running multiple probes | news.ycombinator.com | 2024-04-02
kestra

32 6,188 9.9 Java

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

Project mention: A High-Performance, Java-Based Orchestration Platform | /r/java | 2023-10-11

Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here: https://github.com/kestra-io/kestra
papermill

26 5,615 7.9 Python

📚 Parameterize, execute, and analyze notebooks

Project mention: Spreadsheet errors can have disastrous consequences – yet we keep making them | news.ycombinator.com | 2024-01-25

Pandas docs > Comparison with spreadsheets: https://pandas.pydata.org/docs/getting_started/comparison/co...
Pandas docs > I/O > Excel files: https://pandas.pydata.org/docs/user_guide/io.html#excel-file...
nteract/papermill: https://github.com/nteract/papermill :
> papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks. [...]
> This opens up new opportunities for how notebooks can be used. For example:
> - Perhaps you have a financial report that you wish to run with different values on the first or last day of a month or at the beginning or end of the year, using parameters makes this task easier.
"The World Excel Championship is being broadcast on ESPN" (2022) https://news.ycombinator.com/item?id=32420925 :
> Computational notebook speedrun ideas:
gaia

1 5,157 0.0 Go

Build powerful pipelines in any programming language.
jx

11 4,508 8.7 Go

Jenkins X provides automated CI+CD for Kubernetes with Preview Environments on Pull Requests using Cloud Native pipelines from Tekton

Project mention: I don't know what to do, any advice is welcome (original title: Nu stiu ce sa fac, orice sfat e bine venit) | /r/programare | 2023-05-24
GameDevMind

9 4,340 7.7 Shell

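The most comprehensive technology map for game development. It helps game developers save time on known problems so they can put more energy into more creative work.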
paradedb

16 3,756 9.8 Rust

Postgres for Search and Analytics

Project mention: Using ClickHouse to scale an events engine | news.ycombinator.com | 2024-04-11
pipelines

2 3,430 9.8 Python

Machine Learning Pipelines for Kubeflow
towhee

26 2,951 8.6 Python

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

Project mention: FLaNK Stack Weekly for 14 Aug 2023 | dev.to | 2023-08-14
nextflow

9 2,538 9.7 Groovy

A DSL for data-driven computational pipelines

Project mention: Nextflow: Data-Driven Computational Pipelines | news.ycombinator.com | 2023-08-10

> It's been a while since you can rerun/resume Nextflow pipelines
Yes, you can resume, but you need your whole upstream DAG to be present. Snakemake can rerun a job when only the dependencies of that job are present, which lets you neatly manage disk usage, or archive an intermediate state of a project and rerun things from there.
> and yes, you can have dry runs in Nextflow
You have stubs, which really isn't the same thing.
> I have no idea what you're referring to with the 'arbitrary limit of 1000 parallel jobs' though
I was referring to this issue: https://github.com/nextflow-io/nextflow/issues/1871. Except, the discussion doesn't do the issue full justice. Nextflow spawns each job in a separate thread, and when it tries to spawn 1000+ condor jobs it dies with a cryptic error message. The options -Dnxf.pool.type=sync and -Dnxf.pool.maxThreads=N prevent resuming, and Nextflow attempts to rerun the pipeline instead.
> As for deleting temporary files, there are features that allow you to do a few things related to that, and other features being implemented.
There are some hacks for this - but nothing I would feel safe to integrate into a production tool. They are implementing something - you're right - and it's been the case for several years now, so we'll see.
Snakemake has all that out of the box.
Pipcook

6 2,498 2.1 TypeScript

Machine learning platform for Web developers

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-11.

Pipeline related posts

HTTP toolkit that allows running multiple probes
1 project | news.ycombinator.com | 2 Apr 2024
Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres
1 project | news.ycombinator.com | 12 Dec 2023
14 DevOps and SRE Tools for 2024: Your Ultimate Guide to Stay Ahead
10 projects | dev.to | 4 Dec 2023
Simple task runner for automation pipelines
1 project | news.ycombinator.com | 3 Nov 2023
25 million Creative Commons image dataset released!
1 project | /r/StableDiffusion | 1 Oct 2023
Pipeline Operator great again!
2 projects | dev.to | 29 Sep 2023
Show HN: A JavaScript function that looks and behaves like a pipe operator
1 project | /r/patient_hackernews | 29 Sep 2023

Index

What are some of the best open-source Pipeline projects? This list will help you:

	Project	Stars
1	jina	19,884
2	vector	16,366
3	argo-cd	16,024
4	Prefect	14,512
5	airbyte	13,821
6	great_expectations	9,418
7	prql	9,414
8	Kedro	9,341
9	pipeline	8,270
10	Taipy	8,257
11	proposal-pipeline-operator	7,359
12	Mage	6,953
13	httpx	6,778
14	kestra	6,188
15	papermill	5,615
16	gaia	5,157
17	jx	4,508
18	GameDevMind	4,340
19	paradedb	3,756
20	pipelines	3,430
21	towhee	2,951
22	nextflow	2,538
23	Pipcook	2,498