data-pipeline

Top 23 data-pipeline Open-Source Projects

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

  • Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

    I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.

    It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

  • Snowplow

    The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP

  • kestra

    Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

  • Project mention: A High-Performance, Java-Based Orchestration Platform | /r/java | 2023-10-11

    Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here: https://github.com/kestra-io/kestra

  • Rudderstack

    Privacy and Security focused Segment-alternative, in Golang and React

  • Project mention: Rudderstack Switches to Elastic License | news.ycombinator.com | 2023-09-08
  • memphis

    Memphis.dev is a highly scalable and effortless data streaming platform

  • Project mention: Memphis | /r/devopspro | 2023-05-11
  • whylogs

    An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

  • ingestr

    ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

  • Project mention: FLaNK 04 March 2024 | dev.to | 2024-03-04
  • doit

    task management & automation tool

  • Project mention: How do you deal with CI, project config, etc. falling out of sync across repos? | /r/ExperiencedDevs | 2023-12-06

    I like mage for Go and doit for Python.

  • go-streams

    A lightweight stream processing library for Go

  • elementary

    The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

  • bitsail

    BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

  • DataEngineeringProject

    Example end to end data engineering project.

  • covalent

    Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments. (by AgnostiqHQ)

  • Project mention: Remote execution of code | /r/Python | 2023-12-05

    Pretty interesting request. If SSH is not used, I would try using something like dask, which uses TCP to connect and execute, assuming your workers are on another machine. I also think something like covalent can be used: you can write your own custom plugin in their ecosystem to connect however you want. We have a very custom private plugin written on top of covalent's, with a custom RPC-based protocol that connects our central on-prem GPU machines to our local laptops, mostly for high performance and to meet some mandated security requirements at the site where the GPU machines are. Once done it is pretty much something like

  • multiwoven

    🔥 Open Source Reverse ETL and Customer Data Platform (CDP). An open-source alternative to Hightouch, Census, and RudderStack.

  • Project mention: Multiwoven Reverse ETL (0.2.0) – Open-Source Alternative to Hightouch and Census | news.ycombinator.com | 2024-04-19

    Multiwoven is now a leading Open Source Alternative to Hightouch, Census, and Rudderstack.

    It's been a great journey so far, and we are excited to announce a major update to Multiwoven - our new release, Multiwoven 0.2.0, is now available!

    Repo: https://github.com/Multiwoven/multiwoven

    This release brings a host of new features, enhancements, and bug fixes to streamline data syncs and user experience.

    From new connectors to advanced reporting dashboards, as a team, we have been working hard on these updates based on the feedback and requests from our customers and the community.

    - 10+ new connectors added to Multiwoven, including

  • awesome-kafka

    A list about Apache Kafka

  • piperider

    Code review for data in dbt

  • Project mention: Show HN: PipeRider – open-source Data Impact Analysis for dbt changes | news.ycombinator.com | 2023-09-06
  • practical-data-engineering

    Practical Data Engineering: A Hands-On Real-Estate Project Guide

  • Project mention: Show HN: Hands-On Data Engineering with a Real-Estate Project Guide | news.ycombinator.com | 2024-03-20
  • conduit

    Conduit streams data between data stores. Kafka Connect replacement. No JVM required. (by ConduitIO)

  • Project mention: Pulling CDC data from Postgres | /r/dataengineering | 2023-04-30

    I'd like to mention Conduit + its Postgres connector. The Pg connector comes built-in, so all that is needed is a single Conduit binary to get started. It relies on WAL, but the connector creates the replication slot itself (if needed).

  • cuelake

    Use SQL to build ELT pipelines on a data lakehouse.

  • scicloj.ml

    A Clojure machine learning library

  • pipebird

    Pipebird is open source infrastructure for securely sharing data with customers.

  • premier-league

    A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.

  • Project mention: Google Cloud Portfolio Projects? | /r/googlecloud | 2023-12-09

    I have a data engineering project that uses BigQuery, Cloud Run, Compute Engine, Cloud SQL, Artifact Registry, Firestore, and Datastream.

  • spark

    Performance Observability for Apache Spark (by dataflint)

  • Project mention: Show HN: DataFlint, performance monitoring for Apache Spark | news.ycombinator.com | 2023-12-28
NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source data-pipeline projects? This list will help you:

Rank Project Stars
1 airbyte 14,054
2 Snowplow 6,734
3 kestra 6,340
4 Rudderstack 3,926
5 memphis 3,149
6 whylogs 2,548
7 ingestr 2,308
8 doit 1,781
9 go-streams 1,753
10 elementary 1,739
11 bitsail 1,576
12 DataEngineeringProject 985
13 covalent 689
14 multiwoven 617
15 awesome-kafka 565
16 piperider 467
17 practical-data-engineering 449
18 conduit 345
19 cuelake 284
20 scicloj.ml 199
21 pipebird 167
22 premier-league 142
23 spark 123
