data-engineering

Open-source projects categorized as data-engineering

Top 23 data-engineering Open-Source Projects

  • superset

    Apache Superset is a Data Visualization and Data Exploration Platform

    Project mention: Apache Superset | news.ycombinator.com | 2024-02-26

    Superset is absolutely phenomenal. I really hope Microsoft eventually releases all the internal customizations they made to it to the open-source community.

    https://www.youtube.com/watch?v=RY0SSvSUkMA

    https://github.com/apache/superset/discussions/20094

  • Made-With-ML

    Learn how to design, develop, deploy and iterate on production-grade ML applications.

    Project mention: [D] How do you keep up to date on Machine Learning? | /r/learnmachinelearning | 2023-08-13

    Made With ML

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12

    Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.

  • applied-ml

    📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.

  • data-engineering-zoomcamp

    Free Data Engineering course!

    Project mention: Data Engineering Zoomcamp Week 6 - using redpanda 1 | dev.to | 2024-04-09

    References: Data engineering zoomcamp week 6 course and homework notes: https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/cohorts/2024/06-streaming

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13

  • argo

    Workflow Engine for Kubernetes

    Project mention: StackStorm – IFTTT for Ops | news.ycombinator.com | 2023-11-05

    Like Argo Workflows?

    https://github.com/argoproj/argo-workflows

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

    I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.

    It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

  • Cookbook

    The Data Engineering Cookbook

    Project mention: Transition to data engineering | /r/programare | 2023-07-12

    Take a look at https://github.com/andkret/Cookbook. The author also has a YouTube channel: https://www.youtube.com/@andreaskayy

  • data-engineer-roadmap

    Roadmap to becoming a data engineer in 2021

    Project mention: A question about data engineering? | /r/programiranje | 2023-06-30

  • dagster

    An orchestration platform for the development, production, and observation of data assets.

    Project mention: Experience with Dagster.io? | news.ycombinator.com | 2023-07-25

  • great_expectations

    Always know what to expect from your data.

    Project mention: Data Quality at Scale with Great Expectations, Spark, and Airflow on EMR | dev.to | 2023-04-24

    Great Expectations (GE) is an open-source data validation tool that helps ensure data quality.

  • Taipy

    Turns Data and AI algorithms into production-ready web applications in no time.

    Project mention: +10 Resources to Empower Women in Technology | dev.to | 2024-03-06

    I’ve been working in tech for more than five years. I started as a Data Scientist, and now I’m exploring and loving the DevRel 🥑 role for Taipy. Needless to say, evolving in the tech scene has been a ride full of ups, downs, and everything in between.

  • Benthos

    Fancy stream processing made operationally mundane

    Project mention: Ask HN: Who is hiring? (December 2023) | news.ycombinator.com | 2023-12-01

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12

    In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.

  • kestra

    Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

    Project mention: A High-Performance, Java-Based Orchestration Platform | /r/java | 2023-10-11

    Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here: https://github.com/kestra-io/kestra

  • cloudquery

    The open source high performance ELT framework powered by Apache Arrow

    Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06

    Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.

  • growthbook

    Open Source Feature Flagging and A/B Testing Platform

    Project mention: GrowthBook: Open-source feature flagging and A/B testing platform | /r/opensource | 2023-10-20

  • feast

    Feature Store for Machine Learning

    Project mention: What's Happening with Feast? | news.ycombinator.com | 2023-12-07

  • lakeFS

    lakeFS - Data version control for your data lake | Git for data

    Project mention: A Step-by-Step Guide to Implementing Data Version Control | dev.to | 2023-09-04

    # Download the LakeFS binary
    wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
    # Make the binary executable
    chmod +x lakefs
    # Initialize LakeFS with S3 as the storage backend
    ./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key

  • sql-translator

    SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.

    Project mention: Storybook GPT | dev.to | 2023-05-08

    I started to see more and more applications that use the OpenAI API and I wanted to try it out. One of these apps is this one made by Kate.

  • AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

    Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

    I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, along with a few other optimizations that made it a great tool.

  • paradedb

    Postgres for Search and Analytics

    Project mention: Using ClickHouse to scale an events engine | news.ycombinator.com | 2024-04-11

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-11.

Index

What are some of the best open-source data-engineering projects? This list will help you:

 #  Project                    Stars
 1  superset                   58,576
 2  Made-With-ML               35,567
 3  Airflow                    34,397
 4  applied-ml                 25,853
 5  data-engineering-zoomcamp  22,343
 6  Prefect                    14,512
 7  argo                       14,259
 8  airbyte                    13,821
 9  Cookbook                   12,899
10  data-engineer-roadmap      11,789
11  dagster                    10,114
12  great_expectations          9,418
13  Taipy                       8,257
14  Benthos                     7,516
15  Mage                        6,953
16  kestra                      6,260
17  cloudquery                  5,565
18  growthbook                  5,495
19  feast                       5,246
20  lakeFS                      4,053
21  sql-translator              3,953
22  AWS Data Wrangler           3,791
23  paradedb                    3,756