Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality. Learn more →
Top 23 data-engineering Open-Source Projects
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
applied-ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
kestra
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
-
sql-translator
SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Superset is absolutely phenomenal. I really hope Microsoft eventually releases all of their customizations they made to it internally to the OS community someday.
Project mention: [D] How do you keep up to date on Machine Learning? | /r/learnmachinelearning | 2023-08-13Made With ML
Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.
References: Data engineering zoomcamp week 6 course and homework notes: https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/cohorts/2024/06-streaming
Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13
Like Argo Workflows?
Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12I'l also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success with integrating Salesforce to a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.
https://github.com/andkret/Cookbook arunca un ochi aici. Omul are si youtube channel https://www.youtube.com/@andreaskayy
Project mention: Data Quality at Scale with Great Expectations, Spark, and Airflow on EMR | dev.to | 2023-04-24Great Expectations (GE) is an open-source data validation tool that helps ensure data quality.
I’ve been working in tech for more than five years. I started as a Data Scientist, and now I’m exploring and loving the DevRel 🥑 role for Taipy. Needless to say, evolving in the tech scene has been a ride full of ups, downs, and everything in between.
Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.
Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here: https://github.com/kestra-io/kestra
Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.
Project mention: GrowthBook: Open-source feature flagging and A/B testing platform | /r/opensource | 2023-10-20
# Download the LakeFS binary wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs # Make the binary executable chmod +x lakefs # Initialize LakeFS with S3 as the storage backend ./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key
I started to see more and more applications that use the OpenAI API and I wanted to try it out. One of these apps is this one made by Kate.
Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas) and it supports reading and writing partitions which was really helpful and a few other optimizations that made it a great tool
data-engineering related posts
- Show HN: Privacy-first analytics in natural language in the browser
- Ask HN: Best way to mirror a Postgres database to parquet?
- Data Engineering Zoomcamp Week 6 - using redpanda 1
- Final project part 5
- Show HN: Hands-On Data Engineering with a Real-Estate Project Guide
- Building a streaming SQL engine with Arrow and DataFusion
- Temporal.io: It Just Works
-
A note from our sponsor - InfluxDB
www.influxdata.com | 19 Apr 2024
Index
What are some of the best open-source data-engineering projects? This list will help you:
Project | Stars | |
---|---|---|
1 | superset | 58,576 |
2 | Made-With-ML | 35,610 |
3 | Airflow | 34,397 |
4 | applied-ml | 25,853 |
5 | data-engineering-zoomcamp | 22,343 |
6 | Prefect | 14,512 |
7 | argo | 14,259 |
8 | airbyte | 13,821 |
9 | Cookbook | 12,899 |
10 | data-engineer-roadmap | 11,789 |
11 | dagster | 10,114 |
12 | great_expectations | 9,418 |
13 | Taipy | 8,257 |
14 | Benthos | 7,516 |
15 | Mage | 6,953 |
16 | kestra | 6,260 |
17 | cloudquery | 5,565 |
18 | growthbook | 5,495 |
19 | feast | 5,246 |
20 | lakeFS | 4,053 |
21 | sql-translator | 3,953 |
22 | AWS Data Wrangler | 3,797 |
23 | paradedb | 3,756 |