SaaSHub helps you find the best software and product alternatives Learn more →
Top 23 Python data-engineering Projects
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
soda-core
:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
-
meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
-
mlrun
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
-
Udacity-Data-Engineering-Projects
Few projects related to Data Engineering including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
vectorflow
VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice. (by dgarnitz)
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.
Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13
Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12I'l also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success with integrating Salesforce to a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.
Project mention: Python Day 9: Building Interactive Web Apps without HTML/CSS and JavaScript | dev.to | 2024-04-26Taipy is an open-source Python library that enables data scientists and developers to build robust end-to-end data pipelines.
Project mention: Phidata: Add memory, knowledge and tools to LLMs | news.ycombinator.com | 2024-05-06
Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas) and it supports reading and writing partitions which was really helpful and a few other optimizations that made it a great tool
Project mention: Show HN: JupySQL – a SQL client for Jupyter (ipython-SQL successor) | news.ycombinator.com | 2023-12-06- One-click sharing powered by Ploomber Cloud: https://ploomber.io
Documentation: https://jupysql.ploomber.io
Note that JupySQL is a fork of ipython-sql; which is no longer actively developed. Catherine, ipython-sql's creator, was kind enough to pass the project to us (check out ipython-sql's README).
We'd love to learn what you think and what features we can ship for JupySQL to be the best SQL client! Please let us know in the comments!
If the issue happen a lot, there is also: https://github.com/datafold/data-diff
That is a nice tool to do it cross database as well.
I think it's based on checksum method.
Project mention: Ask HN: Freelancer? Seeking freelancer? (December 2023) | news.ycombinator.com | 2023-12-03SEEKING FREELANCER | REMOTE | GERMANY
dltHub is looking for a freelance help in the following repos:
- https://github.com/dlt-hub/dlt
Project mention: meltano VS cloudquery - a user suggested alternative | libhunt.com/r/meltano | 2023-06-02
Project mention: Building a streaming SQL engine with Arrow and DataFusion | news.ycombinator.com | 2024-03-18
Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
Python data-engineering related posts
-
Phidata: Add memory, knowledge and tools to LLMs
-
Python Day 9: Building Interactive Web Apps without HTML/CSS and JavaScript
-
Building a streaming SQL engine with Arrow and DataFusion
-
Prefect: A workflow orchestration tool for data pipelines
-
+10 Resources to Empower Women in Technology
-
Show HN: Use function calling to build AI Assistants
-
Phidata: Build AI Assistants using function calling
-
A note from our sponsor - SaaSHub
www.saashub.com | 8 May 2024
Index
What are some of the best open-source data-engineering projects in Python? This list will help you:
Project | Stars | |
---|---|---|
1 | Airflow | 34,570 |
2 | Prefect | 14,724 |
3 | airbyte | 14,112 |
4 | dagster | 10,274 |
5 | great_expectations | 9,479 |
6 | Taipy | 8,731 |
7 | Mage | 7,085 |
8 | phidata | 5,340 |
9 | feast | 5,263 |
10 | AWS Data Wrangler | 3,804 |
11 | ploomber | 3,380 |
12 | data-diff | 2,847 |
13 | soda-core | 1,768 |
14 | dlt | 1,736 |
15 | meltano | 1,601 |
16 | pyspark-example-project | 1,370 |
17 | mlrun | 1,301 |
18 | Udacity-Data-Engineering-Projects | 1,295 |
19 | pyjanitor | 1,287 |
20 | bytewax | 1,160 |
21 | DataEngineeringProject | 985 |
22 | NeumAI | 783 |
23 | vectorflow | 637 |
Sponsored