- Mage: 🧙 The modern replacement for Airflow, an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
- astro: Discontinued. The Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow. [Moved to: https://github.com/astronomer/astro-sdk] (by astro-projects)
- astro-sdk: The Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
Great point, totally agree. You want a tool that can build complex, high-code data pipelines: load data from multiple sources, run a bunch of transformations in parallel, then export the results to another table or to multiple destinations. But that same tool should also be able to build simple data integration pipelines, e.g. fetch data from Salesforce and replicate it in Snowflake. That's what Mage can do: batch pipelines and data integration pipelines.
We're still in the early stages, but since you've worked with Lambda, it would be really valuable to hear your thoughts if you get a chance to check out the readme: https://github.com/typhoon-data-org/typhoon-orchestrator.
What I would suggest: if you want an "Airflow 3.0" feel, check out the Astro SDK. My team and I basically spent a year and a half rewriting the Airflow DAG-writing experience from the ground up. It has a completely different feel: highly scalable SQL/Python (and soon Spark) workflows that basically feel like native Python, and they're way easier to test as well. You can pass dataframes into SQL queries, load data from any supported source into any supported warehouse, and things like lineage are natively supported :)
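To make the "SQL that feels like native Python" idea concrete, here is a minimal stdlib sketch of the pattern: a decorator that turns a function returning a SQL template into a callable that materializes the query result as a new table. This is illustrative only; the `transform` decorator, the `{{name}}` placeholder syntax, and the table names are invented for the sketch and are not the actual astro-sdk API (which has its own decorators and Table objects, documented in the repo).

```python
import sqlite3

# Hypothetical sketch of the "SQL as a templated Python function" pattern.
# Not the astro-sdk API; just the shape of the idea, using sqlite3.

def transform(func):
    """Turn a function returning a SQL template into a callable that
    materializes the query's result as a new table."""
    def wrapper(conn, output_table, **tables):
        sql = func(**tables)
        # Substitute {{name}} placeholders with actual table names.
        for name, table in tables.items():
            sql = sql.replace("{{" + name + "}}", table)
        conn.execute(f"CREATE TABLE {output_table} AS {sql}")
        return output_table
    return wrapper

@transform
def big_orders(orders):
    # The SQL references a Python-level parameter, not a hardcoded table.
    return "SELECT id, amount FROM {{orders}} WHERE amount > 100"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (2, 250.0), (3, 300.0)])

big_orders(conn, output_table="big_orders", orders="orders")
rows = conn.execute("SELECT id FROM big_orders ORDER BY id").fetchall()
print(rows)  # only the rows with amount > 100
```

The appeal of the real thing is that the same function-call ergonomics work across warehouses, with lineage tracked for you.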
Mage uses the Singer Spec (https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md), the data engineering community's standard for building data integrations. It was created by Stitch and is widely adopted.
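For context on what the Singer Spec looks like on the wire: a tap writes newline-delimited JSON messages of three types (SCHEMA, RECORD, STATE) to stdout, and a target consumes them. A minimal hand-rolled illustration, with made-up stream and field names:

```python
import json

# The three Singer message types. A real tap emits these as JSON lines
# on stdout; the "users" stream and its fields are invented here.
messages = [
    {"type": "SCHEMA", "stream": "users",
     "schema": {"type": "object",
                "properties": {"id": {"type": "integer"},
                               "name": {"type": "string"}}},
     "key_properties": ["id"]},                  # primary key of the stream
    {"type": "RECORD", "stream": "users",
     "record": {"id": 1, "name": "Ada"}},        # one row of data
    {"type": "STATE",
     "value": {"bookmarks": {"users": {"last_id": 1}}}},  # resumable cursor
]

for msg in messages:
    print(json.dumps(msg))  # one JSON object per line, per the spec
```

Because the format is just typed JSON lines over a pipe, any tap can be paired with any target, which is why it became a de facto standard.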
More of a general principle, but when you don't have design patterns, you get varying levels of results, right? I think what Astro is doing to introduce "strong defaults" through projects like the astro-sdk or the Cloud IDE is an interesting experiment: removing some of the busy work of common DAGs (load from S3, do something, push to a database) will help reduce the cognitive load of really common, simple actions and give people a single, better pattern to optimize on. I don't think those efforts reduce the optionality of true power users at all; anyone who wants to custom-code their S3 log sink with some unique implementation still can, while some of the fragmentation around very frequently performed operations gets solved. 🤞
Rewrite Airflow on top of temporal.io. That way you get unlimited scalability and very high reliability out of the box, and you'd be free to innovate on the features that matter for DE.
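The reliability claim rests on Temporal's durable-execution model. Here is a toy stdlib sketch of that idea, not the actual temporalio SDK (which achieves this transparently by replaying a persisted event history): checkpoint each step's result so a retried workflow resumes where it left off instead of redoing work. All class, step, and function names here are invented.

```python
# Toy sketch of durable execution: completed steps are checkpointed,
# so re-running the workflow after a crash skips finished work.

class DurableWorkflow:
    def __init__(self, store):
        self.store = store  # persisted step results (a dict stands in for a DB)

    def step(self, name, fn):
        if name in self.store:      # already completed on a prior attempt
            return self.store[name]
        result = fn()
        self.store[name] = result   # checkpoint before moving on
        return result

calls = []  # track which steps actually execute

def run(wf):
    a = wf.step("extract", lambda: calls.append("extract") or [1, 2, 3])
    b = wf.step("transform", lambda: calls.append("transform") or [x * 2 for x in a])
    return wf.step("load", lambda: calls.append("load") or sum(b))

store = {}
total = run(DurableWorkflow(store))    # first attempt runs all three steps
total2 = run(DurableWorkflow(store))   # "retry" replays from checkpoints only
print(total, total2, calls)
```

An orchestrator built on this model gets retries, resumption, and exactly-once step effects from the runtime rather than reimplementing them in the scheduler.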
Related posts
- The Design Philosophy of Great Tables (Software Package)
- Welcome to 14 days of Data Science!
- [D] Major bug in Scikit-Learn's implementation of F-1 score
- Read files from s3 using Pandas/s3fs or AWS Data Wrangler?
- Why do companies still build data ingestion tooling instead of using a third-party tool like Airbyte?