Airflow
pinball
DISCONTINUED
Our great sponsors
Airflow | pinball | |
---|---|---|
169 | 5 | |
33,953 | 1,812 | |
2.2% | - | |
10.0 | 9.7 | |
7 days ago | over 1 year ago | |
Python | Dart | |
Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Airflow
-
Airflow VS quix-streams - a user suggested alternative
2 projects | 7 Dec 2023
-
Simplifying Data Transformation in Redshift: An Approach with DBT and Airflow
Airflow is the most widely used and well-known tool for orchestrating data workflows. It allows for efficient pipeline construction, scheduling, and monitoring.
-
Ask HN: What is the correct way to deal with pipelines?
I agree there are many options in this space. Two others to consider:
- https://github.com/spotify/luigi
There are also many Kubernetes based options out there. For the specific use case you specified, you might even consider a plain old Makefile and incrond if you expect these all to run on a single host and be triggered by a new file showing up in a directory…
- Cómo construir tu propia data platform. From zero to hero.
-
Is it impossible to contribute to open source as a data engineer?
You can try and contribute some new connectors/operators for workflow managers like Airflow or Airbyte
-
Exploring MLOps Tools and Frameworks: Enhancing Machine Learning Operations
Apache Airflow:
-
Python task scheduler with a web UI
Looks interesting as a light-weight alternative to https://www.prefect.io/ (which itself is a lighter-weight / more modern alternative to https://airflow.apache.org/ ).
-
Working with Managed Workflows for Apache Airflow (MWAA) and Amazon Redshift
You can actually setup and delete new Redshift clusters using Apache Airflow. We can see in the example_dags of a DAG that does a complete setup and delete of a Redshift cluster. There are a few things to think about however.
-
.NET Modern Task Scheduler
A few years ago, I opened a GitHub issue with Microsoft telling them that I think the .NET ecosystem needs its own equivalent of Apache Airflow or Prefect. Fast forward 'til now, and I still don't think we have anything close to these frameworks.
-
How do you decide when to keep a project in a single python file vs break it up into multiple files?
Check out taskinstance.py in the Airflow project, it's a well targeted file, it has only one main class TaskInstance and a few small supporting classes and functions. It is ~3000 lines long: https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py
pinball
We haven't tracked posts mentioning pinball yet.
Tracking mentions began in Dec 2020.
What are some alternatives?
Kedro - Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
dagster - An orchestration platform for the development, production, and observation of data assets.
n8n - Free and source-available fair-code licensed workflow automation tool. Easily automate tasks across different services.
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Apache Spark - Apache Spark - A unified analytics engine for large-scale data processing
Dask - Parallel computing with task scheduling
Pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Apache Camel - Apache Camel is an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
argo - Workflow Engine for Kubernetes
Cronicle - A simple, distributed task scheduler and runner with a web based UI.