How to Serve Massive Computations Using Python Web Apps.
1 project | dev.to | 23 Nov 2021
In this demo, we use the request itself as the trigger and begin computation immediately, but this may vary with the nature of your application. Often, you might have to use a separate pipeline instead. In such scenarios, you may need technologies such as Apache Airflow or Prefect.
Apache Airflow In EKS Cluster
1 project | dev.to | 10 Nov 2021
Airflow is one of the most popular tools for running workflows, especially data pipelines.
Distributed computing in python??
2 projects | reddit.com/r/learnpython | 9 Nov 2021
AWS MWAA and AWS SES integration
1 project | dev.to | 2 Nov 2021
This problem was already reported in a few Airflow issues and PRs. The fix didn't make the cut for Airflow 2.2 and will probably land in version 2.3, but because we are talking about MWAA (which runs 2.0.2), we don't really know when this will be fixed on AWS.
Noobie who is trying to use K8s needs confirmation to know if this is the way or he is overestimating Kubernetes.
3 projects | reddit.com/r/kubernetes | 20 Oct 2021
The Data Engineer Roadmap 🗺
12 projects | dev.to | 19 Oct 2021
Anything Comparable to power automate or flow for Linux?
2 projects | reddit.com/r/sysadmin | 17 Oct 2021
I've never used Power Automate, but it looks like a workflow orchestrator, so check out https://airflow.apache.org/
Airflow with different conda environments
1 project | reddit.com/r/dataengineering | 13 Oct 2021
If Airflow is the way to go, then try the DockerOperator (https://github.com/apache/airflow/blob/main/airflow/providers/docker/example_dags/example_docker.py). It's not the easiest setup, but from what I gather from your question, it will do what you need.
Databricks jobs and Airflow on Kubernetes
1 project | reddit.com/r/dataengineering | 2 Oct 2021
I have not used Databricks, but it is something we are looking into integrating into our infrastructure in the future. Since Databricks is a service that does not run locally, I would use the Databricks operators/hooks that come with Airflow rather than trying to build out anything of my own. https://github.com/apache/airflow/blob/main/airflow/providers/databricks/hooks/databricks.py
what do you think about airflow?
2 projects | reddit.com/r/dataengineering | 2 Oct 2021
I think one of the main design problems I have with Airflow is that it tends to tightly couple processing/transform code with data-movement code, which makes debugging tricky. The way I have solved this is by building a command-line interface to all the processing code, so I can debug the processing code outside of any Airflow infrastructure (which can be painful to get running locally if one does not use Airflow Breeze).
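A minimal sketch of that decoupling, using only `argparse`; the `transform` function and the `--input` flag are illustrative placeholders, not from any real pipeline. The point is that the processing logic is importable and debuggable with no orchestrator involved, and an Airflow task would merely shell out to this CLI:

```python
import argparse

def transform(records):
    # The actual processing logic: importable and testable on its own,
    # with no orchestrator in sight.
    return [r.strip().upper() for r in records if r.strip()]

def build_parser():
    parser = argparse.ArgumentParser(description="Run the transform step standalone.")
    parser.add_argument("--input", required=True, help="path to the input text file")
    return parser

def main(argv=None):
    # Passing argv explicitly makes the CLI callable from tests, too.
    args = build_parser().parse_args(argv)
    with open(args.input) as f:
        for line in transform(f.readlines()):
            print(line)

if __name__ == "__main__":
    main()
```

In the DAG, the task body then reduces to invoking this script (e.g. via a BashOperator), keeping data movement and processing code apart.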
Airflow 2.0 vs Prefect
1 project | reddit.com/r/dataengineering | 20 Oct 2021
It has been such a pleasure to use Dagster. The testability is nice. It was designed to be type-aware, so you can leverage type checks, and it is also designed to be data-aware when it comes to passing data between tasks. One thing I don't like is its handling of cases where a task produces no output but still needs to be declared as a dependency of another task: you have to use its Nothing abstraction, and the syntax for that situation is awkward IMO (they've recognized that). Its UI, called Dagit, is hands down the best, as it provides rich information on each task in your DAG. The developer experience is definitely better with Dagster compared to Airflow. I briefly looked at Airflow 2.0 examples, and I still think Dagster's API is better (as of version 0.13.x). However, on the managed-environment side, there is no third-party managed Dagster provider; only Elementl, the creator of Dagster, has a cloud offering, which is currently in beta. So there are no mature managed services for Dagster yet. Again, this is because Dagster is a relatively new library, less than three years old.
MLOps project based template
4 projects | reddit.com/r/mlops | 11 Oct 2021
Data Pipeline - Dagster
Runflow - define and run workflows using HCL2
1 project | reddit.com/r/datascience | 24 Jul 2021
I feel like Dagster is a hidden gem that serves a much broader base of Python data personas, because it is cross-platform: all of its key features (scheduler and web UI) work on Windows, unlike the other major workflow orchestration frameworks. Airflow? Nope. Prefect? Nope. They effectively ignored all the small-fry data folks in the corporate Windows world who are still critical to their organizations. So even a data analyst with Python coding experience can become immediately productive with Dagster. Its web UI is optional: you can execute pipelines through the Python API, the CLI, or the web UI, so you are not forced into one way of running things. It is designed to be general-purpose rather than domain-specific, which I think is a good thing, as it means it can be used in a wide variety of use cases.
New to data orchestration? Start here.
2 projects | dev.to | 2 Jun 2021
Second-generation data orchestration tools like Dagster and Prefect are more focused on being data-driven. They’re able to detect the kinds of data within DAGs and improve data awareness by anticipating the actions triggered by each data type.
Is Airflow a passé? What replaces it?
2 projects | reddit.com/r/dataengineering | 10 May 2021
There's Prefect and Dagster as up and comers in the space.
Scheduling tools for ETL and ML flow
3 projects | reddit.com/r/dataengineering | 7 May 2021
I would give Dagster a look. It has a built-in native scheduler and is cross-platform. It is general-purpose, so your team can grow with it and tackle a broader set of use cases if needed. If you struggle to get started after reading their docs/tutorials, you can take a look at my personal repo; I've gotten feedback a few times that my example has been very useful for getting started. I know they revamped their docs recently, but I haven't looked at their tutorial again or checked whether they provide an intermediate-level full example yet, so I need to get back in there to see.
API versioning has no “right way” (2017)
2 projects | news.ycombinator.com | 26 Apr 2021
Versioning is indeed a hard topic, especially for data science/engineering projects in production.
When you have a pipeline defined as a complex DAG of operations, you can't just version the entire thing, unless you have enough resources to re-compute from scratch with every change, which is wasteful. So you have to keep track of data dependencies and their versions if you want to ensure reproducibility.
Versioning code isn't enough when you have runtime parameters that affect output data, and you want to stay flexible by allowing experimentation and re-running computations with different parameters so you can iterate quickly. That poses a lot of challenges.
And there doesn't seem to be a framework that solves those issues out of the box. I'm closely watching Dagster (https://dagster.io), as they seem to be aware of those challenges (for example, for versioning: https://docs.dagster.io/guides/dagster/memoization), but I haven't tried it yet; it introduces a lot of concepts and has a steep learning curve.
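The code-plus-parameters versioning idea can be sketched in plain Python: key each op's cached output on its code version together with its runtime parameters, so changing either forces a recompute. This is a toy stand-in for what memoization-aware orchestrators do, not Dagster's actual API; `version_key` and `memoized_run` are invented names:

```python
import hashlib
import json

_cache = {}

def version_key(code_version, params):
    """Derive a cache key from an op's code version plus its runtime
    parameters, so a change to either invalidates the cached output."""
    blob = json.dumps({"code": code_version, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def memoized_run(op, code_version, params):
    """Run `op` only if no output exists for this (code version, params) pair."""
    key = version_key(code_version, params)
    if key not in _cache:
        _cache[key] = op(**params)
    return _cache[key]
```

In a real DAG the cache would live in durable storage, and each op's key would also fold in the keys of its upstream dependencies, which is exactly the dependency tracking the comment above calls for.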
Best technologies for a beginner DE.
1 project | reddit.com/r/dataengineering | 21 Apr 2021
Hi, how can I do pipeline automation?
2 projects | reddit.com/r/learnpython | 18 Apr 2021
If you are just starting out or new to automation, I would look at plain Python scripts executed with CRON if on Linux/Mac, or Windows Task Scheduler if on Windows. But you'll need bash knowledge (Linux/Mac) or batch-file knowledge (Windows). Then graduate to using frameworks. Since you didn't specify what kinds of jobs you want to automate, for general-purpose needs I would look at the class of frameworks called task orchestration frameworks or workflow management libraries. I would highly recommend Dagster, as it comes with a native scheduler, so you would be free from having to use CRON or Windows Task Scheduler. Other options include Prefect, but if you want its other features, like its scheduler and web GUI, you'll have to mess with Docker. That's what's nice about Dagster: it all works out of the box with no need for non-Python dependencies.
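For the "scripts on a schedule" stage, a recurring job can even stay inside Python using the standard-library `sched` module; this is a toy sketch only (no persistence across reboots, which CRON and Task Scheduler give you for free), and `run_repeatedly` is an invented helper:

```python
import sched
import time

def run_repeatedly(job, runs, interval_seconds):
    """Tiny stand-in for CRON: run `job` every `interval_seconds`,
    `runs` times in total, using only the standard library."""
    scheduler = sched.scheduler(time.time, time.sleep)
    for i in range(runs):
        # Schedule each run at its offset from now; run() blocks until done.
        scheduler.enter(i * interval_seconds, 1, job)
    scheduler.run()
```

Once jobs acquire dependencies on each other, this is the point where graduating to an orchestration framework with a native scheduler pays off.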
Open source contributions for a Data Engineer?
17 projects | reddit.com/r/dataengineering | 16 Apr 2021
It's a near crime that Dagster hasn't been mentioned already.
What are some alternatives?
Kedro - A Python framework for creating reproducible, maintainable and modular data science code.
Prefect - The easiest way to automate your data
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Dask - Parallel computing with task scheduling
Pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Apache Camel - Apache Camel is an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
airbyte - Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Numba - NumPy aware dynamic Python compiler using LLVM
Poetry - Python dependency management and packaging made easy.
n8n - Free and open fair-code licensed node based Workflow Automation Tool. Easily automate tasks across different services.