How to Serve Massive Computations Using Python Web Apps.
1 project | dev.to | 23 Nov 2021
In this demo, we use the request itself as the trigger and begin computation immediately, but this may vary with the nature of your application. Often, you might have to use a separate pipeline instead. In such scenarios, you may need technologies such as Apache Airflow or Prefect.
Apache Airflow In EKS Cluster
1 project | dev.to | 10 Nov 2021
Airflow is one of the most popular tools for running workflows, especially data pipelines.
Distributed computing in python??
2 projects | reddit.com/r/learnpython | 9 Nov 2021
AWS MWAA and AWS SES integration
1 project | dev.to | 2 Nov 2021
This problem was already reported in a few Airflow issues and PRs. The fix didn't make the cut for Airflow 2.2 and will probably land in version 2.3, but because we are talking about MWAA (which runs 2.0.2), we don't really know when this will be fixed on AWS.
Noobie who is trying to use K8s needs confirmation to know if this is the way or he is overestimating Kubernetes.
3 projects | reddit.com/r/kubernetes | 20 Oct 2021
The Data Engineer Roadmap 🗺
12 projects | dev.to | 19 Oct 2021
Anything Comparable to power automate or flow for Linux?
2 projects | reddit.com/r/sysadmin | 17 Oct 2021
I've never used Power Automate, but it looks like a workflow orchestrator, so check out https://airflow.apache.org/
Airflow with different conda environments
1 project | reddit.com/r/dataengineering | 13 Oct 2021
If Airflow is the way to go, then try the DockerOperator (https://github.com/apache/airflow/blob/main/airflow/providers/docker/example_dags/example_docker.py). It's not the easiest setup, but from what I gather from your question, it will do what you need.
Databricks jobs and Airflow on Kubernetes
1 project | reddit.com/r/dataengineering | 2 Oct 2021
I have not used Databricks, but it is something we are looking into integrating into our infrastructure in the future. Since Databricks is a service that does not run locally, I would use the Databricks operators/hooks that come with Airflow rather than trying to build out anything of my own. https://github.com/apache/airflow/blob/main/airflow/providers/databricks/hooks/databricks.py
what do you think about airflow?
2 projects | reddit.com/r/dataengineering | 2 Oct 2021
I think one of the main design problems I have with Airflow is that it tends to tightly couple processing/transform code with data-movement code, which makes debugging tricky. The way I have solved this is by building a command-line interface to all the processing code, so I can debug the processing code outside of any Airflow infrastructure (which can be painful to get running locally if one does not use Airflow Breeze).
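A minimal sketch of that decoupling, using only `argparse`; the `transform` function and the `--input` flag are illustrative placeholders, not from any real pipeline. The point is that the processing logic is importable and debuggable with no orchestrator involved, and an Airflow task would merely shell out to this CLI:

```python
import argparse

def transform(records):
    # The actual processing logic: importable and testable on its own,
    # with no orchestrator in sight.
    return [r.strip().upper() for r in records if r.strip()]

def build_parser():
    parser = argparse.ArgumentParser(description="Run the transform step standalone.")
    parser.add_argument("--input", required=True, help="path to the input text file")
    return parser

def main(argv=None):
    # Passing argv explicitly makes the CLI callable from tests, too.
    args = build_parser().parse_args(argv)
    with open(args.input) as f:
        for line in transform(f.readlines()):
            print(line)

if __name__ == "__main__":
    main()
```

In the DAG, the task body then reduces to invoking this script (e.g. via a BashOperator), keeping data movement and processing code apart.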
Airflow 2.0 vs Prefect
1 project | reddit.com/r/dataengineering | 20 Oct 2021
It has been such a pleasure to use Dagster. The testability is nice. It was designed to be type-aware, so you can leverage type checks, and it is also designed to be data-aware when it comes to passing data between tasks. One thing I don't like is its handling of cases where a task produces no output but still needs to be declared as a dependency of another task: you have to use its Nothing abstraction, and the syntax for that situation is awkward IMO (they've recognized that). Its UI, called Dagit, is hands down the best, as it provides rich information on each task in your DAG. The developer experience is definitely better with Dagster compared to Airflow. I briefly looked at Airflow 2.0 examples, and I still think Dagster's API is better (as of version 0.13.x). However, on the managed-environment side, there is no third-party managed Dagster provider; only Elementl, the creator of Dagster, has a cloud offering, which is currently in beta. So there are no mature managed services for Dagster yet. Again, this is because Dagster is a relatively new library, less than three years old.
MLOps project based template
4 projects | reddit.com/r/mlops | 11 Oct 2021
Data Pipeline - Dagster
Runflow - define and run workflows using HCL2
1 project | reddit.com/r/datascience | 24 Jul 2021
I feel like Dagster is a hidden gem that serves a much broader base of Python data personas, because it is cross-platform: all of its key features (scheduler and web UI) work on Windows, unlike the other major workflow orchestration frameworks. Airflow? Nope. Prefect? Nope. They effectively ignored all the small-fry data folks in the corporate Windows world who are still critical to their organizations. So even a data analyst with Python coding experience can become immediately productive with Dagster. Its web UI is optional: you can execute pipelines through the Python API, the CLI, or the web UI, so you are not forced into one way of running things. It is designed to be general-purpose rather than domain-specific, which I think is a good thing, as it means it can be used in a wide variety of use cases.
New to data orchestration? Start here.
2 projects | dev.to | 2 Jun 2021
Second-generation data orchestration tools like Dagster and Prefect are more focused on being data-driven. They’re able to detect the kinds of data within DAGs and improve data awareness by anticipating the actions triggered by each data type.
Is Airflow a passé? What replaces it?
2 projects | reddit.com/r/dataengineering | 10 May 2021
There's Prefect and Dagster as up and comers in the space.
Scheduling tools for ETL and ML flow
3 projects | reddit.com/r/dataengineering | 7 May 2021
I would give Dagster a look. It has a built-in native scheduler and is cross-platform. It is general-purpose, so your team can grow with it and tackle a broader set of use cases if needed. If you struggle to get started after reading their docs/tutorials, you can take a look at my personal repo; I've gotten feedback a few times that my example has been very useful for getting started. I know they revamped their docs recently, but I haven't looked at their tutorial again or checked whether they provide an intermediate-level full example yet, so I need to get back in there to see.
API versioning has no “right way” (2017)
2 projects | news.ycombinator.com | 26 Apr 2021
Versioning is indeed a hard topic, especially for data science/engineering projects in production.
When you have a pipeline defined as a complex DAG of operations, you can't just version the entire thing, unless you have enough resources to re-compute from scratch with every change, which is wasteful. So you have to keep track of data dependencies and their versions if you want to ensure reproducibility.
Versioning code isn't enough when you have runtime parameters that affect output data, and you want to stay flexible by allowing experimentation and re-running computations with different parameters so you can iterate quickly. That poses a lot of challenges.
And there doesn't seem to be a framework that solves those issues out of the box. I'm closely watching Dagster (https://dagster.io), as they seem to be aware of those challenges (for example, for versioning: https://docs.dagster.io/guides/dagster/memoization), but I haven't tried it yet; it introduces a lot of concepts and has a steep learning curve.
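The code-plus-parameters versioning idea can be sketched in plain Python: key each op's cached output on its code version together with its runtime parameters, so changing either forces a recompute. This is a toy stand-in for what memoization-aware orchestrators do, not Dagster's actual API; `version_key` and `memoized_run` are invented names:

```python
import hashlib
import json

_cache = {}

def version_key(code_version, params):
    """Derive a cache key from an op's code version plus its runtime
    parameters, so a change to either invalidates the cached output."""
    blob = json.dumps({"code": code_version, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def memoized_run(op, code_version, params):
    """Run `op` only if no output exists for this (code version, params) pair."""
    key = version_key(code_version, params)
    if key not in _cache:
        _cache[key] = op(**params)
    return _cache[key]
```

In a real DAG the cache would live in durable storage, and each op's key would also fold in the keys of its upstream dependencies, which is exactly the dependency tracking the comment above calls for.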
Best technologies for a beginner DE.
1 project | reddit.com/r/dataengineering | 21 Apr 2021
Hi, how can I do pipeline automation?
2 projects | reddit.com/r/learnpython | 18 Apr 2021
If you are just starting out or new to automation, I would look at plain Python scripts executed with CRON if on Linux/Mac, or Windows Task Scheduler if on Windows. But you'll need bash knowledge (Linux/Mac) or batch-file knowledge (Windows). Then graduate to using frameworks. Since you didn't specify what kinds of jobs you want to automate, for general-purpose needs I would look at the class of frameworks called task orchestration frameworks or workflow management libraries. I would highly recommend Dagster, as it comes with a native scheduler, so you would be free from having to use CRON or Windows Task Scheduler. Other options include Prefect, but if you want its other features, like its scheduler and web GUI, you'll have to mess with Docker. That's what's nice about Dagster: it all works out of the box with no need for non-Python dependencies.
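For the "scripts on a schedule" stage, a recurring job can even stay inside Python using the standard-library `sched` module; this is a toy sketch only (no persistence across reboots, which CRON and Task Scheduler give you for free), and `run_repeatedly` is an invented helper:

```python
import sched
import time

def run_repeatedly(job, runs, interval_seconds):
    """Tiny stand-in for CRON: run `job` every `interval_seconds`,
    `runs` times in total, using only the standard library."""
    scheduler = sched.scheduler(time.time, time.sleep)
    for i in range(runs):
        # Schedule each run at its offset from now; run() blocks until done.
        scheduler.enter(i * interval_seconds, 1, job)
    scheduler.run()
```

Once jobs acquire dependencies on each other, this is the point where graduating to an orchestration framework with a native scheduler pays off.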
Open source contributions for a Data Engineer?
17 projects | reddit.com/r/dataengineering | 16 Apr 2021
It's a near crime that Dagster hasn't been mentioned already.
What are some alternatives?
Kedro - A Python framework for creating reproducible, maintainable and modular data science code.
Prefect - The easiest way to automate your data
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Dask - Parallel computing with task scheduling
Pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Apache Camel - Apache Camel is an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
airbyte - Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Numba - NumPy aware dynamic Python compiler using LLVM
Poetry - Python dependency management and packaging made easy.
n8n - Free and open fair-code licensed node based Workflow Automation Tool. Easily automate tasks across different services.