|9 months ago||5 days ago|
|MIT License||Apache License 2.0|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Well designed scala/spark project
4 projects | reddit.com/r/scala | 15 Oct 2022
Unit & integration testing in Databricks
3 projects | reddit.com/r/dataengineering | 30 Apr 2022
If the majority of your stuff is not UDF-based, there is an open-source solution for running assertion tests against full data frames called spark-fast-tests. The idea here is similar: you have an integration test (IT) notebook that calls your actual notebook against a staged input, reads the output, and compares it to a prefabricated expected output. This does take a bit of setup and trial and error, but it’s the closest I’ve been able to get to proper automated regression testing in Databricks.
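The workflow described in this comment can be sketched in plain Python. This is only an illustration of the idea, not the spark-fast-tests API: `run_pipeline`, the staged input rows, and the expected rows are all hypothetical stand-ins for the notebook under test and its fixtures.

```python
from collections import Counter


def run_pipeline(rows):
    """Hypothetical stand-in for the notebook under test: adds a `total` column."""
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


def rows_equal_ignoring_order(actual, expected):
    """Compare two lists of dict rows as multisets, so row order does not matter."""
    def freeze(rows):
        # Each row becomes a hashable tuple of (column, value) pairs sorted by column.
        return Counter(tuple(sorted(row.items())) for row in rows)
    return freeze(actual) == freeze(expected)


# Staged input and expected output (both invented for illustration).
staged_input = [
    {"id": 1, "price": 2.0, "qty": 3},
    {"id": 2, "price": 5.0, "qty": 1},
]
expected_output = [
    {"id": 2, "price": 5.0, "qty": 1, "total": 5.0},
    {"id": 1, "price": 2.0, "qty": 3, "total": 6.0},
]

actual_output = run_pipeline(staged_input)
assert rows_equal_ignoring_order(actual_output, expected_output)
```

With real Spark data frames, the comparison step would typically be handed to a library such as spark-fast-tests (Scala) or chispa (PySpark) rather than written by hand.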
Show dataengineering: beavis, a library for unit testing Pandas/Dask code
3 projects | reddit.com/r/dataengineering | 9 Aug 2021
I am the author of spark-fast-tests and chispa, libraries for unit testing Scala Spark / PySpark code.
Ask HN: What are some tools / libraries you built yourself?
264 projects | news.ycombinator.com | 16 May 2021
I built daria (https://github.com/MrPowers/spark-daria) to make it easier to write Spark and spark-fast-tests (https://github.com/MrPowers/spark-fast-tests) to provide a good testing workflow.
Built bebe (https://github.com/MrPowers/bebe) to expose the Spark Catalyst expressions that aren't exposed to the Scala / Python APIs.
Also built spark-sbt.g8 to create a Spark project with a single command: https://github.com/MrPowers/spark-sbt.g8
Open source contributions for a Data Engineer?
17 projects | reddit.com/r/dataengineering | 16 Apr 2021
I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.
dbt Cloud Alternatives?
2 projects | reddit.com/r/dataengineering | 23 Jan 2023
What's the best thing/library you learned this year ?
12 projects | reddit.com/r/Python | 16 Dec 2022
One that I haven't seen on here yet: dagster
Can we take a moment to appreciate how much of dataengineering is open source?
8 projects | reddit.com/r/dataengineering | 23 Nov 2022
Dagger Python SDK: Develop Your CI/CD Pipelines as Code
6 projects | news.ycombinator.com | 10 Nov 2022
I wondered how it related to https://dagster.io/
Data Engineer Github Profile?
3 projects | reddit.com/r/dataengineering | 9 Oct 2022
You can find all current, closed, and resolved issues in the “Issues” section and explore them using filters, e.g. issues for dagster. Look into some of the issues and feel free to ask a question or post your idea: it’s much less toxic here (compared to SO, for example).
[D] Should I go with Prefect, Argo or Flyte for Model Training and ML workflow orchestration?
3 projects | reddit.com/r/MachineLearning | 26 Sep 2022
You could also consider Dagster, which aims to improve Apache Airflow's shortcomings. Also, take a look at MyMLOps, where you can get a quick overview of open-source orchestration tools.
What aspects of Python should I learn that are most important for Data Engineering?
4 projects | reddit.com/r/dataengineering | 24 Sep 2022
Python is one of the most accessible programming l code within Python. My favorite is dagster, which forces you to write functional blocks of code with superior features, coming from a more SQL, T-SQL, and PL/SQL background. As a data engineer, I'd say you're not expected to write perfect code; it's better to know Big-O notation to avoid long-running data pipelines, even if your code doesn't look the prettiest. Static type checkers such as mypy might be another good thing to know, as they detect errors before runtime, which is the biggest problem of Python.
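As a small illustration of the Big-O and static typing points in this comment, the sketch below filters records against a collection of known IDs in two ways. The function names and sample data are invented for this example; the type hints are the kind of annotation a checker such as mypy would verify before the code runs.

```python
from __future__ import annotations

from typing import Iterable


def filter_known_slow(records: Iterable[dict], known_ids: list[int]) -> list[dict]:
    """O(n * m): every `in` check scans the whole list of known IDs."""
    return [r for r in records if r["id"] in known_ids]


def filter_known_fast(records: Iterable[dict], known_ids: list[int]) -> list[dict]:
    """O(n + m): build a set once, then each membership check is O(1) on average."""
    known = set(known_ids)
    return [r for r in records if r["id"] in known]


records = [{"id": i} for i in range(10)]
known_ids = [2, 3, 5, 7]

# Both versions return the same rows; only the running time differs as inputs grow.
assert filter_known_slow(records, known_ids) == filter_known_fast(records, known_ids)
```

On ten records the difference is invisible, but in a pipeline with millions of rows and a large ID list the list-based version can dominate the run time.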
Show HN: Airflow is cool but have you tried this for data pipelines?
2 projects | news.ycombinator.com | 21 Sep 2022
This is cool, but looks like https://github.com/dagster-io/dagster
The issue with less popular data pipeline projects is that they’re less stable in production
Tips for using Jupyter Notebooks with GitHub
5 projects | dev.to | 22 Aug 2022
Papermill can also target cloud storage outputs for hosting rendered notebooks, execute notebooks from custom Python code, and even be used within distributed data pipelines like Dagster (see Dagstermill). For more information, see the papermill documentation.
4 projects | reddit.com/r/dataengineering | 2 Aug 2022
There are specialized tools like DataHub (see this for column-level reporting: https://feature-requests.datahubproject.io/roadmap/541 ) that would help. But really, in a good data platform, the orchestration layer should be aggregating metadata and giving you everything you need to trace lineage. A tool like Dagster does this well if you make full use of the Software-Defined Assets capability, but that is fairly new, so not many people have embraced it yet.
What are some alternatives?
Prefect - The easiest way to build, run, and monitor data pipelines at scale.
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
airbyte - Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes.
MLflow - Open source platform for the machine learning lifecycle
OpenLineage - An Open Standard for lineage metadata collection
ploomber - The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
streamlit - Streamlit — The fastest way to build data apps in Python
Mage - 🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
superset - Apache Superset is a Data Visualization and Data Exploration Platform
hashi-ui - A modern user interface for @hashicorp Consul & Nomad