dagster
analytics
dagster
- Experience with Dagster.io?
-
Dagster tutorials
My recommendation is to continue with the tutorial, then look at one of the larger example projects, especially the ones named “project_”; by then you should understand most of it. For anything you still don't understand and are curious about, look up the relevant concept page in the docs.
-
The Dagster Master Plan
I found this example that helped me - https://github.com/dagster-io/dagster/tree/master/examples/project_fully_featured/project_fully_featured
-
What are some open-source ML pipeline managers that are easy to use?
I would recommend the following: https://www.mage.ai/, https://dagster.io/, https://www.prefect.io/, https://metaflow.org/, https://zenml.io/home
-
The Why and How of Dagster User Code Deployment Automation
In Helm terms, there are two charts: the system chart, dagster/dagster (values.yaml), and the user code chart, dagster/dagster-user-deployments (values.yaml). Note that you have to set dagster-user-deployments.enabled: true in the dagster/dagster values.yaml to enable this.
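As a minimal sketch of the setting described above, the override for the dagster/dagster chart could look like the fragment below. Only the key named in the post is shown; the file name used afterwards is an assumption, and all other values are left at whatever the chart defines.

```yaml
# Values override for the dagster/dagster (system) chart.
# Only the key mentioned in the post is set here; every other
# value keeps the chart's default.
dagster-user-deployments:
  enabled: true
```

Assuming the fragment is saved as a hypothetical values-override.yaml, it would typically be passed to Helm with the -f flag when installing or upgrading the dagster/dagster release.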
-
Best Orchestration Tool to run dbt projects?
Dagster seemed really cool when I looked into it as an alternative to Airflow. I especially like the software-defined assets and built-in lineage, which I haven't seen in any other tool. However, it seems it does not support RBAC, which is a pretty big issue if you want a self-service type of architecture; see https://github.com/dagster-io/dagster/issues/2219. It does seem to be available in their hosted version, but I wanted to run it myself on k8s.
-
dbt Cloud Alternatives?
Dagster? https://dagster.io
-
What's the best thing/library you learned this year?
One that I haven't seen on here yet: dagster
- Anyone have an example of a project where a handful of the more popular Python tools are used? (E.g. airbyte, airflow, dbt, and pandas)
- Can we take a moment to appreciate how much of data engineering is open source?
analytics
-
I'm not getting it...what's the point of DBT?
Take a look at GitLab's dbt project: https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/models/common/schema.yml
-
How would you structure a repo with 10+ ETL pipelines and shared code?
A good reference is the GitLab data team repo: https://gitlab.com/gitlab-data/analytics
- What are your favourite GitHub repos that shows how data engineering should be done?
-
Are there any open corporate Data Team repositories / projects besides GitLab?
For example, their Data Team has a public repository with a bunch of information on how they organize DAGs, machine learning projects, system configuration, etc.
- Kimball Dim Modelling Code Examples
- Can someone help me, an absolute newbie, understand the usage and benefit of dbt with a practical example?
-
Is jinja templating right for DBT?
So I've run through the dbt tutorial material and looked over some fairly complex uses of it, e.g. GitLab Data, and I was wondering if anyone has any opinions or insights into the use of Jinja templating in the SQL?
-
Where can I find free data engineering (big data) projects online?
GitLab has open-sourced its dbt repo, which is very useful for seeing how to structure a project at scale. https://gitlab.com/gitlab-data/analytics/-/tree/master/transform/snowflake-dbt
-
Gitlab's Data Team Platform (in depth look at their stack)
Currently the team is working hard on this: https://gitlab.com/gitlab-data/analytics/-/issues/9508
-
Can someone explain the big deal with dbt?
GitLab's dbt project is an excellent example of a mature project at scale. They also have a comprehensive guide to their methodology.
What are some alternatives?
Prefect - The easiest way to build, run, and monitor data pipelines at scale.
dbt-synapse - dbt adapter for Azure Synapse Dedicated SQL Pools
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
castled - Castled is an open source reverse ETL solution that helps you to periodically sync the data in your db/warehouse into sales, marketing, support or custom apps without any help from engineering teams
Mage - 🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
datahub - The Metadata Platform for your Data Stack
airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
AdvancedSQLPuzzles - Welcome to my GitHub repository. I hope you enjoy solving these puzzles as much as I have enjoyed creating them.
MLflow - Open source platform for the machine learning lifecycle
lightdash - Self-serve BI to 10x your data team ⚡️
meltano
dbt-unit-testing - This dbt package contains macros to support unit testing that can be (re)used across dbt projects.