versatile-data-kit
data-engineering-zoomcamp
versatile-data-kit | data-engineering-zoomcamp |
---|---|
52 | 119 |
409 | 22,446 |
2.2% | 5.1% |
9.7 | 9.4 |
7 days ago | 1 day ago |
Python | Jupyter Notebook |
Apache License 2.0 | - |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
versatile-data-kit
-
Looking for a data blogger
Here's the project: https://github.com/vmware/versatile-data-kit
-
Need advice on ETL tool
I don't really know if this would work for you because the UI is not functional yet, but there is a very simple REST API ingestion example (https://github.com/vmware/versatile-data-kit/wiki/Ingesting-data-from-REST-API-into-Database), and there's one for CSV too. I can't imagine a simpler way unless it's really drag and drop.
-
If dbt is the "T" part of an "ELT", what do you use for "EL"?
I work at VMware and we use one tool for the whole ELT. It was built internally because there was no good alternative at the time, and we have since open-sourced it: https://github.com/vmware/versatile-data-kit
-
Best way to fix errors in my data?
My team and I created a CSV ingestion plugin, described here. Maybe you want to try it out: https://github.com/vmware/versatile-data-kit/wiki/Ingesting-local-CSV-file-into-Database
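For readers unfamiliar with the idea, here is a minimal sketch of what ingesting a local CSV file into a database involves. It uses only the Python standard library (`csv` and `sqlite3`) and is not the plugin's API. The function name `ingest_csv`, the table name `cities`, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3


def ingest_csv(csv_file, conn, table):
    """Load every row of a CSV file object into a SQLite table (all columns as TEXT)."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first line holds the column names
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Hypothetical sample data standing in for a local CSV file.
sample = io.StringIO("id,city\n1,Sofia\n2,Palo Alto\n")
conn = sqlite3.connect(":memory:")
inserted = ingest_csv(sample, conn, "cities")
```

With a real file, `open("data.csv", newline="")` would replace the in-memory `io.StringIO` object. A dedicated ingestion tool would typically add type inference, batching, and error handling on top of this.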
-
What Orchestration Tool do you use for batch ETL/ELT?
We use Versatile Data Kit for batch data job orchestration (https://github.com/vmware/versatile-data-kit)
-
Dear, pipeline builders! Which step in your role is the most time consuming?
"suggestions on how to reduce the time spent on initially generating and adjusting the code" is using some tools that automate ELT. Here's one open-source tool I'm working on with my team: https://github.com/vmware/versatile-data-kit
-
Problem definition / vibe check for a repo
here's the repo: https://github.com/vmware/versatile-data-kit
-
Can we take a moment to appreciate how much of dataengineering is open source?
If you wish to contribute, projects usually have good first issues (https://github.com/vmware/versatile-data-kit/labels/good%20first%20issue). If you wish to learn, check out the examples (https://github.com/vmware/versatile-data-kit/tree/main/examples).
-
ETL question (noob)
Have you heard about versatile data kit (https://github.com/vmware/versatile-data-kit)? I think it meets your needs perfectly:
-
DE Open Source
Versatile Data Kit is a framework to build, run and manage your data pipelines with Python or SQL on any cloud (https://github.com/vmware/versatile-data-kit). Here's a list of good first issues (https://github.com/vmware/versatile-data-kit/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22). Join our Slack channel to connect with our team: https://cloud-native.slack.com/archives/C033PSLKCPR
data-engineering-zoomcamp
-
Data Engineering Zoomcamp Week 6 - using redpanda 1
References: Data engineering zoomcamp week 6 course and homework notes: https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/cohorts/2024/06-streaming
-
Final project part 5
dbt is the main part of my data engineering project for Data Talks Club's data engineering zoomcamp. After a few frustrating errors on my part, I finally figured out how to make models, where to put the staging models and where to put the core models, how to compile a seed file, and how to join it to the main file in order to produce data for visualization. I also used the git interface to continually upgrade my repository. This was extremely convenient and helpful.
-
Building a project in DBT
For Week 4 of DataTalksClub's data engineering zoomcamp, we had to install dbt and create a project. This was a formidable task. dbt is a data transformation tool that enables data analysts and engineers to transform data in a cloud analytics warehouse, BigQuery in our case. It took me a very long time to do this, and in this case I needed the homework extension.
-
Testing and documenting DBT models
In this video we learned how to test and document dbt models. We also learned about the codegen library. This is part of Week 4 of the data engineering zoomcamp by DataTalksClub.
-
Extracting data with dlt
If you want to run these commands yourself, either in a Jupyter notebook or in Google Colab, you can get the file from HERE. You can get an overview of the workshop HERE. When I ran in a Jupyter notebook, I had to delete the first line (%%capture) and put quotes around dlt[duckdb] in the second line.
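The post does not say why the quotes were needed. One likely reason (an assumption) is that some shells treat the square brackets in `dlt[duckdb]` as a glob pattern, so the extras specifier has to be quoted when it goes through a shell. The sketch below shows two ways around this from Python: passing the requirement as a separate argument, which involves no shell, and using `shlex.join` to produce a safely quoted command line. The helper name `pip_install_command` is hypothetical.

```python
import shlex
import sys


def pip_install_command(requirement):
    """Return the argument list for installing a requirement with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", requirement]


# As a list of arguments, no shell parses the brackets, so no quoting is needed.
args = pip_install_command("dlt[duckdb]")

# shlex.join quotes any argument that contains shell-special characters.
shell_line = shlex.join(["pip", "install", "dlt[duckdb]"])
print(shell_line)  # pip install 'dlt[duckdb]'
```

To actually install, `subprocess.run(args, check=True)` would run the command without going through a shell.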
-
Data engineering at home?
Take a look: DE zoomcamp
-
Rockstar Data Engineers making big bucks: what are you doing exactly?
If you need guidance you can attend the data engineering zoomcamp, it's free and quite solid.
-
Self study material
Welcome. Start with Data Engineering Zoomcamp, try and build a project, see if you like it, then continue to get into deeper resources.
-
What is the best way to learn Python if I want to become a data engineer
Can take a look at this - https://github.com/DataTalksClub/data-engineering-zoomcamp
-
Course Recommendations for a New Grad
I think you can start with something free with this pretty practical course on Data Engineering from DataTalksClub - https://github.com/DataTalksClub/data-engineering-zoomcamp
What are some alternatives?
Mage - 🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
mlops-zoomcamp - Free MLOps course from DataTalks.Club
quadratic - Quadratic | Data Science Spreadsheet with Python & SQL
AdventureWorks - Projects using the AdventureWorks database
pyramid-jsonapi - Auto-build JSON API from sqlalchemy models using the pyramid framework
Cookbook - The Data Engineering Cookbook
hamilton - A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton
Reddit-API-Pipeline
dbt-data-reliability - dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
udacity-capstone
DataEngineerZoomCamp - I'm partaking in a Data Engineering Bootcamp / Zoomcamp. I'll store files and progress here.