SaaSHub helps you find the best software and product alternatives
Versatile-data-kit Alternatives
Similar projects and alternatives to versatile-data-kit
-
missing-semester
The Missing Semester of Your CS Education 📚
-
aws-cdk
The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
Pulumi
Pulumi - Infrastructure as Code in any programming language. Build infrastructure intuitively on any cloud using familiar languages 🚀
-
Airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
superset
Apache Superset is a Data Visualization and Data Exploration Platform
-
terraform-cdk
Define infrastructure resources using programming constructs and provision them using HashiCorp Terraform
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
Apache Spark
Apache Spark - A unified analytics engine for large-scale data processing
-
dbt-core
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
Rudderstack
Privacy and Security focused Segment-alternative, in Golang and React
-
Apache Arrow
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
-
dagster
An orchestration platform for the development, production, and observation of data assets.
-
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
-
sqlfluff
A modular SQL linter and auto-formatter with support for multiple dialects and templated code.
-
versatile-data-kit reviews and mentions
-
Can we take a moment to appreciate how much of dataengineering is open source?
Free, Python+SQL ELT pipelines framework with orchestration functionality https://github.com/vmware/versatile-data-kit
If you wish to contribute, projects usually have good first issues (https://github.com/vmware/versatile-data-kit/labels/good%20first%20issue). If you wish to learn, check out the examples (https://github.com/vmware/versatile-data-kit/tree/main/examples).
-
DE Open Source
Versatile Data Kit is a framework to build, run and manage your data pipelines with Python or SQL on any cloud: https://github.com/vmware/versatile-data-kit Here's a list of good first issues: https://github.com/vmware/versatile-data-kit/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 Join our Slack channel to connect with our team: https://cloud-native.slack.com/archives/C033PSLKCPR
-
What is a personality type of a Data Engineer?
Okay, I will explain what I am doing and how I see the "fun" in the project. I work with an open-source framework for data engineers. The community members are developers and people who use the tool (DEs). I do facilitate a monthly community meeting for everyone to meet and discuss important topics, but that's the only part that takes their time directly, and it's totally voluntary. DEs usually don't join, but I'm glad that the developers join and participate. What I had in mind is more of a design and promotion question. I have a vision for open source projects to feel friendly and open (fun), which I communicate through the design and visuals that are part of the repo and the information we share about the project. And because I don't find long texts engaging (I literally can't focus when I see a long description of, say, a GitHub repo), I struggle with very detailed descriptions. That said, I wish I could transform the project into something more like this: https://github.com/mage-ai/mage-ai instead of this: https://github.com/vmware/versatile-data-kit But I'm questioning myself, and thinking that maybe it is better suited for DEs as it is.
-
Best Open source no-code ELT tool for startup
Open source, good for basic SQL and/or Python skills, extensible, and it comes with support for setup and adoption of the framework. https://github.com/vmware/versatile-data-kit I'm the community manager for this project. I built my first full ELT pipeline (tracking GitHub stats) with no previous experience, in my first month, totally by myself. It covers the full data journey. Oh, and it has an Airflow integration, so you can have a dashboard to see your jobs and dependencies, but with better, more intuitive scheduling.
-
I created a pipeline extracting Reddit data using Airflow, Docker, Terraform, S3, dbt, Redshift, and Google Data Studio
To simplify steps 1-5, I can bring another framework to your attention: Versatile Data Kit (entirely open-source), which allows you to create data jobs (be it ingestion, transformation, or publishing) with SQL/Python, runs on any cloud, and is also multi-tenant.
-
ELT of my own Strava data using the Strava API, MySQL, Python, S3, Redshift, and Airflow
I believe you would not need to build the Docker image yourself. There are data engineering frameworks that let you build your data jobs yourself and take care of containerising your pipeline for you. You can have a look at this ingest from REST API example. They also let you schedule your data job using cron, while the data job itself can contain SQL and Python.
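To make the "data job containing SQL and Python" idea concrete, here is a minimal sketch using only the Python standard library. It is not the Versatile Data Kit API: the function names, table names, and the sample payload (standing in for a REST API response) are all hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3

# Hypothetical payload standing in for a REST API response; no network call is made.
SAMPLE_RESPONSE = [
    {"id": 1, "athlete": "ann", "distance_km": 5.0},
    {"id": 2, "athlete": "bob", "distance_km": 12.5},
    {"id": 3, "athlete": "ann", "distance_km": 7.5},
]


def ingest(conn, records):
    """Python step: load raw records into a staging table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_activities "
        "(id INTEGER PRIMARY KEY, athlete TEXT, distance_km REAL)"
    )
    # INSERT OR REPLACE keyed on id keeps repeated runs idempotent.
    conn.executemany(
        "INSERT OR REPLACE INTO raw_activities "
        "VALUES (:id, :athlete, :distance_km)",
        records,
    )
    conn.commit()


def transform(conn):
    """SQL step: rebuild a reporting table from the staging table."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS athlete_totals;
        CREATE TABLE athlete_totals AS
        SELECT athlete, SUM(distance_km) AS total_km
        FROM raw_activities
        GROUP BY athlete;
        """
    )


def run_job(conn, records):
    """Run the steps in order and return the published rows."""
    ingest(conn, records)
    transform(conn)
    return conn.execute(
        "SELECT athlete, total_km FROM athlete_totals ORDER BY athlete"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_job(connection, SAMPLE_RESPONSE))
```

A scheduler such as cron would invoke a script like this on a fixed interval. A framework of the kind described above would take over the packaging, scheduling, and deployment around these steps.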
-
How-to-Guide: Contributing to Open Source
-
Has anyone "inherited" a pipeline/code/model that was so poorly written they wanted to quit their job?
I wouldn't stay there if they absolutely disagree with changing things; it would drain my energy and I'd just get sad and depressed. On the other hand, if you decide to go for it and try to untangle this mess, I think it would build your confidence, but it will take real patience and persistence. I'm a real automation geek: everything that can be automated should be. If you want advice, I would check out this open-source DataOps / automation tool: https://github.com/vmware/versatile-data-kit Maybe it helps, maybe not. Whatever you do, good luck!
-
Python or Tool for Pipelines
I would recommend taking a look at Versatile Data Kit. It is an open-source tool that covers the full end-to-end cycle of data engineering with DataOps practices embedded: ingesting data from a source system, transforming it (including implementations of some design patterns such as Kimball), and publishing data (for reports and apps).
-
A note from our sponsor - SaaSHub
www.saashub.com | 17 Apr 2024
Stats
vmware/versatile-data-kit is an open source project licensed under the Apache License 2.0, which is an OSI-approved license.
The primary programming language of versatile-data-kit is Python.
Popular Comparisons
- versatile-data-kit VS data-engineering-zoomcamp
- versatile-data-kit VS Mage
- versatile-data-kit VS quadratic
- versatile-data-kit VS pyramid-jsonapi
- versatile-data-kit VS hamilton
- versatile-data-kit VS Reddit-API-Pipeline
- versatile-data-kit VS dbt-data-reliability
- versatile-data-kit VS data-engineering-wiki
- versatile-data-kit VS missing-semester
- versatile-data-kit VS dbt-core