-
streamify
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
I've documented the project in the Git repository itself. The documentation focuses slightly more on the setup, but you can still get an idea of the data flow.
-
eventsim
Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
Eventsim is a program that generates event data to replicate page requests for a fake music website. The results look like real usage data, but are totally fake. The Docker image is borrowed from viirya's fork, as the original project has gone unmaintained for a few years now.
-
eventsim
Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic. (by viirya)
-
terraform
Terraform enables you to safely and predictably create, change, and improve infrastructure. It is an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
Infrastructure as Code software - Terraform
-
Language - Python
-
Stream Processing - Kafka, Spark Streaming
-
Containerization - Docker, Docker Compose
-
nodejs-bigquery
Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.
Data Warehouse - BigQuery
-
Orchestration - Airflow
-
spark-bigquery-connector
BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
-
Just a slight critique: I noticed some of the dbt models are a bit hard to read, especially your dim_users SCD2 model, which uses lots of nested subqueries and puts multiple columns on the same line. You may want to refer to the style guide from dbt Labs. I find CTEs a lot easier to parse and read.
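To illustrate the readability point, here is a minimal sketch that runs the same query twice, once with nested subqueries and once with CTEs, against an in-memory SQLite database. The `user_events` table, its columns, and the sample rows are hypothetical and are not taken from the project's dim_users model; the sketch only shows the general pattern of naming each step with a CTE and putting one column per line.

```python
import sqlite3

# Hypothetical sample data; not the project's actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_events (user_id INTEGER, level TEXT, event_ts INTEGER);
    INSERT INTO user_events VALUES
        (1, 'free', 100), (1, 'paid', 200),
        (2, 'free', 150), (2, 'free', 300);
""")

# Nested-subquery style: the reader has to work from the inside out.
NESTED_SQL = """
SELECT e.user_id, e.level
FROM user_events AS e
JOIN (
    SELECT user_id, MAX(event_ts) AS last_ts
    FROM (SELECT user_id, event_ts FROM user_events WHERE level IS NOT NULL)
    GROUP BY user_id
) AS latest
  ON latest.user_id = e.user_id AND latest.last_ts = e.event_ts
ORDER BY e.user_id
"""

# CTE style: each step has a name and reads top to bottom.
CTE_SQL = """
WITH valid_events AS (
    SELECT
        user_id,
        event_ts
    FROM user_events
    WHERE level IS NOT NULL
),
latest AS (
    SELECT
        user_id,
        MAX(event_ts) AS last_ts
    FROM valid_events
    GROUP BY user_id
)
SELECT
    e.user_id,
    e.level
FROM user_events AS e
JOIN latest
  ON latest.user_id = e.user_id
 AND latest.last_ts = e.event_ts
ORDER BY e.user_id
"""

nested_rows = conn.execute(NESTED_SQL).fetchall()
cte_rows = conn.execute(CTE_SQL).fetchall()

# Both queries return each user's most recent level.
print(nested_rows == cte_rows, cte_rows)
```

Both forms return the same rows; the CTE version is longer, but each intermediate result can be read and tested on its own.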