- streamify: A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
- eventsim: Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
- terraform: Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
- nodejs-bigquery: Node.js client for Google Cloud BigQuery, a fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.
- spark-bigquery-connector: BigQuery data source for Apache Spark. Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
I've documented it on GitHub itself. It's slightly more focused on the setup part, but you can still get an idea of the data flow.
Eventsim is a program that generates event data to replicate page requests for a fake music website. The results look like real usage data, but are totally fake. The Docker image is borrowed from viirya's fork, as the original project has gone unmaintained for a few years now.
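To make the idea concrete, here is a toy sketch of what such a generator does. The page names, user pool, and event fields below are hypothetical simplifications; eventsim's real configuration (sessions, songs, auth states) is far richer, but the output shape is similar:

```python
import json
import random
import time
import uuid

# Hypothetical page set; eventsim's real config defines many more.
PAGES = ["Home", "NextSong", "Thumbs Up", "Logout"]

def make_event(user_id: int) -> dict:
    """Build one pseudo-random page-request event for a fake user."""
    return {
        "eventId": str(uuid.uuid4()),
        "userId": user_id,
        "page": random.choice(PAGES),
        "ts": int(time.time() * 1000),  # epoch millis
    }

def event_stream(n_users: int, n_events: int):
    """Yield a stream of events drawn from a fixed set of users."""
    users = list(range(1, n_users + 1))
    for _ in range(n_events):
        yield make_event(random.choice(users))

if __name__ == "__main__":
    # Print a few events as JSON lines, the way a simulator would
    # feed a Kafka producer.
    for event in event_stream(n_users=5, n_events=3):
        print(json.dumps(event))
```

In the actual project these events are pushed to Kafka topics rather than printed, but the "fixed user pool, pseudo-random pages, timestamped JSON" pattern is the core of it.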
- Infrastructure as Code: Terraform
- Language: Python
- Stream Processing: Kafka, Spark Streaming
- Containerization: Docker, Docker Compose
- Data Warehouse: BigQuery
- Orchestration: Airflow
Just a slight critique, but I noticed some of the dbt models are a bit hard to read. Especially your dim_users SCD2 model, which uses lots of nested subqueries and multiple columns on the same line. You may want to refer to this style guide from dbt Labs. I find CTEs are a lot easier to parse and read.
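For illustration, an SCD2 users model laid out with CTEs might look something like this. The model name, staging source, and columns here are hypothetical, not taken from the repo; the point is the structure, where each CTE does one step:

```sql
-- Hypothetical CTE-based SCD2 layout; stg_events and its columns
-- are assumed for illustration, not the project's actual schema.
with users as (
    select user_id, first_name, level, event_ts
    from {{ ref('stg_events') }}
    where user_id is not null
),

changes as (
    -- compare each row's tracked attribute to the previous one
    select
        *,
        lag(level) over (partition by user_id order by event_ts) as prev_level
    from users
),

scd as (
    -- keep only rows where the attribute actually changed,
    -- then derive validity windows
    select
        user_id,
        first_name,
        level,
        event_ts as valid_from,
        lead(event_ts) over (partition by user_id order by event_ts) as valid_to
    from changes
    where prev_level is null or level != prev_level
)

select
    *,
    valid_to is null as is_current
from scd
```

Each CTE can be read and debugged on its own (you can `select * from changes` while developing), which is the main readability win over nested subqueries.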