Completed my first Data Engineering project with Kafka, Spark, GCP, Airflow, dbt, Terraform, Docker and more!

This page summarizes the projects mentioned and recommended in the original post on reddit.com/r/dataengineering

  • streamify

    A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

    I've documented it on GitHub itself. It's slightly more focused on the setup, but you can still get an idea of the data flow.

  • eventsim

    Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

    Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.
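
    To give a feel for the output, here is a minimal stdlib-only Python sketch of this kind of simulator. The field names approximate the shape of eventsim's page-request events, but they are illustrative, not eventsim's actual schema.

    ```python
    import json
    import random
    import time

    # Illustrative only: these pages and fields approximate the shape of
    # eventsim's output (page requests for a fake music site), not its
    # actual schema.
    PAGES = ["NextSong", "Home", "Login", "Logout", "Add to Playlist"]

    def generate_events(n_users, n_events, seed=42):
        """Yield pseudo-random page-request events for a fixed set of users."""
        rng = random.Random(seed)
        start_ts = int(time.time() * 1000)
        for i in range(n_events):
            yield {
                "ts": start_ts + i * rng.randint(100, 5000),  # ms between requests
                "userId": rng.randint(1, n_users),
                "sessionId": rng.randint(1, n_users * 3),
                "page": rng.choice(PAGES),
            }

    if __name__ == "__main__":
        # Emit a few events as JSON lines, the way a simulator would feed Kafka.
        for event in generate_events(n_users=10, n_events=3):
            print(json.dumps(event))
    ```

    Seeding the generator makes runs reproducible, which is handy when testing a streaming pipeline end to end.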

  • terraform

    Terraform enables you to safely and predictably create, change, and improve infrastructure. It is an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

    Infrastructure as Code software - Terraform
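
    As a rough illustration of the declarative style described above, here is a minimal sketch of a GCP configuration. The project ID, resource names, and locations are placeholders, not the project's actual setup.

    ```hcl
    # Hypothetical sketch: provision a GCS bucket and a BigQuery dataset.
    # All names and locations below are placeholders.
    provider "google" {
      project = "my-project-id"
      region  = "us-central1"
    }

    resource "google_storage_bucket" "data_lake" {
      name     = "my-project-id-data-lake"
      location = "US"
    }

    resource "google_bigquery_dataset" "warehouse" {
      dataset_id = "warehouse"
      location   = "US"
    }
    ```

    Because the config is declarative, `terraform plan` shows the changes before `terraform apply` makes them, and the same files can be reviewed and versioned like any other code.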

  • CPython

    The Python programming language

    Language - Python

  • ApacheKafka

    A curated resources list for awesome Apache Kafka

    Stream Processing - Kafka, Spark Streaming

  • Docker Compose

    Define and run multi-container applications with Docker

    Containerization - Docker, Docker Compose
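
    For illustration, a minimal sketch of a Compose file wiring up a single Kafka broker; the service names and image tags are assumptions, not streamify's actual file.

    ```yaml
    # Hypothetical sketch of a single-broker Kafka setup with Compose.
    # Image tags and service names are illustrative.
    services:
      zookeeper:
        image: confluentinc/cp-zookeeper:7.3.0
        environment:
          ZOOKEEPER_CLIENT_PORT: 2181
      kafka:
        image: confluentinc/cp-kafka:7.3.0
        depends_on:
          - zookeeper
        environment:
          KAFKA_BROKER_ID: 1
          KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    ```

    With all the services declared in one file, `docker compose up` brings the whole stack up on a shared network where containers reach each other by service name.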

  • nodejs-bigquery

    Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.

    Data Warehouse - BigQuery

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Orchestration - Airflow

  • spark-bigquery-connector

    BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

  • corp

    Assets related to the operation of Fishtown Analytics.

    Just a slight critique, but I noticed some of the dbt models are a bit hard to read. Especially your dim_users SCD2 model, which uses lots of nested subqueries and multiple columns on the same line. You may want to refer to this style guide from dbt Labs. I find CTEs are a lot easier to parse and read.
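
    To illustrate the suggestion, here is a hypothetical sketch of an SCD2-style model written with CTEs, roughly in the shape the dbt Labs style guide recommends. The model and column names are invented for illustration, not the project's actual dim_users.

    ```sql
    -- Hypothetical sketch: CTE-based SCD2 model, one column per line.
    -- Table and column names are illustrative only.
    with users as (

        select * from {{ ref('stg_users') }}

    ),

    user_changes as (

        select
            user_id,
            level,
            valid_from,
            lead(valid_from) over (
                partition by user_id
                order by valid_from
            ) as valid_to
        from users

    )

    select
        user_id,
        level,
        valid_from,
        coalesce(valid_to, date '9999-12-31') as valid_to,
        (valid_to is null) as is_current
    from user_changes
    ```

    Each CTE names one step of the transformation, so a reader can check the windowing logic in isolation instead of unwinding nested subqueries.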
