Completed my first Data Engineering project with Kafka, Spark, GCP, Airflow, dbt, Terraform, Docker and more!

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  • streamify

    A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

  • I've documented it on Git itself. It's slightly more focused on the setup part, but you can still get an idea of the data flow.

  • eventsim

    Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

  • Eventsim is a program that generates event data to replicate page requests for a fake music website. The results look like real usage data but are totally fake. The Docker image is borrowed from viirya's fork, as the original project has gone without maintenance for a few years now.

  • eventsim

    Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic. (by viirya)

  • terraform

    Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

  • Infrastructure as Code software - Terraform

  • CPython

    The Python programming language

  • Language - Python

  • ApacheKafka

    A curated resources list for awesome Apache Kafka

  • Stream Processing - Kafka, Spark Streaming

  • Docker Compose

    Define and run multi-container applications with Docker

  • Containerization - Docker, Docker Compose

  • nodejs-bigquery

    Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.

  • Data Warehouse - BigQuery

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • Orchestration - Airflow

  • spark-bigquery-connector

    BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

  • corp

    Assets related to the operation of Fishtown Analytics.

  • Just a slight critique, but I noticed some of the dbt models are a bit hard to read, especially your dim_users SCD2 model, which uses lots of nested subqueries and puts multiple columns on the same line. You may want to refer to this style guide from dbt Labs. I find CTEs a lot easier to parse and read.
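
To illustrate the readability point in the comment above, here is a minimal sketch that runs the same query twice against an in-memory SQLite database: once with a nested subquery and once with a CTE. The `user_events` table, its columns, and the sample rows are hypothetical and are not taken from the project's actual dim_users model; SQLite stands in for BigQuery only so the example can run anywhere.

```python
import sqlite3

# Hypothetical toy data; the project's real dim_users model is not reproduced here.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_events (user_id INTEGER, level TEXT, ts INTEGER);
    INSERT INTO user_events VALUES
        (1, 'free', 100),
        (1, 'paid', 200),
        (2, 'free', 150);
    """
)

# Nested-subquery style: the aggregation is buried inside the join.
nested_sql = """
    SELECT e.user_id, e.level
    FROM user_events AS e
    JOIN (
        SELECT user_id, MAX(ts) AS last_ts
        FROM user_events
        GROUP BY user_id
    ) AS l
      ON l.user_id = e.user_id
     AND l.last_ts = e.ts
    ORDER BY e.user_id
"""

# CTE style: the same aggregation gets a name and is read top to bottom.
cte_sql = """
    WITH latest AS (
        SELECT
            user_id,
            MAX(ts) AS last_ts
        FROM user_events
        GROUP BY user_id
    )
    SELECT
        e.user_id,
        e.level
    FROM user_events AS e
    JOIN latest AS l
      ON l.user_id = e.user_id
     AND l.last_ts = e.ts
    ORDER BY e.user_id
"""

nested_rows = conn.execute(nested_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()

# Both queries return each user's most recent level.
print(nested_rows)
print(cte_rows)
```

Both statements return the same rows; the difference is only in how easily the intermediate step can be named, read, and reused, which is the kind of change the commenter is suggesting.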

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
