Completed my first Data Engineering project with Kafka, Spark, GCP, Airflow, dbt, Terraform, Docker and more!

This page summarizes the projects mentioned and recommended in the original post on reddit.com/r/dataengineering

  • streamify

    A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

    I've documented it on GitHub itself. It's slightly more focused on the setup, but you can still get an idea of the data flow.

  • eventsim

    Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

    Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.
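
    To give a feel for the output, here is a minimal stdlib-only Python sketch of this kind of simulator. The field names approximate the shape of eventsim's page-request events, but they are illustrative, not eventsim's actual schema.

    ```python
    import json
    import random
    import time

    # Illustrative only: these pages and fields approximate the shape of
    # eventsim's output (page requests for a fake music site), not its
    # actual schema.
    PAGES = ["NextSong", "Home", "Login", "Logout", "Add to Playlist"]

    def generate_events(n_users, n_events, seed=42):
        """Yield pseudo-random page-request events for a fixed set of users."""
        rng = random.Random(seed)
        start_ts = int(time.time() * 1000)
        for i in range(n_events):
            yield {
                "ts": start_ts + i * rng.randint(100, 5000),  # ms between requests
                "userId": rng.randint(1, n_users),
                "sessionId": rng.randint(1, n_users * 3),
                "page": rng.choice(PAGES),
            }

    if __name__ == "__main__":
        # Emit a few events as JSON lines, the way a simulator would feed Kafka.
        for event in generate_events(n_users=10, n_events=3):
            print(json.dumps(event))
    ```

    Seeding the generator makes runs reproducible, which is handy when testing a streaming pipeline end to end.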

  • terraform

    Terraform enables you to safely and predictably create, change, and improve infrastructure. It is an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

    Infrastructure as Code software - Terraform
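
    As a rough illustration of the declarative style described above, here is a minimal sketch of a GCP configuration. The project ID, resource names, and locations are placeholders, not the project's actual setup.

    ```hcl
    # Hypothetical sketch: provision a GCS bucket and a BigQuery dataset.
    # All names and locations below are placeholders.
    provider "google" {
      project = "my-project-id"
      region  = "us-central1"
    }

    resource "google_storage_bucket" "data_lake" {
      name     = "my-project-id-data-lake"
      location = "US"
    }

    resource "google_bigquery_dataset" "warehouse" {
      dataset_id = "warehouse"
      location   = "US"
    }
    ```

    Because the config is declarative, `terraform plan` shows the changes before `terraform apply` makes them, and the same files can be reviewed and versioned like any other code.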

  • CPython

    The Python programming language

    Language - Python

  • ApacheKafka

    A curated resources list for awesome Apache Kafka

    Stream Processing - Kafka, Spark Streaming

  • Docker Compose

    Define and run multi-container applications with Docker

    Containerization - Docker, Docker Compose
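
    For illustration, a minimal sketch of a Compose file wiring up a single Kafka broker; the service names and image tags are assumptions, not streamify's actual file.

    ```yaml
    # Hypothetical sketch of a single-broker Kafka setup with Compose.
    # Image tags and service names are illustrative.
    services:
      zookeeper:
        image: confluentinc/cp-zookeeper:7.3.0
        environment:
          ZOOKEEPER_CLIENT_PORT: 2181
      kafka:
        image: confluentinc/cp-kafka:7.3.0
        depends_on:
          - zookeeper
        environment:
          KAFKA_BROKER_ID: 1
          KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    ```

    With all the services declared in one file, `docker compose up` brings the whole stack up on a shared network where containers reach each other by service name.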

  • nodejs-bigquery

    Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.

    Data Warehouse - BigQuery

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Orchestration - Airflow

  • spark-bigquery-connector

    BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

  • corp

    Assets related to the operation of Fishtown Analytics.

    Just a slight critique, but I noticed some of the dbt models are a bit hard to read. Especially your dim_users SCD2 model, which uses lots of nested subqueries and multiple columns on the same line. You may want to refer to this style guide from dbt Labs. I find CTEs are a lot easier to parse and read.
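
    To illustrate the suggestion, here is a hypothetical sketch of an SCD2-style model written with CTEs, roughly in the shape the dbt Labs style guide recommends. The model and column names are invented for illustration, not the project's actual dim_users.

    ```sql
    -- Hypothetical sketch: CTE-based SCD2 model, one column per line.
    -- Table and column names are illustrative only.
    with users as (

        select * from {{ ref('stg_users') }}

    ),

    user_changes as (

        select
            user_id,
            level,
            valid_from,
            lead(valid_from) over (
                partition by user_id
                order by valid_from
            ) as valid_to
        from users

    )

    select
        user_id,
        level,
        valid_from,
        coalesce(valid_to, date '9999-12-31') as valid_to,
        (valid_to is null) as is_current
    from user_changes
    ```

    Each CTE names one step of the transformation, so a reader can check the windowing logic in isolation instead of unwinding nested subqueries.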
