Data Engineering Projects for Beginners

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  • uber-expenses-tracking

    The goal of this project is to track Uber Rides and Uber Eats expenses through data engineering processes, using technologies such as Apache Airflow, AWS Redshift and Power BI.

  • Tracking your Uber Rides and Uber Eats expenses through a data engineering process

  • pyDag

    Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag

  • Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag

  • pyspark-on-aws-emr

    The goal of this project is to offer a ready-to-use AWS EMR template using Spot Fleet and On-Demand Instances, so you can focus on writing PySpark code.

  • Building Big Data Pipelines in the Cloud with AWS EMR

  • wbz

    A parallel implementation of the bzip2 data compressor in Python. The compression pipeline uses algorithms such as the Burrows–Wheeler transform (BWT) and move-to-front (MTF) to improve Huffman compression. For now, the tool focuses only on compressing .csv files and other files in tabular format.

  • Building a Lossless Data Compression and Data Decompression Pipeline

  • apache-spark-docker

    Dockerizing an Apache Spark Standalone Cluster

  • Learn how to dockerize an Apache Spark Standalone Cluster

  • docker-livy

    Dockerizing and Consuming an Apache Livy environment

  • Dockerizing and Consuming an Apache Livy environment

  • data-engineer-challenge

    Data Engineer Challenge

  • Design, Development and Deployment of a simple Data Pipeline

  • data-engineering-challenge-th

    Dockerizing a Python script for web scraping and consuming the scraped data using FastAPI (www.metroscubicos.com)

  • Dockerizing a Python Script for Faster Web Scraping

  • distance-metrics

    Distance metrics are one of the most important parts of some machine learning algorithms, both supervised and unsupervised. They help us calculate and measure similarities between numerical values expressed as data points.

  • Understanding Similarity Measures for Text Analysis

  • recommendation-system

    Build a Content-Based Movie Recommender System (TF-IDF, BM25, BERT)

  • Learn how to build a content-based Movie Recommender System

  • text-analysis-speeches-amlo

    Text analysis of the speeches, conferences and interviews of the current president of Mexico

  • A Text Analysis of Speeches

  • Dropout-Students-Prediction

    The goal of this project is to identify students at risk of dropping out of school.

  • Dropout Students Prediction
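
The wbz entry above names two transforms, the Burrows–Wheeler transform (BWT) and move-to-front (MTF), as preprocessing steps before Huffman coding. The following is a minimal sketch of those two steps only, not the wbz project's actual code. The function names and the `"\0"` sentinel are assumptions made for illustration, the rotation sort is the naive textbook version rather than a parallel or suffix-array implementation, and the Huffman stage is omitted.

```python
from __future__ import annotations


def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("input must not contain the sentinel character")
    s = text + sentinel  # the sentinel marks the end of the original string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "\0") -> str:
    """Rebuild the sorted rotation table column by column, then pick the original row."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]  # drop the sentinel


def mtf_encode(data: bytes) -> list[int]:
    """Move-to-front: emit each byte's position in a list, then move it to the front."""
    alphabet = list(range(256))
    encoded = []
    for byte in data:
        index = alphabet.index(byte)
        encoded.append(index)
        alphabet.insert(0, alphabet.pop(index))
    return encoded


def mtf_decode(indices: list[int]) -> bytes:
    """Invert move-to-front by replaying the same list updates."""
    alphabet = list(range(256))
    decoded = bytearray()
    for index in indices:
        byte = alphabet.pop(index)
        decoded.append(byte)
        alphabet.insert(0, byte)
    return bytes(decoded)


if __name__ == "__main__":
    row = "banana,banana,banana"
    transformed = bwt(row)
    indices = mtf_encode(transformed.encode("latin-1"))
    # Round trip: undo MTF, then undo BWT.
    restored = inverse_bwt(mtf_decode(indices).decode("latin-1"))
    print(indices)
    print(restored == row)
```

BWT tends to group identical characters into runs, and MTF turns such runs into sequences of small indices (mostly zeros), which is the kind of skewed distribution an entropy coder such as Huffman handles well.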

