Data Engineering Projects for Beginners

uber-expenses-tracking

2 94 2.6 Jupyter Notebook

The goal of this project is to track Uber Rides and Uber Eats expenses through data engineering processes, using technologies such as Apache Airflow, AWS Redshift and Power BI.

Tracking your Uber Rides and Uber Eats expenses through a data engineering process

pyDag

2 24 0.0 Python

Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag

pyspark-on-aws-emr

1 24 3.6 Python

The goal of this project is to offer a ready-to-use AWS EMR template based on Spot Fleet and On-Demand Instances, so you can focus on writing PySpark code.

Building Big Data Pipelines in the Cloud with AWS EMR

wbz

1 13 0.0 Python

A parallel implementation of the bzip2 data compressor in Python. The compression pipeline uses the Burrows–Wheeler transform (BWT) and move-to-front (MTF) encoding to improve Huffman compression. For now, the tool focuses only on compressing .csv files and other files in tabular format.

Building a Lossless Data Compression and Data Decompression Pipeline
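
To make the BWT and MTF stages mentioned above concrete, here is a minimal, unoptimized sketch in Python. It is not taken from wbz: the function names, the `"\0"` end-of-string marker, and the sorted-rotations approach are assumptions chosen for readability, and the Huffman stage is left out.

```python
def bwt(text, eos="\0"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if eos in text:
        raise ValueError("input must not contain the end-of-string marker")
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, eos="\0"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    # The row that ends with the marker is the original string plus marker.
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]


def mtf_encode(data, alphabet):
    """Move-to-front: emit each symbol's current index, then move it to the front."""
    symbols = list(alphabet)
    indices = []
    for ch in data:
        idx = symbols.index(ch)
        indices.append(idx)
        symbols.insert(0, symbols.pop(idx))
    return indices


def mtf_decode(indices, alphabet):
    """Inverse of mtf_encode, starting from the same initial alphabet."""
    symbols = list(alphabet)
    out = []
    for idx in indices:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


if __name__ == "__main__":
    transformed = bwt("banana")
    alphabet = sorted(set(transformed))
    codes = mtf_encode(transformed, alphabet)
    restored = inverse_bwt(mtf_decode(codes, alphabet))
    print(repr(transformed), codes, restored)
```

BWT tends to group identical symbols together, and MTF then turns those runs into small indices, which is what makes the later entropy coding step more effective.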

apache-spark-docker

1 40 0.0 VBA

Dockerizing an Apache Spark Standalone Cluster

Learn how to dockerize an Apache Spark Standalone Cluster

docker-livy

1 11 0.0 HTML

Dockerizing and Consuming an Apache Livy environment

data-engineer-challenge

3 25 1.8 Python

Challenge Data Engineer

Design, Development and Deployment of a simple Data Pipeline

data-engineering-challenge-th

1 13 0.0 Python

Dockerizing a Python script for web scraping and consuming the scraped data using FastAPI (www.metroscubicos.com)

Dockerizing a Python Script for Faster Web Scraping

distance-metrics

1 4 3.6 Jupyter Notebook

Distance metrics are among the most important parts of some machine learning algorithms, both supervised and unsupervised. They help us calculate and measure similarities between numerical values expressed as data points.

Understanding Similarity Measures for Text Analysis
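
As an illustration of what such metrics compute, the sketch below implements three common ones in plain Python. The choice of Euclidean distance, Manhattan distance and cosine similarity, and the function names, are assumptions made for this example; the project itself may cover a different set.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    p, q = [0, 0], [3, 4]
    print(euclidean(p, q), manhattan(p, q), cosine_similarity([1, 2], [2, 4]))
```

Euclidean and Manhattan distance grow as points move apart, while cosine similarity compares direction only, which is why it is often preferred for text vectors of different lengths.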

recommendation-system

1 10 3.6 Python

Build a Content-Based Movie Recommender System (TF-IDF, BM25, BERT)

Learn how to build a content-based Movie Recommender System
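
A content-based recommender of this kind can be sketched with TF-IDF alone. The example below is a simplified, dependency-free illustration, not the project's code: the smoothed IDF formula, whitespace tokenization, the helper names, and the made-up movie titles and descriptions are all assumptions, and BM25 and BERT are not covered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(titles, descriptions, liked_title, k=2):
    """Rank the other movies by description similarity to the liked one."""
    vectors = tfidf_vectors(descriptions)
    liked = titles.index(liked_title)
    scored = [
        (cosine(vectors[liked], vec), title)
        for i, (vec, title) in enumerate(zip(vectors, titles))
        if i != liked
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]


# Hypothetical sample data, invented for this sketch.
TITLES = ["Orbit Patrol", "Nebula Run", "Small Town Hearts"]
DESCRIPTIONS = [
    "space battle with alien ships",
    "alien ships explore deep space",
    "romance in a small town",
]

if __name__ == "__main__":
    print(recommend(TITLES, DESCRIPTIONS, "Orbit Patrol", k=1))
```

Swapping TF-IDF for BM25 or sentence embeddings changes only how the vectors or scores are produced; the ranking step stays the same.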

text-analysis-speeches-amlo

1 8 3.2 Jupyter Notebook

Text analysis of the speeches, conferences and interviews of the current president of Mexico

A Text Analysis of Speeches

Dropout-Students-Prediction

1 18 10.0 HTML

The goal of this project is to identify students at risk of dropping out of school

Dropout Students Prediction

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

This page summarizes the projects mentioned and recommended in the original post on dev.to.
Tags: data-engineering, Big Data, Python, Bigquery, apache-airflow
Post date: 15 Jun 2022
