How I've implemented the Medallion architecture using Apache Spark and Apache Hdoop

This page summarizes the projects mentioned and recommended in the original post on dev.to

Nutrient – The #1 PDF SDK Library, trusted by 10K+ developers
Other PDF SDKs promise a lot - then break. Laggy scrolling, poor mobile UX, tons of bugs, and lack of support cost you endless frustrations. Nutrient’s SDK handles billion-page workloads - so you don’t have to debug PDFs. Used by ~1 billion end users in more than 150 different countries.
www.nutrient.io
featured
CodeRabbit: AI Code Reviews for Developers
Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.
coderabbit.ai
featured
  1. data-engineering-mini-projects

    I'm always learning. And I love learning through practice. So, I've created this repo to share with everybody what I learn.

    In this project, I'm exploring the Medallion Architecture which is a data design pattern that organizes data into different layers based on structure and/or quality. I'm creating a fictional scenario where a large enterprise that has several branches across the country. Each branch receives purchase orders from an app and deliver the goods to their customers. The enterprise wants to identify the branch that receives the most purchase requests and the branch that has the minimum average delivery time. To achieve that, I've used Apache Spark as a distributed compute engine and Apache Hadoop, in particular HDFS, as my data storage layer. Apache Spark ingest, processes, and stores the app's data on HDFS to be served to a custom dashboard app. You can find all about it, in this Github repo

  2. Nutrient

    Nutrient – The #1 PDF SDK Library, trusted by 10K+ developers. Other PDF SDKs promise a lot - then break. Laggy scrolling, poor mobile UX, tons of bugs, and lack of support cost you endless frustrations. Nutrient’s SDK handles billion-page workloads - so you don’t have to debug PDFs. Used by ~1 billion end users in more than 150 different countries.

    Nutrient logo
  3. PostgreSQL

    Mirror of the official PostgreSQL GIT repository. Note that this is just a *mirror* - we don't work with pull requests on github. To contribute, please see https://wiki.postgresql.org/wiki/Submitting_a_Patch

    I've simulated the retail app with a python script that randomly generates purchases and handle their lifecycle as NEW -> INTRANSIT -> DELIVERED. The app then stores that in a Postgresql database named retail that has two tables, orders and ordershistory. Separate users with appropriate privileges are created to the app to write the data to the database and to Spark, to read the data. The database is initiated with postgresql/init.sql script.

  4. superset

    Apache Superset is a Data Visualization and Data Exploration Platform

    Also, instead of the custom Dashboard app, a proper BI tool like Power BI, Tableau, Apache Superset, ..., etc. will be more powerful and flexible.

  5. Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    In this project, I'm exploring the Medallion Architecture which is a data design pattern that organizes data into different layers based on structure and/or quality. I'm creating a fictional scenario where a large enterprise that has several branches across the country. Each branch receives purchase orders from an app and deliver the goods to their customers. The enterprise wants to identify the branch that receives the most purchase requests and the branch that has the minimum average delivery time. To achieve that, I've used Apache Spark as a distributed compute engine and Apache Hadoop, in particular HDFS, as my data storage layer. Apache Spark ingest, processes, and stores the app's data on HDFS to be served to a custom dashboard app. You can find all about it, in this Github repo

  6. Apache Hadoop

    Apache Hadoop

    In this project, I'm exploring the Medallion Architecture which is a data design pattern that organizes data into different layers based on structure and/or quality. I'm creating a fictional scenario where a large enterprise that has several branches across the country. Each branch receives purchase orders from an app and deliver the goods to their customers. The enterprise wants to identify the branch that receives the most purchase requests and the branch that has the minimum average delivery time. To achieve that, I've used Apache Spark as a distributed compute engine and Apache Hadoop, in particular HDFS, as my data storage layer. Apache Spark ingest, processes, and stores the app's data on HDFS to be served to a custom dashboard app. You can find all about it, in this Github repo

  7. dagster

    An orchestration platform for the development, production, and observation of data assets.

    Instead of the custom orchestrator I used, a proper orchestration tool should replace it like Apache Airflow, Dagster, ..., etc.

  8. Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Instead of the custom orchestrator I used, a proper orchestration tool should replace it like Apache Airflow, Dagster, ..., etc.

  9. CodeRabbit

    CodeRabbit: AI Code Reviews for Developers. Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.

    CodeRabbit logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • Daft: Distributed DataFrame for Python

    2 projects | news.ycombinator.com | 29 Feb 2024
  • Daft: A High-Performance Distributed Dataframe Library for Multimodal Data

    1 project | news.ycombinator.com | 6 Jun 2023
  • Airflow's Problem

    6 projects | news.ycombinator.com | 2 Aug 2022
  • Discussion on Need of Feature Stores

    1 project | /r/mlops | 17 Jul 2022
  • DevOps Fundamentals for Deep Learning Engineers

    6 projects | /r/deeplearning | 20 Feb 2022

Did you know that Python is
the 2nd most popular programming language
based on number of references?