How to Use Apache Airflow to Get 1000+ Files From a Public Dataset

This page summarizes the projects mentioned and recommended in the original post on dev.to

  • data-engineering-zoomcamp

    Free Data Engineering course!

  • Parquet is a free and open-source columnar storage format backed by the Apache Software Foundation. It allows efficient compression and encoding within the Hadoop ecosystem, independent of any particular framework or programming language, which makes it a common format for public datasets; the OEDI PV dataset is in fact also available in .parquet form. However, in line with the corresponding DataTalks Club tutorials and videos, a CSV-to-Parquet conversion step is included here to demonstrate a wider variety of operations (a minimal conversion sketch is shown after this list). Otherwise, the .parquet files could be transferred directly to GCP using Airflow or any other tool.

  • DataEng_Zoomcamp

  • Airflow is a highly configurable tool, so its installation can be customized to the requirements of each specific application. It is also common practice to run it in containers to isolate it from system interactions and dependency conflicts. The brief guide presented here is based on the official installation guide, and the showcase was originally presented by DataTalks Club during the Data Engineering Zoomcamp (2022 cohort). The code base and the configuration files used in this tutorial are available here. The case in this tutorial aims to fetch the 1000+ files of the OEDI PV public dataset, convert them, and load them to GCP with an Airflow DAG (a minimal DAG sketch also follows this list).

  • Redis

    Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, Streams, HyperLogLogs, Bitmaps.

  • redis - The broker that forwards messages from the scheduler to the workers.

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • Dockerfile extending the official Airflow image:

    # First-time build can take up to 10 mins.
    FROM apache/airflow:2.2.3

    ENV AIRFLOW_HOME=/opt/airflow
    USER root
    RUN apt-get update -qq && apt-get install vim -qqq
    # git gcc g++ -qqq

    # Python dependencies used by the DAGs
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Install the Google Cloud SDK (used for transfers to GCP)
    # Ref: https://airflow.apache.org/docs/docker-stack/recipes.html
    SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

    ARG CLOUD_SDK_VERSION=322.0.0
    ENV GCLOUD_HOME=/home/google-cloud-sdk
    ENV PATH="${GCLOUD_HOME}/bin/:${PATH}"

    RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \
        && TMP_DIR="$(mktemp -d)" \
        && curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \
        && mkdir -p "${GCLOUD_HOME}" \
        && tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \
        && "${GCLOUD_HOME}/install.sh" \
            --bash-completion=false \
            --path-update=false \
            --usage-reporting=false \
            --quiet \
        && rm -rf "${TMP_DIR}" \
        && gcloud --version

    WORKDIR $AIRFLOW_HOME

    COPY scripts scripts
    RUN chmod +x scripts

    USER $AIRFLOW_UID
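After extending the image with the Dockerfile above, it is typically rebuilt with `docker compose build` (or `docker-compose build`) before the services are started. As mentioned in the Parquet item in the list, the downloaded CSV files are converted before being pushed to GCP. The sketch below shows one minimal way such a conversion can be done with pyarrow; the file path is illustrative and not taken from the tutorial's code base.

    # Minimal CSV-to-Parquet conversion sketch (illustrative path, assumes pyarrow is installed).
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    def csv_to_parquet(src_file: str) -> str:
        """Read a CSV file and write a .parquet file next to it; return the new path."""
        if not src_file.endswith(".csv"):
            raise ValueError(f"Expected a .csv file, got: {src_file}")
        table = pv.read_csv(src_file)                      # infer schema, load into an Arrow table
        dest_file = src_file.replace(".csv", ".parquet")
        pq.write_table(table, dest_file)                   # columnar, compressed output
        return dest_file

    if __name__ == "__main__":
        print(csv_to_parquet("/opt/airflow/data/example_station.csv"))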

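The download, conversion, and upload steps are orchestrated by a DAG. The actual DAG lives in the linked code base; the sketch below only illustrates, under assumed names (the DAG id, bucket, source URL, and staging directory are all hypothetical), how a download -> convert -> upload pipeline can be wired with the BashOperator, the PythonOperator, and the google provider's GCSHook (requires the apache-airflow-providers-google package).

    # Illustrative download -> convert -> upload DAG; all names and URLs are placeholders.
    from datetime import datetime

    import pyarrow.csv as pv
    import pyarrow.parquet as pq
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.hooks.gcs import GCSHook

    DATA_DIR = "/opt/airflow/data"                              # assumed local staging directory
    BUCKET = "example-oedi-pv-bucket"                           # hypothetical GCS bucket
    SOURCE_URL = "https://example.com/example_station.csv"      # hypothetical public file URL

    def csv_to_parquet(src_file: str) -> str:
        """Convert one CSV file to Parquet (same logic as the sketch above)."""
        table = pv.read_csv(src_file)
        dest_file = src_file.replace(".csv", ".parquet")
        pq.write_table(table, dest_file)
        return dest_file

    def upload_to_gcs(local_file: str, bucket: str) -> None:
        """Upload one local file to GCS via the google provider hook."""
        hook = GCSHook()
        hook.upload(bucket_name=bucket,
                    object_name=local_file.split("/")[-1],
                    filename=local_file)

    with DAG(
        dag_id="oedi_pv_ingestion_sketch",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@once",
        catchup=False,
    ) as dag:
        download = BashOperator(
            task_id="download_file",
            bash_command=f"curl -sSLf {SOURCE_URL} -o {DATA_DIR}/example_station.csv",
        )
        convert = PythonOperator(
            task_id="convert_to_parquet",
            python_callable=csv_to_parquet,
            op_kwargs={"src_file": f"{DATA_DIR}/example_station.csv"},
        )
        upload = PythonOperator(
            task_id="upload_to_gcs",
            python_callable=upload_to_gcs,
            op_kwargs={"local_file": f"{DATA_DIR}/example_station.parquet", "bucket": BUCKET},
        )

        download >> convert >> upload

Scaling this sketch from one file to the 1000+ files of the dataset would typically be done by generating the tasks (or the download command) from a list of file URLs; the linked code base shows the approach used in the original tutorial.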
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


Related posts