Parquet is a free and open-source columnar storage format backed by the Apache Software Foundation. It enables efficient compression and encoding within the Hadoop ecosystem, independent of frameworks or programming languages. It is therefore a common format for public datasets, and the OEDI PV dataset is in fact also available in .parquet format. However, in line with the corresponding DataTalks Club tutorials and videos, the conversion process is included here to demonstrate a variety of operations. Otherwise, it is possible to transfer the .parquet files directly to GCP using Airflow or any other tool.
Airflow is a highly configurable tool, so its installation can be customized to the requirements of each specific application. It is also common practice to host it in a container to isolate it from system interactions and dependency conflicts. The brief guide presented here is based on the official guide, and the showcase was originally presented by DataTalks Club during the Data Engineering Zoomcamp (2022 cohort). The code base and the configuration files used in this tutorial are available here. The case in this tutorial aims to
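For orientation, the docker-compose-based installation from the official guide boils down to a few setup commands. The sketch below pins Airflow 2.2.3 to match the Dockerfile later in this post; it requires Docker and Docker Compose on the host:

```shell
# Fetch the official compose file for the pinned Airflow version.
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.2.3/docker-compose.yaml'

# Create the directories that the compose file mounts into the containers,
# and record the host user id so created files are owned by the host user.
mkdir -p ./dags ./logs ./plugins
echo "AIRFLOW_UID=$(id -u)" > .env

# Run database migrations and create the first admin account.
docker-compose up airflow-init
```

After initialization, `docker-compose up` brings up the scheduler, webserver, workers, and their backing services.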
- redis: the broker that forwards messages from the scheduler to the workers.
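For reference, the redis service in the official Airflow docker-compose.yaml looks roughly like this (values taken from the 2.2.x compose file; treat them as indicative rather than exact):

```yaml
redis:
  image: redis:latest
  expose:
    - 6379
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 5s
    timeout: 30s
    retries: 50
  restart: always
```

The healthcheck matters because the Celery workers are configured to wait for a healthy broker before starting.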
```dockerfile
# First-time build can take up to 10 mins.
FROM apache/airflow:2.2.3

ENV AIRFLOW_HOME=/opt/airflow

USER root
RUN apt-get update -qq && apt-get install vim -qqq
# git gcc g++ -qqq

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html
SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

ARG CLOUD_SDK_VERSION=322.0.0
ENV GCLOUD_HOME=/home/google-cloud-sdk
ENV PATH="${GCLOUD_HOME}/bin/:${PATH}"

RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \
    && TMP_DIR="$(mktemp -d)" \
    && curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \
    && mkdir -p "${GCLOUD_HOME}" \
    && tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \
    && "${GCLOUD_HOME}/install.sh" \
       --bash-completion=false \
       --path-update=false \
       --usage-reporting=false \
       --quiet \
    && rm -rf "${TMP_DIR}" \
    && gcloud --version

WORKDIR $AIRFLOW_HOME

COPY scripts scripts
RUN chmod +x scripts

USER $AIRFLOW_UID
```
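This Dockerfile extends the stock Airflow image with the Google Cloud SDK so tasks can talk to GCP. To use it, either build the image directly (the tag below is illustrative) or point the compose file's service definitions at a `build: .` context instead of the `image:` line:

```shell
# Build the extended image from the Dockerfile above.
docker build -t airflow-gcp:2.2.3 .
```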