Tips for scalable workflows on AWS

This page summarizes the projects mentioned and recommended in the original post on dev.to

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • aws-genomics-workflows

    Discontinued Genomics Workflows on AWS

    There are a lot of tools built with C/C++ using glibc shared libraries. The AWS CLI v2 is one of these tools. It is common for workflow engines running on AWS to bind mount the AWS CLI from the host instance into the container so that it is available for interacting with other AWS services like staging data from Amazon S3. Challenges arise when a tooling container is based on an image without glibc shared libraries as is the case with ultra-minimal base images like alpine and busybox. You can still use these ultra-minimal images, but you need to take extra steps to ensure that glibc shared libraries are available. For example, the AWS CLI v2 is distributed with the shared libraries it needs, and to make it work on an alpine based container, you can modify the LD_LIBRARY_PATH environment variable in the container environment to point to where these shared libraries are installed.

  • htslib

    C library for high-throughput sequencing data formats

    In contrast, processing can start immediately and only transfer what is necessary if tooling can read bytes of data directly from Amazon S3. Tools based on htslib can do this, so you can run something like:

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

  • aws-sdk

    Landing page for the AWS SDKs on GitHub

    One common pattern to integrate with AWS from a workflow job is to call additional services using the AWS CLI. Overall, this works well, but there are a few considerations one should note when doing so. First and foremost, a workflow job needs to know where the AWS CLI installed and how to use it. You can do this by either installing the AWS CLI on the host compute and bind mounting it into the container job, or including the AWS CLI as part of the container image. That said, see my notes above on keeping container images small for associated caveats. Second, while the AWS CLI is great for scripting, for more complex operations direct integration via the AWS SDK is a better fit.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts