Downloading files from S3 with multithreading and Boto3

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • s3parcp

    Faster than s3cp

  • Yes, you would. The CPU and memory overhead of multiprocessing for this application is why we ended up migrating away from boto3 and to the AWS Go SDK for this specific purpose (https://github.com/chanzuckerberg/s3parcp as I mentioned in another comment). We still use boto3 in other areas, but for maxing out the network connection, golang is far more scalable.
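
For contrast with the multiprocessing overhead described above, here is a minimal sketch of the thread-based approach named in the page title. It assumes `client` behaves like a boto3 S3 client (boto3 documents its low-level clients as generally thread safe); `download_all`, the flattened file naming, and the default of 16 workers are illustrative choices, not taken from the original post.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(client, bucket, keys, dest_dir, max_workers=16):
    """Download every key in `keys` from `bucket` into `dest_dir` using threads.

    `client` is expected to behave like a boto3 S3 client, for example
    boto3.client("s3"); only its download_file(Bucket, Key, Filename)
    method is used here.
    """
    os.makedirs(dest_dir, exist_ok=True)

    def fetch(key):
        # Flatten the key into one file name (an assumption of this sketch).
        path = os.path.join(dest_dir, key.replace("/", "_"))
        client.download_file(bucket, key, path)
        return key, path

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, key) for key in keys]
        for future in as_completed(futures):
            key, path = future.result()  # re-raises any download error
            results[key] = path
    return results
```

With boto3 installed, a call might look like `download_all(boto3.client("s3"), "my-bucket", ["logs/a.txt"], "/tmp/out")`, where the bucket and key names are placeholders. Threads share one client and one process, which avoids the per-process memory cost, though a single Python process may still fall short of saturating a fast network link.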

  • s3pd

    S3 parallel downloader with good performance

  • That's what my small Python package does (https://github.com/NewbiZ/s3pd). It's definitely not perfect (it doesn't saturate the cores because it doesn't run an event loop per process), but the download is split across multiple processes and stored in shared memory.

    I've been able to saturate 20GB NICs on EC2 with it (32 cores).
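
The idea of splitting one download into parts can be sketched as concurrent ranged GET requests. This is not s3pd's code: s3pd uses multiple processes and shared memory, while this sketch uses threads and an in-memory `bytearray` to stay short. It assumes `client` behaves like a boto3 S3 client; `byte_ranges`, `ranged_download`, and the default chunk size are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size, chunk_size):
    """Return inclusive (start, end) byte offsets covering `size` bytes."""
    return [
        (start, min(start + chunk_size, size) - 1)
        for start in range(0, size, chunk_size)
    ]


def ranged_download(client, bucket, key, chunk_size=8 * 1024 * 1024, max_workers=8):
    """Fetch one object as concurrent ranged GETs and return its bytes."""
    size = client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    buffer = bytearray(size)

    def fetch(span):
        start, end = span
        response = client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}"
        )
        # Each worker writes to its own disjoint slice of the shared buffer.
        buffer[start:end + 1] = response["Body"].read()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consuming the iterator surfaces any exception raised by a worker.
        list(pool.map(fetch, byte_ranges(size, chunk_size)))
    return bytes(buffer)
```

For example, `byte_ranges(10, 4)` yields `[(0, 3), (4, 7), (8, 9)]`. Holding the whole object in memory only suits objects that fit in RAM; writing each range to a preallocated file at its offset would be the natural variation for larger objects.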

  • aws-cli

    Universal Command Line Interface for Amazon Web Services

  • It is directly checking for s3 to s3[1] and indicates that it wants to copy...

    I've read over it and I'm reasonably sure that it's going to issue CopyObject, but it would take me actually getting out paper and pen to really track it down.

    The AWS CLI and Boto are a case study in overdoing class hierarchies. Not because there's any obvious AbstractSingletonProxyFactoryBean, but rather that there's no instance that stands out as "this is where they went wrong" and nevertheless the end result is a confusing mess of inheritance and objects.

    [1]: https://github.com/aws/aws-cli/blob/45b0063b2d0b245b17a57fd9...
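
For reference, this is roughly what a direct CopyObject request looks like through boto3. It says nothing about what the AWS CLI does internally, which the comment above leaves open. The sketch assumes `client` behaves like a boto3 S3 client; `server_side_copy` is a hypothetical helper name.

```python
def server_side_copy(client, src_bucket, src_key, dst_bucket, dst_key):
    """Copy an object between S3 locations without downloading it locally.

    `client` is expected to behave like boto3.client("s3"). The data stays
    inside S3; only the request and response travel over the network.
    """
    return client.copy_object(
        Bucket=dst_bucket,
        Key=dst_key,
        CopySource={"Bucket": src_bucket, "Key": src_key},
    )
```

A single CopyObject request is limited to objects of up to 5 GB; larger objects need a multipart copy, which boto3's managed `client.copy` method is designed to handle.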

  • s5cmd

    Parallel S3 and local filesystem execution tool.

  • Excellent walkthrough, love boto. We’ve recently been using s5cmd which we’ve found is ridiculously faster than boto without any extra boto tricks.

    https://github.com/peak/s5cmd

  • metaflow

    :rocket: Build and manage real-life ML, AI, and data science projects with ease!

  • It is a bit of a hidden gem but Metaflow includes a Boto-based highly parallelized, error-tolerant S3 client that Netflix uses routinely to get 10-20Gbps throughput between EC2 and S3.

    Technically it is independent from Metaflow, so you could use it as a stand-alone, high-performance S3 client.

    See docs here https://docs.metaflow.org/metaflow/data#store-and-load-objec...

    And code here https://github.com/Netflix/metaflow/tree/master/metaflow/dat...

    (I wrote it originally - AMA if curious)

  • botocore

    The low-level, core functionality of boto3 and the AWS CLI.

  • In this case I just wanted to keep it simple and do it the "native" way. I've been hearing a lot about aioboto3 and I'll surely check it out. However, it seems to have some limitations like https://github.com/boto/botocore/issues/458

  • aws-sdk-go

    AWS SDK for the Go programming language.

  • I'm not aware of a supported API in the golang SDK for this, but there is a sketch here: https://github.com/aws/aws-sdk-go/tree/main/example/service/...

  • rclone

    "rsync for cloud storage" - Google Drive, S3, Dropbox, Backblaze B2, One Drive, Swift, Hubic, Wasabi, Google Cloud Storage, Yandex Files

  • RClone does all that you require and much more.

    RClone: https://rclone.org/

  • s4cmd

    Super S3 command line tool

  • No mention of how it compares to s4cmd (which, like s3cmd, is written in Python).

    https://github.com/bloomreach/s4cmd

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
