Yes, you would. The CPU and memory overhead of multiprocessing for this application is why we ended up migrating from boto3 to the AWS Go SDK for this specific purpose (https://github.com/chanzuckerberg/s3parcp, as I mentioned in another comment). We still use boto3 in other areas, but for maxing out the network connection, Go is far more scalable.
That's what my small Python package does (https://github.com/NewbiZ/s3pd). Definitely not perfect (it doesn't saturate cores, since there's no event loop per process), but the download is split across multiple processes and stored in shared memory.
I've been able to saturate 20 Gb NICs on EC2 with it (32 cores).
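For reference, the byte-range splitting at the heart of this kind of parallel downloader can be sketched in pure Python. The `fetch_range` stub and `DATA` below are placeholders standing in for a real boto3 ranged `get_object` call against an S3 object, not s3pd's actual code:

```python
# Sketch of the multi-process, byte-range download pattern.
# `fetch_range` stands in for a boto3 ranged GET, e.g.:
#   s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")["Body"].read()
from multiprocessing import Pool

def split_ranges(size: int, workers: int) -> list[tuple[int, int]]:
    """Split `size` bytes into inclusive (start, end) ranges, one per worker."""
    chunk = -(-size // workers)  # ceiling division
    return [(i, min(i + chunk, size) - 1) for i in range(0, size, chunk)]

DATA = bytes(range(256)) * 4  # 1024 bytes of fake object data

def fetch_range(rng: tuple[int, int]) -> bytes:
    # Placeholder: a real implementation would issue a ranged S3 GET here.
    start, end = rng
    return DATA[start:end + 1]

if __name__ == "__main__":
    ranges = split_ranges(len(DATA), workers=4)
    with Pool(4) as pool:
        parts = pool.map(fetch_range, ranges)  # one ranged GET per process
    assert b"".join(parts) == DATA
```

Each process fetches its own byte range independently, so throughput scales with worker count until the NIC (or S3's per-connection limits) becomes the bottleneck.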
It directly checks for s3-to-s3 transfers [1] and indicates that it wants to copy...
I've read over it and I'm reasonably sure that it's going to issue CopyObject, but it would take me actually getting out paper and pen to really track it down.
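For context, a server-side s3-to-s3 copy via CopyObject keeps the bytes inside AWS rather than routing them through the client. A minimal sketch of the call shape with boto3 (the helper below just builds the kwargs; bucket and key names are made up):

```python
# Sketch: server-side S3 copy. boto3's copy_object accepts a CopySource
# dict naming the source bucket/key; the data never transits the client.
def copy_object_kwargs(src_bucket: str, src_key: str,
                       dst_bucket: str, dst_key: str) -> dict:
    return {
        "Bucket": dst_bucket,
        "Key": dst_key,
        "CopySource": {"Bucket": src_bucket, "Key": src_key},
    }

# With a real client (assumed boto3 environment and credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.copy_object(**copy_object_kwargs("src-bucket", "a.bin",
#                                       "dst-bucket", "a.bin"))
```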
The AWS CLI and Boto are a case study in overdoing class hierarchies. Not because there's any obvious AbstractSingletonProxyFactoryBean, but rather that there's no instance that stands out as "this is where they went wrong" and nevertheless the end result is a confusing mess of inheritance and objects.
[1]: https://github.com/aws/aws-cli/blob/45b0063b2d0b245b17a57fd9...
Excellent walkthrough, love boto. We’ve recently been using s5cmd which we’ve found is ridiculously faster than boto without any extra boto tricks.
https://github.com/peak/s5cmd
It is a bit of a hidden gem, but Metaflow includes a Boto-based, highly parallelized, error-tolerant S3 client that Netflix uses routinely to get 10-20 Gbps throughput between EC2 and S3.
Technically it is independent from Metaflow, so you could use it as a stand-alone, high-performance S3 client.
See docs here https://docs.metaflow.org/metaflow/data#store-and-load-objec...
And code here https://github.com/Netflix/metaflow/tree/master/metaflow/dat...
(I wrote it originally - AMA if curious)
In this case I just wanted to keep it simple and do it the "native" way. I've been hearing a lot about aioboto3 and I'll surely check it out. However, it seems to have some limitations like https://github.com/boto/botocore/issues/458
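For context, the pattern aioboto3 enables is many concurrent GETs on a single event loop instead of one process per connection. A minimal stdlib sketch of that fan-out (the `fetch` coroutine is a stand-in for an awaited `get_object` call, not real aioboto3 API):

```python
# Sketch of event-loop concurrency for S3 GETs. `fetch` is a placeholder;
# a real version would `await s3.get_object(...)` on an aioboto3 client.
import asyncio

async def fetch(key: str) -> bytes:
    await asyncio.sleep(0)  # yield to the loop, as real network I/O would
    return key.encode()     # placeholder for the object body

async def fetch_all(keys: list[str]) -> list[bytes]:
    # gather() runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*(fetch(k) for k in keys))

bodies = asyncio.run(fetch_all(["a", "b", "c"]))
```

Because a single process multiplexes all connections, this avoids the per-process CPU and memory overhead discussed upthread, at the cost of being bound to one core.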
I'm not aware of a supported API in the golang SDK for this, but there is a sketch here: https://github.com/aws/aws-sdk-go/tree/main/example/service/...
RClone does all that you require and much more.
RClone: https://rclone.org/
No mention of how it compares to s4cmd (which, like s3cmd, is Python).
https://github.com/bloomreach/s4cmd