It uses FUSE, and there are three types of kernel cache you can use with FUSE (although it seems gcsfuse exposes only one):
1. Cache of file attributes in the kernel (controlled by the "stat-cache-ttl" value - https://github.com/GoogleCloudPlatform/gcsfuse/blob/7dc5c7ff...)
If you really expect a file-system experience over GCS, please try JuiceFS [1], which scales to 10 billion files pretty well with TiKV or FoundationDB as the metadata engine.
PS, I'm founder of JuiceFS.
[1] https://github.com/juicedata/juicefs
I don't think it is; instead, each operation makes a request. You can use something like catfs: https://github.com/kahing/catfs
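To illustrate the idea, here's a toy Python sketch of the read-through caching that catfs layers on top of a slow backing store (the class and directory layout are made up for the example; catfs itself works at the FUSE layer and adds eviction, consistency handling, etc.):

```python
import os
import shutil
import tempfile

class ReadThroughCache:
    """Toy read-through cache: serve reads from a local cache directory,
    touching the slow backing store only on a miss."""

    def __init__(self, backing_dir, cache_dir):
        self.backing_dir = backing_dir
        self.cache_dir = cache_dir
        self.misses = 0
        os.makedirs(cache_dir, exist_ok=True)

    def read(self, name):
        cached = os.path.join(self.cache_dir, name)
        if not os.path.exists(cached):
            # Miss: one "remote" request, then keep a local copy.
            shutil.copyfile(os.path.join(self.backing_dir, name), cached)
            self.misses += 1
        with open(cached, "rb") as f:  # Hit: purely local I/O.
            return f.read()

# Demo: the second read of the same file never touches the backing store.
backing = tempfile.mkdtemp()
cache = tempfile.mkdtemp()
with open(os.path.join(backing, "blob.bin"), "wb") as f:
    f.write(b"hello")

c = ReadThroughCache(backing, cache)
first = c.read("blob.bin")
second = c.read("blob.bin")
```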
https://github.com/Azure/azure-storage-fuse
It has some nice features, like streaming with block-level caching for fast read-only access.
Why not rclone? It was discussed here yesterday as a replacement for sshfs - and supports GCS as well as dozens of other backends.
https://rclone.org/
mountpoint-s3 is AWS’ first party solution for mounting s3 buckets as file systems: https://github.com/awslabs/mountpoint-s3
Haven’t used it but it looks cool, if a bit immature.
Hah nice! I developed https://github.com/ahmetb/azurefs back in 2012 when I was about to join Azure. I'm glad Azure actually provides a supported and actively-maintained tool for this.
FUSE does not work well with a large number of small files (due to heavy metadata operations such as inode/dentry lookups).
ExtFUSE (FUSE optimized with eBPF) [1] can offer high performance: it caches metadata in the kernel to avoid lookups in user space.
1. https://github.com/extfuse/extfuse
That is not how you would want to do it for a typical ML workload, where the samples have to be randomly permuted each epoch.
Instead, tar up the files in some random order and put the tar file on a web server or in a bucket, then stream them in during the first epoch while keeping track of their byte offsets in the tar file, which you cache locally (assuming ample local flash storage). Then permute the list of offsets and use those when reading samples for the next epoch.
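A minimal Python sketch of the offset-index idea (the helper names are mine, and a real pipeline would stream the tar over HTTP range requests rather than read a local file):

```python
import io
import os
import random
import tarfile
import tempfile

def build_offset_index(tar_path):
    """First (streaming) pass: record each member's data offset and size."""
    index = {}
    with tarfile.open(tar_path, "r:") as tf:
        for member in tf:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_sample(tar_path, offset, size):
    """Later epochs: random-access read of one sample via its cached offset."""
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)

# Demo: build a tiny uncompressed tar, index it, then read in permuted order.
samples = {f"sample_{i}.bin": bytes([i]) * 8 for i in range(4)}
fd, path = tempfile.mkstemp(suffix=".tar")
os.close(fd)
with tarfile.open(path, "w") as tf:
    for name, data in samples.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

index = build_offset_index(path)
order = list(index)
random.shuffle(order)  # a fresh permutation each epoch
epoch = {name: read_sample(path, *index[name]) for name in order}
```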
If you only have local HDD then you will need a more advanced data structure like the one provided by https://github.com/jacobgorm/mindcastle.io , which will allow you to write out permuted samples at close to disk sequential write bandwidth. See my talk at USENIX Vault 2019 for a full explanation, linked from https://vertigo.ai/mindcastle/
In case you're interested in scale-to-zero database hosting: a few months ago I paired gcsfuse with Seafowl [0][1], an early-stage open-source database written in Rust. It was a lot of fun balancing tradeoffs that are usually not possible with classical databases, e.g. Postgres. Thank you, gcsfuse contributors.
[0] https://seafowl.io
You may wish to investigate Cloudflare's image API: https://developers.cloudflare.com/images/cloudflare-images/
If the reason you were unable to use a CDN cache was because your access patterns require a lot of varying end serializations (due to things like image manipulation, resizing, cropping, watermarking, etc.), then this API could be a huge money saver for you. It was for me.
OTOH, if the cost was because compute isn't free and the corresponding Cloudflare Worker compute cost is too much, then yeah, that's a tough one... I don't have a packaged answer for you, but I would investigate something like ThumbHash: https://evanw.github.io/thumbhash/ - my intuition is that you can probably serve a highly optimized/interlaced/"hashed" placeholder. The advantage of ThumbHash here is that you can make the access pattern less spendy simply by storing all of your hashes in an optimized way, since they are extremely small - small enough to be included in an index for index-only scans ("covering indexes").
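As a rough sketch of the "small enough to live in a covering index" point, here it is in Python with SQLite (the schema is illustrative, and the blob is a made-up stand-in for a real ThumbHash, which is on the order of tens of bytes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative schema: the tiny placeholder hash sits next to the lookup key,
# so an index on (url, thumbhash) can answer placeholder queries without
# ever touching the table itself (an index-only / "covering" scan).
con.execute("CREATE TABLE images (url TEXT, thumbhash BLOB)")
con.execute("CREATE INDEX idx_url_thumb ON images (url, thumbhash)")
con.execute(
    "INSERT INTO images VALUES (?, ?)",
    ("cat.jpg", bytes.fromhex("93c8059a854890878bb7887a77988717")),
)

# SQLite's query plan notes the covering scan in its detail column.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT thumbhash FROM images WHERE url = ?",
    ("cat.jpg",),
).fetchone()[-1]

row = con.execute(
    "SELECT thumbhash FROM images WHERE url = ?", ("cat.jpg",)
).fetchone()
```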