Gcsfuse: A user-space file system for interacting with Google Cloud Storage

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • gcsfuse

    A user-space file system for interacting with Google Cloud Storage

  • It uses FUSE, and there are three types of kernel cache you can use with FUSE (although it seems gcsfuse exposes only one):

    1. Cache of file attributes in the kernel (this is controlled by the "stat-cache-ttl" value - https://github.com/GoogleCloudPlatform/gcsfuse/blob/7dc5c7ff...)
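
    For anyone who wants to poke at that knob, here's a minimal sketch (Python; it assumes gcsfuse is installed and credentials are set up - the bucket name, mount point, and TTL value are placeholders):

      import subprocess

      # Raise the kernel attribute-cache TTL discussed above. Newer gcsfuse
      # releases expose similar knobs through --metadata-cache-* flags.
      subprocess.run(
          ["gcsfuse", "--stat-cache-ttl", "60s", "my-bucket", "/mnt/gcs"],
          check=True,
      )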

  • juicefs

    JuiceFS is a distributed POSIX file system built on top of Redis and S3.

  • If you really expect a file system experience over GCS, please try JuiceFS [1], which scales to 10 billion files pretty well with TiKV or FoundationDB as the metadata engine (a rough setup sketch follows below).

    PS: I'm the founder of JuiceFS.

    [1] https://github.com/juicedata/juicefs
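
    For the curious, a rough sketch of what such a setup can look like (Python; it assumes the juicefs binary is installed and a TiKV cluster is reachable - all hosts, names, and the bucket are placeholders, so check the JuiceFS docs for the exact flags your version supports):

      import subprocess

      meta_url = "tikv://tikv-node:2379/myjfs"  # TiKV as the metadata engine

      # Create a volume whose data lives in a GCS bucket.
      subprocess.run(
          ["juicefs", "format", "--storage", "gs",
           "--bucket", "my-bucket", meta_url, "myjfs"],
          check=True,
      )

      # Mount it like a regular POSIX file system.
      subprocess.run(["juicefs", "mount", meta_url, "/mnt/jfs"], check=True)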

  • catfs

    Cache AnyThing filesystem written in Rust

  • I don't think it is; instead, each operation makes a request. You can use something like catfs: https://github.com/kahing/catfs
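
    Roughly, catfs layers a local cache directory over a slow mount (a hypothetical sketch in Python; the paths are placeholders, and the positional-argument order - source, cache dir, mount point - should be verified against the catfs README):

      import subprocess

      # Serve repeat reads from /var/cache/catfs; cache misses fall
      # through to the underlying gcsfuse mount at /mnt/gcs-raw.
      subprocess.run(
          ["catfs", "/mnt/gcs-raw", "/var/cache/catfs", "/mnt/gcs"],
          check=True,
      )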

  • azure-storage-fuse-aur

    AUR package for Azure Storage Blobfuse

  • https://github.com/Azure/azure-storage-fuse

    It has some nice features, like streaming with block-level caching for fast read-only access.

  • rclone

    "rsync for cloud storage" - Google Drive, S3, Dropbox, Backblaze B2, One Drive, Swift, Hubic, Wasabi, Google Cloud Storage, Yandex Files

  • Why not rclone? It was discussed here yesterday as a replacement for sshfs, and it supports GCS as well as dozens more backends.

    https://rclone.org/
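
    A minimal sketch of the equivalent rclone mount (Python; it assumes rclone is installed and a GCS remote named "gcs" was already set up via `rclone config` - the bucket and mount point are placeholders):

      import subprocess

      # --vfs-cache-mode full caches reads and writes locally, which makes
      # the mount behave much more like a normal file system.
      subprocess.run(
          ["rclone", "mount", "gcs:my-bucket", "/mnt/gcs",
           "--vfs-cache-mode", "full"],
          check=True,
      )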

  • mountpoint-s3

    A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.

  • mountpoint-s3 is AWS’s first-party solution for mounting S3 buckets as file systems: https://github.com/awslabs/mountpoint-s3

    Haven’t used it, but it looks cool, if a bit immature.
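
    Usage appears to be pleasantly small (a sketch in Python; it assumes the mount-s3 binary is installed and AWS credentials are available - bucket and mount point are placeholders):

      import subprocess

      # Mount an S3 bucket at /mnt/s3.
      subprocess.run(["mount-s3", "my-bucket", "/mnt/s3"], check=True)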

  • s3fs

    S3 Filesystem

  • s3fs-fuse

    FUSE-based file system backed by Amazon S3

  • azurefs

    Mount Microsoft Azure Blob Storage as local filesystem in Linux (inactive)

  • Hah, nice! I developed https://github.com/ahmetb/azurefs back in 2012, when I was about to join Azure. I'm glad Azure actually provides a supported and actively maintained tool for this.

  • extfuse

    Extension Framework for FUSE

  • FUSE does not work well with a large number of small files (due to heavy metadata operations such as inode/dentry lookups).

    ExtFUSE (FUSE optimized with eBPF) [1] can offer high performance: it caches metadata in the kernel to avoid lookups in user space.

    1. https://github.com/extfuse/extfuse

  • mindcastle.io

    Massively scalable, cloud-backed distributed block device for Linux and VMs

  • It is not how you would want to do it for a typical ML workload, where the samples have to be randomly permuted each epoch.

    Instead, tar up the files in some random order and put the tar file on a web server or in a bucket, then stream them in during the first epoch while keeping track of their byte offsets in the tar file, which you cache locally, assuming ample local flash storage. Then permute the list of offsets and use those when reading samples for the next epoch (see the sketch below).

    If you only have a local HDD, then you will need a more advanced data structure like the one provided by https://github.com/jacobgorm/mindcastle.io , which will allow you to write out permuted samples at close to sequential disk write bandwidth. See my talk at USENIX Vault 2019 for a full explanation, linked from https://vertigo.ai/mindcastle/
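
    A minimal sketch of that offset-index trick in Python (it assumes a local copy of the shard; in practice you'd stream it over HTTP during the first epoch, and the file name is a placeholder):

      import random
      import tarfile

      # Epoch 1: read sequentially, remembering each member's byte offset.
      index = []  # (offset of the payload within the tar, size) per sample
      with tarfile.open("shard.tar") as tar:
          for member in tar:
              if member.isfile():
                  index.append((member.offset_data, member.size))
                  # ...train on tar.extractfile(member).read() here...

      # Later epochs: permute the offsets and seek straight to each sample,
      # without re-parsing any tar headers.
      random.shuffle(index)
      with open("shard.tar", "rb") as f:
          for offset, size in index:
              f.seek(offset)
              sample = f.read(size)
              # ...train on sample...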

  • seafowl

    Analytical database for data-driven Web applications 🪶

  • In case you're interested in scale-to-zero database hosting: a few months ago I paired gcsfuse with Seafowl [0][1], an early-stage open-source database written in Rust. It was a lot of fun balancing tradeoffs that are usually not possible with classical databases, e.g. Postgres. Thank you, gcsfuse contributors.

    [0] https://seafowl.io

  • thumbhash

    A very compact representation of an image placeholder

  • You may wish to investigate Cloudflare's image API: https://developers.cloudflare.com/images/cloudflare-images/

    If the reason you were unable to use a CDN cache was because your access patterns require a lot of varying end serializations (due to things like image manipulation, resizing, cropping, watermarking, etc.), then this API could be a huge money saver for you. It was for me.

    OTOH, if the cost was because compute isn't free and the corresponding Cloudflare Worker compute cost is too much, then yeah, that's a tough one... I don't have a packaged answer for you, but I would investigate something like ThumbHash: https://evanw.github.io/thumbhash/ - my intuition is that you can probably serve some highly optimized/interlaced/"hashed" placeholder. The advantage of ThumbHash here is that you can make the access pattern less spendy by storing all of your hashes in an optimized way, since they are extremely small - small enough to be included in an index for index-only scans ("covering indexes").
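
    To make that last point concrete, a hypothetical sketch of the covering-index idea (Python/SQLite; the schema, key, and placeholder bytes are all made up):

      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute(
          "CREATE TABLE images (id TEXT PRIMARY KEY, url TEXT, thumbhash BLOB)"
      )
      # The placeholder bytes are tiny (a few dozen bytes for a ThumbHash),
      # so they fit comfortably in an index alongside the key.
      conn.execute("CREATE INDEX idx_placeholder ON images (id, thumbhash)")
      conn.execute(
          "INSERT INTO images VALUES (?, ?, ?)",
          ("cat.jpg", "https://example.com/cat.jpg", b"\x93\x1a\x0c"),
      )
      # This query can be answered from idx_placeholder alone - an
      # index-only ("covering index") scan that never touches the table.
      print(conn.execute(
          "SELECT thumbhash FROM images WHERE id = ?", ("cat.jpg",)
      ).fetchone()[0])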
