Considering a local k/v database: as far as I remember, sled (https://github.com/spacejam/sled) is nice, and it seems to be async-friendly as well. It's been a while since I last looked at embedded key/value stores in Rust, though, so I'm not sure whether it's still what you would use in 2021.
You may be able to take inspiration from mini-redis, which is a learning resource created by the Tokio project. Its purpose is to show off many common patterns in async Rust, and a shared hashmap is one of them.
Now, instead of collecting all of our futures into a single collection, we spawn them onto the tokio runtime. If you're using tokio with the multi-threaded scheduler, i.e. the rt-multi-thread feature flag, then multiple futures will run in parallel across worker threads. Finally, as futures complete, they send their results onto a channel, which we receive from and insert the values into our map. As far as I know, this is (at a high level) the fastest way to download a few thousand CSVs.
Note: What about using a thread pool instead; do we even need async? While a thread pool would be an improvement over your current approach (and is easy to scale using rayon), in this situation it would be sub-optimal if the round-trip time (RTT) for each request is long. There is a portion of time where your computer is doing nothing: as the request gets routed to S3, and as S3 gathers the data to send back, your computer sits idle. With an async approach, your computer can send more requests during that idle time.
Note: What about a disk-backed hashmap? You also mentioned maybe needing a local DB to store this data. I don't know the best approach here, but my first intuition would be to try SQLite with sqlx.