I got run times for the simplest single-threaded directory walk that were only 1.8x slower than git ls-files. The code is at https://github.com/c-blake/cligen/blob/master/cligen/dents.n... (plain `dents find` does not require the special kernel batch system call module to be fast.)
I believe that GNU find is slow because it is specifically written to allow arbitrary filesystem depth as opposed to "open file descriptor limit-limited depth".
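(To make the trade-off concrete: the simplest recursive walk keeps one directory handle open per level of recursion, so its maximum depth is capped by the process's open file descriptor limit. A minimal Rust sketch of that naive approach, not code from either tool:

    use std::fs;
    use std::io;
    use std::path::Path;

    // Naive recursive walk: each level of recursion holds its ReadDir
    // iterator (and the directory fd underneath it) open until the
    // whole subtree below has been visited, so a tree deeper than the
    // fd limit (often 1024) makes read_dir fail with EMFILE.
    fn walk(dir: &Path) -> io::Result<()> {
        for entry in fs::read_dir(dir)? {
            let entry = entry?;
            let path = entry.path();
            if entry.file_type()?.is_dir() {
                walk(&path)?; // parent's fd stays open across this call
            } else {
                println!("{}", path.display());
            }
        }
        Ok(())
    }

    fn main() -> io::Result<()> {
        walk(Path::new("."))
    }

Supporting arbitrary depth means giving up that one-fd-per-level structure, e.g. by closing and reopening directories, which is extra work on every descent.)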
Meanwhile, I think the Rust fd is slow because of (probably counterproductive) multi-threading (at least it does 11,000 calls to futex).
> I believe that GNU find is slow because it is specifically written to allow arbitrary filesystem depth as opposed to "open file descriptor limit-limited depth".
I haven't benchmarked find specifically, but I believe the most common Rust library for this purpose, walkdir[1], also allows arbitrary file system recursion depth, and is extremely fast (a usage sketch follows the footnote). It was fairly close in speed to some "naive" limited-depth code I wrote in C for the same purpose.
I'd be curious to see benchmarks of whether this actually makes a difference.
[1] https://github.com/BurntSushi/walkdir
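For reference, walkdir bounds open file descriptors rather than recursion depth: its max_open option caps how many directory handles stay open at once, and directories beyond that cap are closed and reopened as the walk proceeds. A small usage sketch, assuming walkdir 2.x as a dependency; the path and max_open value are illustrative:

    use walkdir::WalkDir;

    fn main() {
        // Arbitrary depth with a bounded number of open directory
        // handles: max_open caps simultaneously open fds, trading a
        // little speed for immunity to fd-limit exhaustion.
        for entry in WalkDir::new(".").max_open(10) {
            match entry {
                Ok(e) => println!("{}", e.path().display()),
                Err(err) => eprintln!("walkdir error: {}", err),
            }
        }
    }

So the "arbitrary depth" design cost the original comment describes is real, but it need not be paid on every directory, only once the handle budget is exhausted.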
I'm absolutely not an expert, but I feel like log-structured filesystems (https://en.wikipedia.org/wiki/Log-structured_file_system) are a natural fit for this kind of thing: an index "just" has to read the most recently written entries.
But if we're talking about the future, we're probably talking about btrfs and zfs, both of which have the internal machinery to give you a feed of recently changed files going back to the beginning of the filesystem (e.g. btrfs subvolume find-new against a stored generation number, or zfs diff between snapshots).
While writing this answer I stumbled upon https://github.com/rflament/loggedfs, which is probably a very nice solution to this problem.