Amdahl's Law will catch up with you really fast as you add threads with this strategy, but it's simple and is amenable to formats where you may have a delimiter in the middle of a record. For situations where you need maximum scaling and records can't contain stray delimiters, you can use the strategy I used to implement a faster numpy.loadtxt: https://github.com/saethlin/loadtxt/blob/master/src/inner.rs#L84 The general idea is that you divide the file among threads by splitting it at byte offsets, then seeking forward from each offset to the next record boundary. This gets you non-overlapping sections, so there's no duplicate parsing.
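To make the splitting idea concrete, here is a minimal sketch of it, not the code from the linked repo. It assumes newline-delimited records with no newlines inside a record, and it assumes the whole file is already in memory as a byte slice. The names `chunk_ranges` and `parallel_sum` are made up for illustration.

```rust
use std::thread;

/// Split `data` into at most `n_chunks` contiguous ranges.
/// Every range ends just after a `delim` byte, or at the end of the data.
fn chunk_ranges(data: &[u8], n_chunks: usize, delim: u8) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    if data.is_empty() || n_chunks == 0 {
        return ranges;
    }
    // Approximate chunk size in bytes, rounded up.
    let approx = (data.len() + n_chunks - 1) / n_chunks;
    let mut start = 0;
    while start < data.len() {
        let tentative = (start + approx).min(data.len());
        // Seek forward from the tentative split point to the next record boundary.
        let end = if tentative >= data.len() {
            data.len()
        } else {
            match data[tentative..].iter().position(|&b| b == delim) {
                Some(off) => tentative + off + 1,
                None => data.len(),
            }
        };
        ranges.push((start, end));
        start = end;
    }
    ranges
}

/// Toy "parser": one integer per line, each chunk summed on its own thread.
fn parallel_sum(data: &[u8], n_threads: usize) -> i64 {
    let ranges = chunk_ranges(data, n_threads, b'\n');
    thread::scope(|s| {
        let handles: Vec<_> = ranges
            .iter()
            .map(|&(a, b)| {
                let chunk = &data[a..b];
                s.spawn(move || {
                    // Splitting on an ASCII byte keeps each chunk valid UTF-8
                    // as long as the whole input is valid UTF-8.
                    std::str::from_utf8(chunk)
                        .expect("input is assumed to be UTF-8")
                        .lines()
                        .filter_map(|l| l.trim().parse::<i64>().ok())
                        .sum::<i64>()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<i64>()
    })
}
```

Because each range starts exactly where the previous one ended, no record is parsed twice and none is skipped. You may get fewer ranges than requested when records are long relative to the chunk size. A real implementation would probably mmap the file or seek within it rather than read everything up front.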
I don't use Arrow's CSV parser. This is the code I'm talking about: https://github.com/ritchie46/polars/blob/master/polars/polars-io/src/fork/csv.rs
It looks like jemalloc will use madvise where appropriate to tell the OS it doesn't need pages resident in memory. Ctrl-F for MADV_DONTNEED: https://github.com/jemalloc/jemalloc/blob/a943172b732e65da34a19469f31cd3ec70cf05b0/src/pages.c