The core of their multiplication is in
https://github.com/HazyResearch/blocking-tutorial/blob/maste...
Where they implement the matrix inner product.
And their example implementations using that start at
https://github.com/HazyResearch/blocking-tutorial/blob/maste...
The inner multiply is all AVX2 (with the FMA extension), so it probably gets some of its LoC win from not having to handle different hardware capabilities or be portable, but it is still impressive.
It is also worth noting that the inner product is written with C++ templates - it could probably be turned into a horrific macro to make it pure C, but I'd say it isn't worth it. But then I prefer C++ to C :D
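To illustrate why templates suit this kind of inner kernel, here is a minimal sketch. It is not the tutorial's code: `micro_kernel` and `matmul_tiled` are hypothetical names, it uses plain scalar C++ rather than AVX2 intrinsics, and it assumes the matrix sizes are multiples of the tile sizes. The point is only that `MR` and `NR` are compile-time constants, so the compiler can unroll the inner loops without any macro tricks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical register-blocked micro-kernel: C_tile += A_tile * B_tile.
// MR and NR are template parameters, so both inner loops have constant trip
// counts and the accumulator tile is a fixed-size local array.
template <std::size_t MR, std::size_t NR>
void micro_kernel(std::size_t k,
                  const float* a, std::size_t lda,  // A tile: MR x k, row-major
                  const float* b, std::size_t ldb,  // B tile: k x NR, row-major
                  float* c, std::size_t ldc) {      // C tile: MR x NR, row-major
    float acc[MR][NR] = {};
    for (std::size_t p = 0; p < k; ++p)
        for (std::size_t i = 0; i < MR; ++i)
            for (std::size_t j = 0; j < NR; ++j)
                acc[i][j] += a[i * lda + p] * b[p * ldb + j];

    // Write the accumulated tile back once, after the k loop.
    for (std::size_t i = 0; i < MR; ++i)
        for (std::size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}

// C = A * B for row-major A (m x k) and B (k x n).
// Assumption for brevity: m is a multiple of MR and n is a multiple of NR.
template <std::size_t MR, std::size_t NR>
std::vector<float> matmul_tiled(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; i += MR)
        for (std::size_t j = 0; j < n; j += NR)
            micro_kernel<MR, NR>(k, &A[i * k], k, &B[j], n, &C[i * n + j], n);
    return C;
}
```

Calling `matmul_tiled<2, 4>(A, B, m, k, n)` instantiates one specialised kernel per tile shape. A pure C version would need a macro, or one hand-written function per shape, to get the same effect.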
The response to this comment is a bit sad.
Yes, "we just need a sufficiently advanced compiler" is a meme, but 1) not everyone knows that and 2) that's not what this comment is asking for:
> but here one domain specific language can have a huge impact
There's a big difference between a sufficiently advanced compiler for general C code, and a compiler for a domain-specific array processing language, for example.
Have a look at Halide: https://halide-lang.org , where you specify the computation to perform and a schedule (loop order, blocking, vectorisation) separately. There's even an automatic scheduler that can find a good schedule given some example array sizes and parameters.
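To make the algorithm/schedule split concrete, here is a rough sketch in plain C++. It deliberately does not use Halide's API: `Algorithm`, `Schedule` and `realize` are hypothetical names for illustration, and the "schedule" here only covers tile sizes and loop order (both tile sizes are assumed to be at least 1). The idea it shows is that changing the schedule changes the traversal but never the result.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// "Algorithm": what each output element is. No loop structure is implied.
using Algorithm = std::function<float(std::size_t x, std::size_t y)>;

// "Schedule": how the output is traversed. Hypothetical and far simpler
// than a real Halide schedule; tile_x and tile_y must be >= 1.
struct Schedule {
    std::size_t tile_x;
    std::size_t tile_y;
    bool x_outermost;  // which tile loop runs outermost
};

std::vector<float> realize(const Algorithm& f, std::size_t width,
                           std::size_t height, const Schedule& s) {
    std::vector<float> out(width * height);

    // Compute one tile, clamped to the image bounds.
    auto run_tile = [&](std::size_t x0, std::size_t y0) {
        const std::size_t x1 = std::min(x0 + s.tile_x, width);
        const std::size_t y1 = std::min(y0 + s.tile_y, height);
        for (std::size_t y = y0; y < y1; ++y)
            for (std::size_t x = x0; x < x1; ++x)
                out[y * width + x] = f(x, y);
    };

    if (s.x_outermost) {
        for (std::size_t x0 = 0; x0 < width; x0 += s.tile_x)
            for (std::size_t y0 = 0; y0 < height; y0 += s.tile_y)
                run_tile(x0, y0);
    } else {
        for (std::size_t y0 = 0; y0 < height; y0 += s.tile_y)
            for (std::size_t x0 = 0; x0 < width; x0 += s.tile_x)
                run_tile(x0, y0);
    }
    return out;
}
```

Because the algorithm is fixed and only the `Schedule` varies, a search over schedules (by hand or by an automatic scheduler) cannot introduce correctness bugs, only performance differences.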
The FFTW paper is worth a read, too -- it shows that for a restricted domain it's possible to automatically find code that works as well as or better than hand-tuned algorithms. Generalising this is of course very hard.
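A toy version of that "measure, then pick" idea, applied to matrix multiplication rather than FFTs, might look like the sketch below. This is not how FFTW's planner works internally; `blocked_matmul` and `pick_block_size` are hypothetical names, each candidate is timed only once (so the choice is noisy), and the candidate list is assumed to be non-empty with non-zero entries.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Blocked C = A * B for square n x n row-major matrices with block size bs.
std::vector<double> blocked_matmul(const std::vector<double>& A,
                                   const std::vector<double>& B,
                                   std::size_t n, std::size_t bs) {
    std::vector<double> C(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t kk = 0; kk < n; kk += bs)
            for (std::size_t jj = 0; jj < n; jj += bs)
                for (std::size_t i = ii; i < std::min(ii + bs, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + bs, n); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + bs, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
    return C;
}

// Time each candidate block size once on an n x n problem and return the
// fastest one observed on this machine.
std::size_t pick_block_size(const std::vector<std::size_t>& candidates,
                            std::size_t n) {
    const std::vector<double> A(n * n, 1.0), B(n * n, 2.0);
    std::size_t best = candidates.front();
    auto best_time = std::chrono::steady_clock::duration::max();
    for (std::size_t bs : candidates) {
        const auto t0 = std::chrono::steady_clock::now();
        // Read one element through a volatile so the call is not optimised away.
        volatile double sink = blocked_matmul(A, B, n, bs)[0];
        (void)sink;
        const auto elapsed = std::chrono::steady_clock::now() - t0;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = bs;
        }
    }
    return best;
}
```

Something like `pick_block_size({16, 32, 64, 128}, 512)` would return whichever block size happened to run fastest on the current machine. A serious tuner would repeat the measurements and search a much larger space of code variants.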
For deep learning I think a lot of this is already done, see XLA for example.
I'm unconvinced by the need for intrinsics in many cases (other than possibly for prefetch). I took those out of the final version of [1] and got code that actually ran faster on a contemporary processor, targeting the same old SIMD.
1. https://github.com/flame/how-to-optimize-gemm/tree/master/sr...