BLAS-level CPU Performance in 100 Lines of C

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • blocking-tutorial

  • The core of their multiplication is in

    https://github.com/HazyResearch/blocking-tutorial/blob/maste...

    where they implement the matrix inner product.

    And their example implementations using that start at

    https://github.com/HazyResearch/blocking-tutorial/blob/maste...

    The inner multiply is all AVX2 (with the FMA extension), so it probably wins some lines of code by not having to handle hardware-capability detection or portability, but it is still impressive.

    It is also worth noting that the inner product is C++ templates - it could probably be turned into a horrific macro to make it pure C, but I'd say it isn't worth it. But then I prefer C++ to C :D
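
    As a point of reference, here is a minimal sketch of that AVX2+FMA pattern. It is not the tutorial's actual kernel, just an illustration of the idea (broadcast one element of A, load eight contiguous floats of B, fold the multiply-add into a running accumulator), and it assumes a compiler invoked with -mavx2 -mfma:

        #include <immintrin.h>

        // Illustrative 1x8 micro-tile: C[0..7] += sum over k of A[k] * B[8k .. 8k+7].
        // A supplies one scalar per k; B is packed so each k gives 8 contiguous floats.
        static inline void tile_1x8_fma(const float *A, const float *B,
                                        float *C, int K) {
            __m256 acc = _mm256_loadu_ps(C);            // 8-wide running accumulator
            for (int k = 0; k < K; ++k) {
                __m256 a = _mm256_set1_ps(A[k]);        // broadcast one element of A
                __m256 b = _mm256_loadu_ps(B + 8 * k);  // 8 consecutive elements of B
                acc = _mm256_fmadd_ps(a, b, acc);       // acc += a * b in one FMA
            }
            _mm256_storeu_ps(C, acc);
        }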

  • Halide

    a language for fast, portable data-parallel computation

  • The response to this comment is a bit sad.

    Yes, "we just needs a sufficiently advanced compiler" is a meme, but 1) not everyone knows that and 2) that's not what this comment is asking for:

    > but here one domain specific language can have a huge impact

    There's a big difference between a sufficiently advanced compiler for general C code, and a compiler for a domain-specific array processing language, for example.

    Have a look at Halide: https://halide-lang.org, where you specify the computation to perform and a schedule (loop order, blocking, vectorisation) separately. There's even an automatic scheduler that can find a good schedule given some example array sizes and parameters. (A minimal scheduling sketch follows at the end of this comment.)

    The FFTW paper is worth a read, too -- it shows that for a restricted domain it's possible to automatically find code that works as well as or better than hand-tuned algorithms. Generalising this is of course very hard.

    For deep learning I think a lot of this is already done; see XLA, for example.
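
    To make the computation/schedule split concrete, here is a minimal matrix-multiply sketch against Halide's C++ frontend. The algorithm lines say only what to compute; the schedule lines say how (tiling, vectorisation, parallelism). The sizes and scheduling choices are illustrative, not tuned:

        #include "Halide.h"
        using namespace Halide;

        int main() {
            const int N = 512;
            Buffer<float> A(N, N), B(N, N);   // fill with real data before realizing

            // Algorithm: what to compute.
            Var i("i"), j("j");
            RDom k(0, N);
            Func C("C");
            C(i, j) = sum(A(i, k) * B(k, j));

            // Schedule: how to compute it, stated separately from the algorithm.
            Var io("io"), jo("jo"), ii("ii"), ji("ji");
            C.tile(i, j, io, jo, ii, ji, 8, 8)
             .vectorize(ii)
             .parallel(jo);

            Buffer<float> out = C.realize({N, N});
            return 0;
        }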

  • how-to-optimize-gemm

    I'm unconvinced of the need for intrinsics in many cases (other than possibly for prefetch). I took those out of the final version of [1] and got code that actually ran faster on a contemporary processor, targeting the same old SIMD; a sketch of that intrinsic-free style appears below.

    1. https://github.com/flame/how-to-optimize-gemm/tree/master/sr...
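
    For illustration only, an intrinsic-free micro-kernel in that spirit might look like the following. It is not the repository's actual code, and the register-block sizes are made up; the point is that with packed panels, restrict-qualified pointers, and fixed inner trip counts, GCC or Clang at -O3 will typically auto-vectorise the accumulator loops into the same FMA SIMD an intrinsics kernel would use:

        enum { MR = 4, NR = 8 };   // illustrative register-block sizes

        // C (MR x NR, leading dimension ldc) += A_panel * B_panel, where the A panel
        // is packed MR elements per k and the B panel is packed NR elements per k.
        static void microkernel(int K, const float *__restrict Ap,
                                const float *__restrict Bp,
                                float *__restrict C, int ldc) {
            float acc[MR][NR] = {{0}};          // register-resident accumulators
            for (int k = 0; k < K; ++k)
                for (int i = 0; i < MR; ++i)
                    for (int j = 0; j < NR; ++j)
                        acc[i][j] += Ap[k * MR + i] * Bp[k * NR + j];
            for (int i = 0; i < MR; ++i)
                for (int j = 0; j < NR; ++j)
                    C[i * ldc + j] += acc[i][j];
        }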

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • Image convolution optimisation strategies.

    4 projects | /r/OpenCL | 27 Jul 2021
  • Show HN: Flash Attention in ~100 lines of CUDA

    2 projects | news.ycombinator.com | 16 Mar 2024
  • Halide v17.0.0

    1 project | news.ycombinator.com | 1 Feb 2024
  • Implementing Mario's Stack Blur 15 times in C++ (with tests and benchmarks)

    1 project | news.ycombinator.com | 10 Nov 2023
  • Deepmind Alphadev: Faster sorting algorithms discovered using deep RL

    3 projects | news.ycombinator.com | 7 Jun 2023