BLAS-level CPU Performance in 100 Lines of C

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • blocking-tutorial

  • The core of their multiplication is in

    https://github.com/HazyResearch/blocking-tutorial/blob/maste...

    where they implement the matrix inner product.

    And their example implementations using that start at

    https://github.com/HazyResearch/blocking-tutorial/blob/maste...

    The inner multiply is all AVX2 (with the FMA extension), so it probably wins some lines of code by not having to handle hardware-capability detection or portability, but it is still impressive.

    It is also worth noting that the inner product is C++ templates - it could probably be turned into a horrific macro to make it pure C, but I'd say it isn't worth it. But then I prefer C++ to C :D
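
    As a point of reference, here is a minimal sketch of that AVX2+FMA pattern. It is not the tutorial's actual kernel, just an illustration of the idea (broadcast one element of A, load eight contiguous floats of B, fold the multiply-add into a running accumulator), and it assumes a compiler invoked with -mavx2 -mfma:

        #include <immintrin.h>

        // Illustrative 1x8 micro-tile: C[0..7] += sum over k of A[k] * B[8k .. 8k+7].
        // A supplies one scalar per k; B is packed so each k gives 8 contiguous floats.
        static inline void tile_1x8_fma(const float *A, const float *B,
                                        float *C, int K) {
            __m256 acc = _mm256_loadu_ps(C);            // 8-wide running accumulator
            for (int k = 0; k < K; ++k) {
                __m256 a = _mm256_set1_ps(A[k]);        // broadcast one element of A
                __m256 b = _mm256_loadu_ps(B + 8 * k);  // 8 consecutive elements of B
                acc = _mm256_fmadd_ps(a, b, acc);       // acc += a * b in one FMA
            }
            _mm256_storeu_ps(C, acc);
        }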

  • Halide

    a language for fast, portable data-parallel computation

  • The response to this comment is a bit sad.

    Yes, "we just needs a sufficiently advanced compiler" is a meme, but 1) not everyone knows that and 2) that's not what this comment is asking for:

    > but here one domain specific language can have a huge impact

    There's a big difference between a sufficiently advanced compiler for general C code, and a compiler for a domain-specific array processing language, for example.

    Have a look at Halide: https://halide-lang.org, where you specify the computation to perform and a schedule (loop order, blocking, vectorisation) separately. There's even an automatic scheduler that can find a good schedule given some example array sizes and parameters. (A minimal scheduling sketch follows at the end of this comment.)

    The FFTW paper is worth a read, too -- it shows that for a restricted domain it's possible to automatically find code that works as well as or better than hand-tuned algorithms. Generalising this is of course very hard.

    For deep learning I think a lot of this is already done; see XLA, for example.
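
    To make the computation/schedule split concrete, here is a minimal matrix-multiply sketch against Halide's C++ frontend. The algorithm lines say only what to compute; the schedule lines say how (tiling, vectorisation, parallelism). The sizes and scheduling choices are illustrative, not tuned:

        #include "Halide.h"
        using namespace Halide;

        int main() {
            const int N = 512;
            Buffer<float> A(N, N), B(N, N);   // fill with real data before realizing

            // Algorithm: what to compute.
            Var i("i"), j("j");
            RDom k(0, N);
            Func C("C");
            C(i, j) = sum(A(i, k) * B(k, j));

            // Schedule: how to compute it, stated separately from the algorithm.
            Var io("io"), jo("jo"), ii("ii"), ji("ji");
            C.tile(i, j, io, jo, ii, ji, 8, 8)
             .vectorize(ii)
             .parallel(jo);

            Buffer<float> out = C.realize({N, N});
            return 0;
        }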

  • how-to-optimize-gemm

    I'm unconvinced of the need for intrinsics in many cases (other than possibly for prefetch). I took those out of the final version of [1] and got code that actually ran faster on a contemporary processor, targeting the same old SIMD; a sketch of that intrinsic-free style appears below.

    1. https://github.com/flame/how-to-optimize-gemm/tree/master/sr...
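
    For illustration only, an intrinsic-free micro-kernel in that spirit might look like the following. It is not the repository's actual code, and the register-block sizes are made up; the point is that with packed panels, restrict-qualified pointers, and fixed inner trip counts, GCC or Clang at -O3 will typically auto-vectorise the accumulator loops into the same FMA SIMD an intrinsics kernel would use:

        enum { MR = 4, NR = 8 };   // illustrative register-block sizes

        // C (MR x NR, leading dimension ldc) += A_panel * B_panel, where the A panel
        // is packed MR elements per k and the B panel is packed NR elements per k.
        static void microkernel(int K, const float *__restrict Ap,
                                const float *__restrict Bp,
                                float *__restrict C, int ldc) {
            float acc[MR][NR] = {{0}};          // register-resident accumulators
            for (int k = 0; k < K; ++k)
                for (int i = 0; i < MR; ++i)
                    for (int j = 0; j < NR; ++j)
                        acc[i][j] += Ap[k * MR + i] * Bp[k * NR + j];
            for (int i = 0; i < MR; ++i)
                for (int j = 0; j < NR; ++j)
                    C[i * ldc + j] += acc[i][j];
        }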

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • Image convolution optimisation strategies.

    4 projects | /r/OpenCL | 27 Jul 2021
  • Show HN: Flash Attention in ~100 lines of CUDA

    2 projects | news.ycombinator.com | 16 Mar 2024
  • Halide v17.0.0

    1 project | news.ycombinator.com | 1 Feb 2024
  • Implementing Mario's Stack Blur 15 times in C++ (with tests and benchmarks)

    1 project | news.ycombinator.com | 10 Nov 2023
  • Deepmind Alphadev: Faster sorting algorithms discovered using deep RL

    3 projects | news.ycombinator.com | 7 Jun 2023