Optimize sgemm on RISC-V platform

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • sgemm_riscv

    This project records the process of optimizing SGEMM (single-precision floating point General Matrix Multiplication) on the riscv platform.
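For reference, SGEMM computes C = alpha * A * B + beta * C on single-precision matrices. The sketch below is a plain triple-loop baseline of the kind such optimization work usually starts from; it is not code from the sgemm_riscv repository. The names `sgemm_naive` and `sgemm_flops` and the row-major layout are assumptions made for illustration. The second function gives the conventional operation count (2 * M * N * K) that throughput figures such as megaflops are normally derived from.

```c
#include <stddef.h>

/* Naive reference SGEMM: C = alpha * A * B + beta * C.
 * Assumed layout (illustrative): row-major, A is M x K, B is K x N, C is M x N. */
void sgemm_naive(size_t M, size_t N, size_t K,
                 float alpha, const float *A, const float *B,
                 float beta, float *C)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}

/* Conventional GEMM operation count: one multiply and one add per
 * inner-loop step. Divide by elapsed seconds to get flops per second. */
double sgemm_flops(size_t M, size_t N, size_t K)
{
    return 2.0 * (double)M * (double)N * (double)K;
}
```

Dividing `sgemm_flops(M, N, K)` by the measured run time of a kernel gives a throughput number that can be compared against the hardware's theoretical peak.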

  • this sgemm progress is pretty impressive. i assume 'version 1' is one of the lower lines on the first graph? he has a graph lower down that shows it getting under 300 megaflops, closer to 50 megaflops for most problem sizes. getting up to 2 gigaflops is seriously impressive; that's 50% of the theoretical maximum throughput, which is really challenging on things that aren't vector supercomputers

    probably worth archiving this link: https://github.com/Zhao-Dongyu/sgemm_riscv

    i was astounded yesterday to find that gcc 12.2.0 fails at some basic optimizations on risc-v. (all of the following is with -Os)

    this trivial subroutine

        int sumarray(int *a, int n)

  • julia

    The Julia Programming Language

  • I don't believe there is any official documentation on this, but https://github.com/JuliaLang/julia/pull/49430 for example added prefetching to the marking phase of a GC which saw speedups on x86, but not on M1.
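As a rough illustration of what prefetching in a mark loop can look like, here is a minimal sketch in C. It is not the code from the Julia pull request: the `obj_t` type, the `mark_all` function, and the flat work queue are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin, assumed to be available. The idea is to request the cache line of an object a few entries ahead while the current one is being marked; whether that helps depends on the microarchitecture, which is consistent with the x86 versus M1 difference described above.

```c
#include <stddef.h>

/* Hypothetical heap object with a mark bit (illustration only). */
typedef struct obj {
    int marked;
    int payload;
} obj_t;

/* Mark every object in a work queue, prefetching `dist` entries ahead.
 * Returns how many objects were newly marked. */
size_t mark_all(obj_t **queue, size_t n, size_t dist)
{
    size_t newly_marked = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint that queue[i + dist] will be written soon (rw = 1)
         * and should stay in cache (locality = 3). */
        if (i + dist < n)
            __builtin_prefetch(queue[i + dist], 1, 3);
        if (!queue[i]->marked) {
            queue[i]->marked = 1;
            newly_marked++;
        }
    }
    return newly_marked;
}
```

The prefetch distance is a tuning parameter: too short and the data has not arrived by the time it is needed, too long and it may already have been evicted.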

  • neural-engine

    Everything we actually know about the Apple Neural Engine (ANE)

  • yep. they have a neural engine that is separate from the CPU and GPU that does really fast matmuls https://github.com/hollance/neural-engine. it's basically completely undocumented.

  • amx

    Apple AMX Instruction Set

  • I am talking about the matrix/vector coprocessor (AMX). You can find some reverse-engineered documentation here: https://github.com/corsix/amx

    On M3 a single matrix block can achieve ~1 TFLOPS on DGEMM; I assume it will be closer to 4 TFLOPS for SGEMM. The Max variants have two such blocks. I didn't do precise benchmarking myself, but switching Python/R matrix libraries to use Apple's BLAS results in a 5-6x perf improvement on matrix-heavy code for me.

  • blis

    BLAS-like Library Instantiation Software Framework

  • There is a recent update to the blis alternative to BLAS that includes a number of RISC-V performance optimizations.

    https://github.com/flame/blis/pull/737
