This thought has been churning around in my mind for some years now: we focus too much on processing speed and reductions in time complexity, and not enough on increasing the size and efficiency of our cache and stack.
MM (especially MM on large numbers, e.g. in hashing algorithms) is very reliant on the cache, because you can’t always fit that big a number into a register. Side note: I was reading some Abseil code last night that did some funky bit twiddling on ARM: https://github.com/abseil/abseil-cpp/blob/master/absl/hash/i...
Off the top of my head, isn’t it on the order of 200 cycles (roughly 100 ns) to issue the request, cross the bus, and read something from main memory? Just a thought, but perhaps the cache and memory are where we should focus our efforts.
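The cost of missing the cache is easy to demonstrate: the two functions below sum the same row-major array, differing only in loop order. This is a minimal sketch (the array size and function names are made up for illustration); the column-order version typically runs several times slower on real hardware because nearly every load misses the cache line brought in by the previous one.

```c
#include <stddef.h>

#define DIM 1024

/* Row-major traversal: consecutive elements share cache lines,
 * so most accesses hit in L1. */
double sum_rows(double a[DIM][DIM]) {
    double s = 0.0;
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            s += a[i][j];
    return s;
}

/* Column-order traversal of the same row-major array: each access
 * jumps DIM * sizeof(double) bytes, so almost every load touches a
 * new cache line and pays the memory-latency cost discussed above. */
double sum_cols(double a[DIM][DIM]) {
    double s = 0.0;
    for (size_t j = 0; j < DIM; j++)
        for (size_t i = 0; i < DIM; i++)
            s += a[i][j];
    return s;
}
```

Both return the same value; only the number of cache misses differs, which is exactly the kind of effect raw time-complexity analysis hides.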
-
However, on recent CPUs 4x4 is small for the innermost block size of the non-trivial blocking hierarchy you need. You can see examples under https://github.com/flame/blis/tree/master/config, with an a priori procedure for determining them in https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analyti... (but compare with what's actually used for SKX, in particular). OpenBLAS is normally similar, and may come out somewhat faster, but the structure is easier to see in BLIS.
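To make the "innermost block" concrete, here is a minimal sketch of register blocking around a 4x4 micro-kernel, in the style BLIS organizes its GEMM. All names and the choice MR = NR = 4 are illustrative only; real BLIS kernels use larger, architecture-specific tiles (which is the point of the comment above), plus packing and several outer levels of cache blocking that this sketch omits. Dimensions are assumed to be multiples of 4.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 };  /* register-block sizes (illustrative) */

/* Innermost level: accumulate a MR x NR tile of C in a local array
 * (which the compiler can keep in registers) while streaming once
 * through a MR x k panel of A and a k x NR panel of B. */
static void micro_kernel_4x4(size_t k,
                             const double *A, size_t lda,
                             const double *B, size_t ldb,
                             double *C, size_t ldc) {
    double c[MR][NR] = {{0.0}};
    for (size_t p = 0; p < k; p++)
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NR; j++)
                c[i][j] += A[i * lda + p] * B[p * ldb + j];
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < NR; j++)
            C[i * ldc + j] += c[i][j];
}

/* C += A * B, all row-major; m and n assumed multiples of MR/NR.
 * A real implementation adds packing and cache-level blocking here. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C) {
    for (size_t i = 0; i < m; i += MR)
        for (size_t j = 0; j < n; j += NR)
            micro_kernel_4x4(k, A + i * k, k, B + j, n,
                             C + i * n + j, n);
}
```

The hierarchy matters because each level keeps one operand resident in one level of the memory system; the analytical model linked above derives the block sizes from the cache parameters.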