AVX512/VBMI2: A Programmer’s Perspective

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • highway

    Performance-portable, length-agnostic SIMD with runtime dispatch

  • > _mm256_blend_pd is very fast instruction, a single cycle of latency. The highway’s emulation is going to be way more expensive.

    Actually it uses exactly the same instruction :) https://github.com/google/highway/blob/master/hwy/ops/x86_25...

    Thanks for sharing the list! Out of curiosity, can you expand on what _mm256_permute4x64_pd is used for?

    > _mm_addsub_ps (that one is vertical but still missing from highway)

    Indeed, we have not needed complex-valued functions yet. AFAIK NEON does not have such an instruction but it could be emulated with an extra XOR+constant. Will add to wishlist, we implement them whenever there's a use case.

    > I have tons of #ifdef there

    Seems reasonable if you only target v7 and ASIMD; but #ifdef is increasingly infeasible now that SVE and Risc-V V are coming, no?

    > to support NEON which differs substantially, there’s stuff like vrev64q_f32 and vextq_f32

    Those can actually be bridged just fine, they correspond to Shuffle2301 and CombineShiftRightLanes.

    > Even if you expose all the missing horizontal stuff in highway — won’t be much better than intrinsics. Such code ain’t gonna use AVX512 when available.

    Because of the hardcoded vector size? I agree that's best avoided, in your case perhaps by batching the matrix mul and applying SIMD over that dimension instead?
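The XOR-plus-constant emulation of _mm_addsub_ps mentioned in the reply above could look roughly like the sketch below. It is not Highway code. The helper name addsub_emulated and the array-based interface are assumptions made for illustration, and the x86 branch deliberately sticks to SSE1 so that it mirrors the NEON branch instead of calling _mm_addsub_ps itself.

```cpp
#include <array>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
#include <xmmintrin.h>  // SSE1 only; the real _mm_addsub_ps needs SSE3
#endif

// Hypothetical helper (not a Highway API). Returns
// { a0 - b0, a1 + b1, a2 - b2, a3 + b3 }, the lane pattern of _mm_addsub_ps.
inline std::array<float, 4> addsub_emulated(const std::array<float, 4>& a,
                                            const std::array<float, 4>& b) {
    std::array<float, 4> r;
#if defined(__ARM_NEON)
    // Constant with only the sign bit set in lanes 0 and 2.
    const uint32_t bits[4] = {0x80000000u, 0u, 0x80000000u, 0u};
    const uint32x4_t mask = vld1q_u32(bits);
    const float32x4_t vb = vld1q_f32(b.data());
    // XOR flips the sign of b in lanes 0 and 2, then a plain add finishes the job.
    const float32x4_t flipped =
        vreinterpretq_f32_u32(veorq_u32(vreinterpretq_u32_f32(vb), mask));
    vst1q_f32(r.data(), vaddq_f32(vld1q_f32(a.data()), flipped));
#else
    // _mm_set_ps lists lanes from 3 down to 0; -0.0f has only the sign bit set.
    const __m128 mask = _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f);
    const __m128 flipped = _mm_xor_ps(_mm_loadu_ps(b.data()), mask);
    _mm_storeu_ps(r.data(), _mm_add_ps(_mm_loadu_ps(a.data()), flipped));
#endif
    return r;
}
```

The extra cost relative to a native instruction is one XOR plus a register holding the sign constant, which matches the "extra XOR+constant" estimate in the comment.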

  • Vrmac

    Vrmac Graphics, a cross-platform graphics library for .NET. Supports 3D, 2D, and accelerated video playback. Works on Windows 10 and Raspberry Pi4.

  • > have you needed any non-vertical ops that are not in that list?

    Yes indeed. I rarely use SIMD for vertical-only ops; for such use cases GPUs are very often better than CPUs.

    I already wrote an example in my previous comment. It’s possible to emulate with the stuff you have, however _mm256_blend_pd is a very fast instruction, with a single cycle of latency. Highway’s emulation is going to be way more expensive. You are probably compiling your UpperHalf() into _mm256_extractf128_pd and Combine() into _mm256_insertf128_pd; that’s 2 instructions and (on Skylake) 6 cycles of latency instead of 1 cycle.

    6 cycles instead of 1 cycle is a large overhead in that context. That particular small matrix multiplication is called rather often. I only optimize code when the profiler tells me to. For the majority of CPU-bound code in that project, Eigen’s implementation is actually good enough.

    I’ve searched the source code of that project (CAM/CAE software). Here’s the list of the shuffle intrinsics I use, some of them a lot: _mm256_blend_pd, _mm_blend_ps, _mm_blend_epi32, _mm256_permute2f128_pd, _mm256_permute_ps, _mm256_permute4x64_pd, _mm256_permutevar8x32_ps, _mm256_permutevar8x32_epi32, _mm_permute_ps, _mm_permute_pd, _mm_insert_ps, _mm_movehdup_ps, _mm_moveldup_ps, _mm_loaddup_pd, _mm_extract_ps, _mm_dp_ps, _mm_extract_epi32, _mm_extract_epi64, _mm_shuffle_epi32.

    A similar list for this project https://github.com/Const-me/Vrmac (a GPU-centric library for 3D and 2D graphics, not using any AVX): _mm_shuffle_epi8, _mm_alignr_epi8, _mm_shuffle_epi32, _mm_shuffle_ps, _mm_addsub_ps (that one is vertical but still missing from highway), _mm_insert_epi32, _mm_insert_ps, _mm_extract_ps, _mm_extract_epi16, _mm_movehdup_ps, _mm_dp_ps. BTW the project is portable between AMD64 and ARMv7, I have tons of #ifdef there to support NEON which differs substantially, there’s stuff like vrev64q_f32 and vextq_f32, 64-bit SIMD vectors, and quite a few other instructions missing on AMD64.

    Even if you expose all the missing horizontal stuff in highway, it won’t be much better than intrinsics. Such code ain’t gonna use AVX512 when available. It is only going to inflate the software complexity for no good reason, by adding an unneeded layer of abstraction between the application’s code and the actual hardware.
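The two instruction sequences compared in the comment above can be written side by side as in the sketch below. The function names, the Vec4d alias and the choice of lanes (lower half from a, upper half from b) are assumptions for illustration, not code from either project. The intrinsics are only used when the compiler has AVX enabled (for example with -mavx); otherwise a scalar fallback keeps the snippet buildable.

```cpp
#include <array>
#if defined(__AVX__)
#include <immintrin.h>
#endif

using Vec4d = std::array<double, 4>;

// Single-instruction form: lanes 0-1 from a, lanes 2-3 from b.
inline Vec4d blend_direct(const Vec4d& a, const Vec4d& b) {
#if defined(__AVX__)
    Vec4d r;
    // Immediate bit i set means "take lane i from b"; 0xC selects lanes 2 and 3.
    const __m256d v = _mm256_blend_pd(_mm256_loadu_pd(a.data()),
                                      _mm256_loadu_pd(b.data()), 0xC);
    _mm256_storeu_pd(r.data(), v);
    return r;
#else
    return Vec4d{a[0], a[1], b[2], b[3]};  // scalar fallback without AVX
#endif
}

// Two-instruction form: extract the upper half of b, insert it into a.
inline Vec4d blend_via_halves(const Vec4d& a, const Vec4d& b) {
#if defined(__AVX__)
    Vec4d r;
    const __m128d hi = _mm256_extractf128_pd(_mm256_loadu_pd(b.data()), 1);
    const __m256d v = _mm256_insertf128_pd(_mm256_loadu_pd(a.data()), hi, 1);
    _mm256_storeu_pd(r.data(), v);
    return r;
#else
    Vec4d r = a;   // start from a ...
    r[2] = b[2];   // ... and overwrite the upper half with b's upper half
    r[3] = b[3];
    return r;
#endif
}
```

Both functions return the same values. The difference under discussion is only the instruction count and latency, and whether a given abstraction layer emits the first form or the second is something to confirm in the generated assembly.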

  • DtsDecoder

  • > we have not needed complex-valued functions yet

    It wasn’t related to complex numbers. I was doing something like this, saving a few instructions: https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_lin...

    > but #ifdef is increasingly infeasible now that SVE and Risc-V V are coming, no?

    I think that’s wishful thinking, like all these years of Linux on the desktop, or rewriting everything in Rust. I think ARM is good enough for most applications, except very small niches (very low power, very price-sensitive, mostly embedded).

    > Those can actually be bridged just fine

    Yeah, but other things can’t. For one, ARM has 8-byte SIMD vectors. These are quite useful for a library which implements non-trivial algorithms processing 2D vector data with tons of FP32 2D vectors. Another thing: this whole source file https://github.com/Const-me/DtsDecoder/blob/master/Utils/sto... implements an equivalent of a single vst3q_s16 NEON instruction. That code is inlined and called in a loop which does not do much else, i.e. these 9 shuffle constants stay in vector registers.
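To make the vst3q_s16 point concrete, here is a rough sketch of a 3-way interleaved store. It is not the code from the linked file. The names store3_interleaved and interleave_mask are hypothetical, and the x86 path assumes SSSE3 is enabled at compile time (for example with -mssse3), rebuilding its nine byte-shuffle masks on every call for brevity. A scalar loop is used when neither NEON nor SSSE3 is available.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSSE3__)
#include <tmmintrin.h>

// Mask picking, for output vector `out` (0..2), the bytes that come from
// source vector `src` (0 = a, 1 = b, 2 = c); 0x80 makes pshufb write zero.
inline __m128i interleave_mask(int src, int out) {
    alignas(16) uint8_t m[16];
    for (int byte = 0; byte < 16; ++byte) {
        const int elem = out * 8 + byte / 2;  // position in the 24-value output
        m[byte] = (elem % 3 == src)
                      ? static_cast<uint8_t>((elem / 3) * 2 + byte % 2)
                      : static_cast<uint8_t>(0x80);
    }
    return _mm_load_si128(reinterpret_cast<const __m128i*>(m));
}
#endif

// Writes a0 b0 c0 a1 b1 c1 ... a7 b7 c7 to dst (24 values).
inline void store3_interleaved(int16_t* dst, const int16_t* a,
                               const int16_t* b, const int16_t* c) {
#if defined(__ARM_NEON)
    int16x8x3_t v;
    v.val[0] = vld1q_s16(a);
    v.val[1] = vld1q_s16(b);
    v.val[2] = vld1q_s16(c);
    vst3q_s16(dst, v);  // the single NEON instruction
#elif defined(__SSSE3__)
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c));
    for (int out = 0; out < 3; ++out) {
        // Three shuffles OR-ed together per output vector: nine masks in total.
        // In a hot loop they would be precomputed and kept in registers.
        const __m128i r = _mm_or_si128(
            _mm_or_si128(_mm_shuffle_epi8(va, interleave_mask(0, out)),
                         _mm_shuffle_epi8(vb, interleave_mask(1, out))),
            _mm_shuffle_epi8(vc, interleave_mask(2, out)));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 8 * out), r);
    }
#else
    for (int i = 0; i < 8; ++i) {  // scalar fallback
        dst[3 * i] = a[i];
        dst[3 * i + 1] = b[i];
        dst[3 * i + 2] = c[i];
    }
#endif
}
```

The x86 branch needs nine shuffles and six ORs where NEON needs one store, which is the kind of gap the comment says a portable wrapper would have to paper over.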

