-
Mask add looks neat! I wish there were a way to directly manipulate AVX-512 mask registers in .NET intrinsics, but for now we have to live with "recognized idioms".
Some months ago I wrote a similar upcase/downcase implementation for ASCII within UTF-8: https://github.com/U8String/U8String/blob/main/Sources/U8Str...
(The unrolled conversion for lengths below the vectorization threshold is required because short strings dominate most codebases, so handling them fast is important. The switch compiles to a jump table followed by a branchless fall-through to the return.)
For now it goes only as wide as 256 bits, since that already saturates e.g. Zen 3 or Zen 4, which have only four 256-bit SIMD units (even though Zen 4 can do fancy 512-bit shuffles natively and has a very good 512-bit implementation). The core compiles to:
cmp rdx, 32
-
"SIMD Within A Register"
I think the implication is that you can pack multiple items into an ordinary register and effectively get SIMD even if you aren't using explicit SIMD instructions. E.g. if you pack a 31-bit and a 32-bit number into a 64-bit register (you need 1 spare bit to absorb the carry), you can do 2 adds with a single 64-bit add.
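To make the packed add concrete, here is a minimal sketch. The layout is an assumption (32-bit lane in bits 0..31, spare carry bit at bit 32, 31-bit lane in bits 33..63), and `pack`, `unpack`, `packed_add` and `low_carry` are hypothetical names. Python integers are unbounded, so the result is masked to 64 bits to imitate a real register.

```python
MASK64 = (1 << 64) - 1       # imitate a 64-bit register
LO_MASK = (1 << 32) - 1      # 32-bit lane in bits 0..31
HI_MASK = (1 << 31) - 1      # 31-bit lane in bits 33..63
SPARE_BIT = 32               # bit 32 stays clear in inputs and absorbs the low carry


def pack(hi31, lo32):
    """Pack a 31-bit and a 32-bit value into one 64-bit word."""
    if not (0 <= hi31 <= HI_MASK and 0 <= lo32 <= LO_MASK):
        raise ValueError("lane value out of range")
    return (hi31 << (SPARE_BIT + 1)) | lo32


def unpack(word):
    """Return (hi31, lo32), ignoring the spare bit."""
    return (word >> (SPARE_BIT + 1)) & HI_MASK, word & LO_MASK


def packed_add(a, b):
    """One 64-bit add performs both lane additions (each lane wraps)."""
    return (a + b) & MASK64


def low_carry(word):
    """Carry out of the low lane, caught by the spare bit."""
    return (word >> SPARE_BIT) & 1


total = packed_add(pack(1, 0xFFFFFFFF), pack(2, 1))
print(unpack(total), low_carry(total))  # low lane wrapped; high lane unaffected
```

Without the spare bit, the carry out of the low lane would land in the lowest bit of the high lane and corrupt it.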
Games have used these tricks for graphics to pack RGB(A) values into 32-bit integers. E.g. this code from ScummVM interpolates two 16-bit RGB pixels (6 total components) packed into a 32-bit value. https://github.com/scummvm/scummvm/blob/master/graphics/scal...
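A sketch of the general idea, not the ScummVM code itself: assuming RGB565 pixels, two of them fit in one 32-bit word, and a mask trick averages all 6 components of two such words at once. The function names and mask constants here are my own. Clearing the lowest bit of each component before shifting keeps bits from leaking into the neighbouring component, and `a & b & LOW_BITS` restores the rounding that the shift dropped.

```python
# Masks for two RGB565 pixels side by side in a 32-bit word (assumed layout).
LOW_BITS = 0x08210821    # lowest bit of every R, G and B field
HIGH_BITS = 0xF7DEF7DE   # every other bit


def average_two_pixel_words(a, b):
    """Average all 6 components of two packed pixel pairs with plain integer ops."""
    return ((a & HIGH_BITS) >> 1) + ((b & HIGH_BITS) >> 1) + (a & b & LOW_BITS)


def split_rgb565(pixel):
    """Unpack one 16-bit pixel into its (r, g, b) components."""
    return (pixel >> 11) & 0x1F, (pixel >> 5) & 0x3F, pixel & 0x1F


def average_reference(a, b):
    """Slow per-component version, used only to check the packed one."""
    out = 0
    for shift in (16, 0):
        pa, pb = (a >> shift) & 0xFFFF, (b >> shift) & 0xFFFF
        r, g, bl = ((x + y) // 2 for x, y in zip(split_rgb565(pa), split_rgb565(pb)))
        out |= ((r << 11) | (g << 5) | bl) << shift
    return out


a, b = 0x12345678, 0x9ABCDEF0
print(hex(average_two_pixel_words(a, b)), hex(average_reference(a, b)))
```

The packed version needs three ANDs, two shifts and two adds regardless of how many components are in the word, which is where the win over per-component code comes from.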
-
Unfortunately, those SIMD optimizations are only useful for strings that start at an 8-byte-aligned address.
If your SIMD algorithm is applied to an unaligned string, it is often slower than the original algorithm.
And splitting the algorithm into 3 parts (handling the beginning up to an aligned address, then the aligned part, and then the less-than-8-bytes tail) takes even more instructions.
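For reference, the three-part split comes down to the arithmetic below. This is only a sketch: `address` is a plain integer standing in for a pointer, and `split_for_alignment` is a hypothetical helper, not something from the linked code.

```python
def split_for_alignment(address, length, width=8):
    """Return (head, body, tail) byte counts for a buffer at `address`.

    head: bytes handled one at a time until the address is `width`-aligned
    body: bytes handled `width` at a time from an aligned address
    tail: leftover bytes (fewer than `width`) handled one at a time
    """
    head = min((-address) % width, length)   # distance to the next aligned address
    body = (length - head) // width * width  # whole words that fit after the head
    tail = length - head - body
    return head, body, tail


print(split_for_alignment(0x1003, 20))  # unaligned start
print(split_for_alignment(0x1000, 16))  # already aligned
print(split_for_alignment(0x1001, 3))   # too short to reach the aligned part
```

For short strings most of the bytes end up in the head and tail, which is the extra work being complained about.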
Here is a similar case involving a false claim of a faster utf8.IsValid in Go, with benchmarks: https://github.com/sugawarayuuta/charcoal/pull/1
-
There's a debate on how unsafe/unsound this technique actually is. https://github.com/ogxd/gxhash/issues/82
I definitely see the conundrum, since the dangerous code yields such a huge performance gain.