> because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken
So, the maximum number of times you can hit '\0' is once per string, because the function then returns, whereas the other characters can be hit many times. That seems like information a compiler has access to without PGO.
PGO does help, of course, and on my machine gives me 2.80s, which is better than the code at the end of the `Rearranging blocks` section :)
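To make the branch-likelihood point concrete, here is a minimal sketch of the kind of loop being discussed, with the "terminator is rare" assumption spelled out by hand. The function name `run_switches` and the +1 for 's' / -1 for 'p' counting rule are assumptions for illustration, not a copy of the code in the post. `__builtin_expect` is a GCC/Clang extension, and whether the hint changes the generated code depends on the compiler and flags.

```c
/* Hypothetical loop for illustration: +1 for each 's', -1 for each 'p',
   and return the running total at the NUL terminator. */
int run_switches(const char *input) {
    int res = 0;
    for (;;) {
        char c = *input++;
        /* The terminator can be reached at most once per call, so hint
           that this branch is unlikely (GCC/Clang extension). */
        if (__builtin_expect(c == '\0', 0)) {
            return res;
        }
        if (c == 's') {
            res += 1;
        } else if (c == 'p') {
            res -= 1;
        }
        /* Any other character leaves the total unchanged. */
    }
}
```

The hint only encodes what the comment above argues is already derivable from the control flow: the exit branch runs once, while the other branches can run once per character.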
> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo)
It's described under `Benchmarking setup`, and is in the repository here: https://github.com/414owen/blog-code/blob/master/01-six-time...
Side note: There's a part two to this post (linked at the bottom) where I make the C code as fast as I possibly can, and it beats all the assembly in this post.
I never said writing assembly is (necessarily) a good idea, I just find optimizing it, and deciphering compiler output, an interesting challenge, and a good learning opportunity.
Run compilers interactively from your web browser and interact with the assembly
I tried https://godbolt.org/, and neither Clang nor GCC trunk generates the same code for the two programs.
Pretty shocking for such a simple program.
Performance-portable, length-agnostic SIMD with runtime dispatch
You could study Google's Highway library.