Yikes.
A lot of code uses _mm_rsqrt_ps, (sometimes) followed by a Newton-Raphson update, to compute a "precise" 1/sqrt(x). Here's a good example of NEON's rsqrt being sufficiently different from Intel's that more iterations were necessary for Embree on ARM [1].
I only cared about vectorization a long time ago, and AMD was so uncompetitive back then that I'd bet a lot of code assumes the SSE rsqrtps values match Intel's.
[1] https://github.com/lighttransport/embree-aarch64/issues/20
(Too late for edit)
Looks like Eigen also defaults to EIGEN_FAST_MATH which makes Eigen's psqrt ("packet sqrt") use _mm256_rsqrt_ps instead of _mm256_sqrt_ps [1].
Interestingly, the thing they're trying to avoid (the long latency of sqrt vs. rsqrt) hasn't been true for a long time on Intel processors, but apparently is still true for AMD parts according to Agner Fog's tables [2] (though maybe I'm reading them wrong; there is no vsqrtps entry for Zen2/3).
[1] https://gitlab.com/libeigen/eigen/-/blob/a75122584594fb98db0...
[2] https://agner.org/optimize/instruction_tables.pdf
I wrote something up when I ran into these instructions last year: https://github.com/akkartik/mu/blob/main/linux/x86_approx.md
I investigated the differences between the rsqrt and rcp instructions on Intel and AMD platforms back in 2016, and drafted a note with my findings. See the file rsqrt_rcp/docs/rsqrt_rcp.pdf in the git repository https://github.com/jeff-arnold/math_routines.
Some conclusions: