Optimization Example: Mandelbrot Set (part 1)

Our great sponsors

InfluxDB - Power Real-Time Data Analytics at Scale

WorkOS - The modern identity platform for B2B SaaS

SaaSHub - Software Alternatives and Reviews

Our great sponsors

mandelset

1 5 0.0 JavaScript

A WebGL-based mandelbrot set explorer for desktop web browsers

Wow, this article opened my eyes to what is possible in optimizing code for a CPU.
I was expecting this article to rehash the usual algorithmic tricks for speeding up Mandelbrot set calculation -- filling in boxes where the four corners already have the same color, checking for repeating iteration cycles to fill in black, etc.
But I was very surprised to learn about how important CPU pipelines are, and therefore about "sheep race optimization" and interleaving, and then about the possibility of using vectorization on CPU. And in the end I'm utterly astonished that adding everything up results in the algorithm being calculated eight times faster. I would never have dreamed that such a level of speed improvement would be possible on such a simply, straightforward mathematical algorithm.
Of course, in the real world everybody today is going to calculate the Mandelbrot set using the GPU (as the author notes -- this is just an exercise). But now it has me tremendously curious as to whether there are similar optimizations that can be applied on the GPU side?
And especially whether those are accessible if done in WebGL/WebGPU? There are quite a number of Mandelbrot set generators available in the browser that use the GPU for calculations. The best I've found so far is:
http://mandelset.ru/ - https://github.com/iskandarov-egor/mandelset

FractalShark

1 6 9.4 C++

FractalShark - a fast Mandelbrot Set renderer for Nvidia GPUs

"Are there similar optimizations that can be applied on the GPU side" is a fascinating question.
There are actually two communities trying to write the fastest mandelbrot renderer with very different definitions of fastest. Weirdly, the two communities basically don't acknowledge each other- I don't mean there's animosity, I mean they seem to just not show up in each other's google searches- total lack of mention.
On one side is "What is the fastest way to render the mandelbrot set in floating point precision," such as the parent article. This is where you might see tricks like filling in boxes, directly checking for repeating cycles etc, but advanced attempts always seem to get into writing assembly and carefully managing CPU multipliers- The OP's article is a wonderful example of optimizing in this community. I don't really know of many high-man-hour attempts from this branch that use GPU rendering, since it's more about delving into the CPU's details and being faster than other codebases in the same constraints than about achieving a real world task that would otherwise be too slow. The real center of this community is https://benchmarksgame-team.pages.debian.net/benchmarksgame/..., where not only is GPU banned because it would be missing the point, but optimizations like checking whether four corners are all the same color are also verboten, as it's all about how fast the inner loop can spin.
The second community is trying to go deep, deep into the mandelbrot set- way past where it makes sense to represent the location of a pixel as a floating point number. Here speed is a practical concern. Images like "Evolution of Trees" https://www.deviantart.com/dinkydauset/art/Evolution-of-tree... used to take weeks to render, and it's been a no-holds-barred brawl to reduce that time- of course the GPU and alternative algorithms are allowed. The fastest codebases that I know of are FractalShark https://github.com/mattsaccount364/FractalShark and Kallas Fraktaler 3 https://mathr.co.uk/kf/kf.html#top and both have CUDA kernels, but I don't think that these kernels have been optimized to the point of finding and fixing sheep races.
The reason is twofold- first, the algorithmic progress has been lightning fast in recent years. The biggest breakthrough, bilinear approximation with rebasing, was discovered in 2022- and it (roughly, the actual runtime is currently unknown) takes the time to calculate an image from O(num pixels * num iterations) to O((num pixels * log (num iterations)) + num iterations). As a result, there just hasn't been time since 2022 to optimize the constant factors to the level in OP, especially now that the algorithm is so much more complicated than multiply, add, repeat. Second, with just lightly optimized CUDA for the num pixels log num iteration term, the overall runtime is currently dominated by the term with no dependency on the number of pixels. No one knows how to parallelize that term (at least not to many cores- fractal shark can use three cores I believe), so improvements in the CUDA kernel don't translate to unlocking deeper renders.
The "render deep images fast" community mostly hangs out on fractalforums.org- I highly recommend swinging by, it's a wonderful and active forum (note- heavily moderated to stay on topic- makes hackernews look like the wild west)

InfluxDB

www.influxdata.com sponsored

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
veccore

1 79 4.8 C++

C++ Library for Portable SIMD Vectorization

In VecCore (a small C++ SIMD abstraction library on top of Vc and std::simd), I created some simple examples to show how to use the library to optimize code using SIMD in a somewhat generic way. You can find it on GitHub at https://github.com/root-project/veccore
I have examples for Julia sets and the Mandelbrot set, including an implementation with AVX2 intrinsics.
These days with std::simd more widely available there's less of a reason to use VecCore, but the examples may still be educational enough.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project