You might want to create a kernel in Halide to check a reasonable tuned kernel: https://github.com/halide/Halide/blob/master/apps/blur/halide_blur_generator.cpp
Efficient matrix multiplications or convolutions on CPU use layered tiling to optimize for registers, L1, L2, TLB, and L3 cache (if it exists). This improves speed by over 150x versus a naive triple for-loop matrix multiplication, and the same applies to convolution. See the overview at https://www.cs.utexas.edu/users/flame/laff/pfhp/week3-goto.html and hands-on exercises at https://github.com/flame/blislab
As explained in https://github.com/NervanaSystems/maxas/wiki/SGEMM, you need to do the same on GPUs, tiling for shared memory and registers instead of CPU caches.
Thanks, here is the code: https://github.com/Omeganx/Image-Convolutaion-OpenCL (I removed the other code I was using so it stays focused on the convolution code)