monolish
ParallelReductionsBenchmark
| monolish | ParallelReductionsBenchmark
---|---|---
Mentions | 1 | 2
Stars | 189 | 59
Growth | 0.0% | -
Activity | 7.8 | 4.6
Last commit | 6 months ago | 5 months ago
Language | C++ | C++
License | Apache License 2.0 | -
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
ParallelReductionsBenchmark
Failing to Reach 204 GB/S DDR4 Bandwidth
In the single-threaded version, they have a data hazard on the sum: every addition depends on the result of the previous one. A little loop unrolling with separate accumulator variables would smooth that out.
But in the [threaded version](https://github.com/unum-cloud/ParallelReductions/blob/fd16d9...), each thread gets its own accumulator slot, yet the slots still live in one shared vector, which most likely has the issue I described.
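The issue the commenter refers to is not spelled out in this excerpt. Assuming it is false sharing (adjacent accumulator slots landing on the same cache line), one common mitigation could look like the sketch below. It is not the benchmark's code: `PaddedSum`, `parallel_sum`, and the 64-byte cache-line size are all assumptions for illustration, and the over-aligned vector elements rely on C++17 aligned allocation.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, typical for x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// Each slot is aligned and padded to a full cache line, so two threads
// never write to the same line.
struct alignas(kCacheLine) PaddedSum {
    double value = 0.0;
};

// Hypothetical helper: splits the input into contiguous chunks, one per thread.
double parallel_sum(const std::vector<float>& data, std::size_t thread_count) {
    if (thread_count == 0) thread_count = 1;
    std::vector<PaddedSum> partials(thread_count);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    const std::size_t chunk = (data.size() + thread_count - 1) / thread_count;

    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&data, &partials, chunk, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            double local = 0.0;  // accumulate locally, write the slot once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partials[t].value = local;
        });
    }
    for (auto& worker : workers) worker.join();

    double total = 0.0;
    for (const auto& partial : partials) total += partial.value;
    return total;
}
```

Accumulating into a local variable and storing it once at the end already removes most of the shared-line traffic, so the padding acts as a second line of defense. On Linux this may need `-pthread` when compiling.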
What are some alternatives?
oneMKL - oneAPI Math Kernel Library (oneMKL) Interfaces
MatX - An efficient C++17 GPU numerical computing library with Python-like syntax
fms_blas - Lightweight BLAS (and some LAPACK) wrapper.
ispc - Intel® Implicit SPMD Program Compiler
BitLib - Provides a bit-vector, an optimized replacement of the infamous std::vector<bool>. In addition to the bit-vector, the library also provides implementations of STL algorithms tailored for bit-vectors.
gpuowl - GPU Mersenne primality test.
stdgpu - stdgpu: Efficient STL-like Data Structures on the GPU
alpaka - Abstraction Library for Parallel Kernel Acceleration
cuda_memtest - Fork of CUDA GPU memtest
Scalix - Scalix is a data parallel compute library that automatically scales to the available compute resources.
eaminer - Heterogeneous Ethereum Miner with support for AMD, Intel and Nvidia GPUs using SYCL, OpenCL and CUDA backends