NCCL vs ArrayFire

Compare NCCL and ArrayFire and see how they differ.

NCCL

Optimized primitives for collective multi-GPU communication (by NVIDIA)
                NCCL                                       ArrayFire
Mentions        3                                          6
Stars           2,808                                      4,404
Growth          4.0%                                       1.2%
Activity        5.8                                        7.8
Latest commit   2 days ago                                 23 days ago
Language        C++                                        C++
License         GNU General Public License v3.0 or later   BSD 3-clause "New" or "Revised" License
Mentions indicates the total number of mentions we've tracked, plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

NCCL

Posts with mentions or reviews of NCCL. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-06.
  • MPI jobs to test
    2 projects | /r/HPC | 6 Jun 2023
    % rm -rf /tmp/nccl ; git clone --recursive https://github.com/NVIDIA/nccl.git ; cd nccl ; git grep MPI
    Cloning into 'nccl'...
    remote: Enumerating objects: 2769, done.
    remote: Counting objects: 100% (336/336), done.
    remote: Compressing objects: 100% (140/140), done.
    remote: Total 2769 (delta 201), reused 287 (delta 196), pack-reused 2433
    Receiving objects: 100% (2769/2769), 3.04 MiB | 3.37 MiB/s, done.
    Resolving deltas: 100% (1820/1820), done.
    README.md:NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.
    src/collectives/broadcast.cc:/* Deprecated original "in place" function, similar to MPI */
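The README quoted above names all-reduce as NCCL's signature collective: every rank contributes a buffer, and every rank receives the element-wise reduction of all contributions. As a toy illustration of those semantics only (plain Python over in-process "ranks", not the NCCL API, which runs on GPU buffers):

```python
# Toy illustration of all-reduce (sum) semantics, not the NCCL API:
# every rank contributes a vector, and every rank ends up holding the
# element-wise sum of all ranks' contributions.
def all_reduce_sum(buffers):
    """Simulate an all-reduce over a list of per-rank buffers."""
    total = [sum(vals) for vals in zip(*buffers)]  # element-wise reduction
    return [list(total) for _ in buffers]          # broadcast result to every rank

ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(all_reduce_sum(ranks))
# → [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

In real NCCL the reduction is performed in parallel (e.g. via ring or tree algorithms) over NVLink/PCIe/InfiniBand rather than gathered in one place, but the input/output contract is the same.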
  • NVLink and Dual 3090s
    1 project | /r/nvidia | 4 May 2022
    If it's rendering, you don't really need SLI, you need to install NCCL so that GPUs memory can be pooled: https://github.com/NVIDIA/nccl
  • Distributed Training Made Easy with PyTorch-Ignite
    7 projects | dev.to | 10 Aug 2021
    backends from native torch distributed configuration: nccl, gloo, mpi.

ArrayFire

Posts with mentions or reviews of ArrayFire. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-04-27.
  • Learn WebGPU
    9 projects | news.ycombinator.com | 27 Apr 2023
    Loads of people have stated why easy GPU interfaces are difficult to create, but we solve many difficult things all the time.

    Ultimately I think CPUs are just satisfactory for the vast vast majority of workloads. Servers rarely come with any GPUs to speak of. The ecosystem around GPUs is unattractive. CPUs have SIMD instructions that can help. There are so many reasons not to use GPUs. By the time anyone seriously considers using GPUs they're, in my imagination, typically seriously starved for performance, and looking to control as much of the execution details as possible. GPU programmers don't want an automagic solution.

    So I think the demand for easy GPU interfaces is just very weak, and therefore no effort has taken off. The amount of work needed to make it as easy to use as CPUs is massive, and the only reason anyone would even attempt to take this on is to lock you in to expensive hardware (see CUDA).

    For a practical suggestion, have you taken a look at https://arrayfire.com/ ? It can run on both CUDA and OpenCL, and it has C++, Rust and Python bindings.

  • seeking C++ library for neural net inference, with cross platform GPU support
    1 project | /r/Cplusplus | 12 Sep 2022
    What about Arrayfire. https://github.com/arrayfire/arrayfire
  • [D] Deep Learning Framework for C++.
    7 projects | /r/MachineLearning | 12 Jun 2022
    Low-overhead — not our goal, but Flashlight is on par with or outperforming most other ML/DL frameworks with its ArrayFire reference tensor implementation, especially on nonstandard setups where framework overhead matters
  • [D] Neural Networks using a generic GPU framework
    2 projects | /r/MachineLearning | 4 Jan 2022
    Looking for frameworks with Julia + OpenCL I found ArrayFire. It seems quite good, bonus points for Rust bindings. I will keep looking for more; Julia completely fell off my radar.
  • Windows 11 will block the workarounds that make it easier to use an alternative browser to Edge
    1 project | /r/france | 25 Nov 2021
  • Arrayfire progressive performance decline?
    1 project | /r/rust | 9 Jun 2021
    Your Problem may be the lazy evaluation, see this issue: https://github.com/arrayfire/arrayfire/issues/1709
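The issue linked above stems from ArrayFire's lazy evaluation: operations build up an expression graph and only execute when a result is actually needed (or evaluation is forced), so timing individual operations measures graph construction rather than real work. A toy model of the pattern in plain Python (illustrative only, not ArrayFire's API):

```python
# Toy model of lazy evaluation (illustrative, not ArrayFire's API).
# Arithmetic merely records the expression; work happens at eval().
class Lazy:
    def __init__(self, value=None, op=None, args=()):
        self.value, self.op, self.args = value, op, args

    def __add__(self, other):
        # Cheap: returns a node describing the addition, computes nothing.
        return Lazy(op=lambda a, b: a + b, args=(self, other))

    def eval(self):
        # All deferred work happens here, on first demand.
        if self.value is None:
            self.value = self.op(*(a.eval() for a in self.args))
        return self.value

x, y = Lazy(2), Lazy(3)
z = x + y + x      # fast: only builds the expression graph
print(z.eval())    # → 7, the actual computation runs here
```

This is why benchmarks of lazily evaluated libraries should force evaluation (in ArrayFire, via af::eval plus synchronization) before stopping the timer; otherwise costs pile up and are paid at an unexpected later point, which can look like a progressive slowdown.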

What are some alternatives?

When comparing NCCL and ArrayFire you can also consider the following projects:

gloo - Collective communications library with various primitives for multi-machine training.

Thrust - [ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl

C++ Actor Framework - An Open Source Implementation of the Actor Model in C++

Boost.Compute - A C++ GPU Computing Library for OpenCL

VexCL - VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP

HPX - The C++ Standard Library for Parallelism and Concurrency

Pytorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration

xla - Enabling PyTorch on XLA Devices (e.g. Google TPU)

CUB - THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE.

Easy Creation of GnuPlot Scripts from C++ - A simple C++17 lib that helps you to quickly plot your data with GnuPlot

Taskflow - A General-purpose Parallel and Heterogeneous Task Programming System