Optimized primitives for collective multi-GPU communication
Why do you think that https://github.com/pytorch/xla is a good alternative to NCCL?