Triton Alternatives
Similar projects and alternatives to triton
- ollama
Get up and running with Kimi-K2.6, GLM-5.1, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
- TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
triton discussion
triton reviews and mentions
- Why Your PyTorch Training Crawls on a Beefy GPU (And How to Fix It)
If you need more control, you can write fused kernels yourself in Triton. For a pointwise chain, it's usually not worth it — torch.compile will fuse those for you. For attention or other patterns with cross-element communication, hand-written kernels (or things like FlashAttention) are still where the big wins live.
- Mojo 1.0 Beta
- I fixed a segfault in Triton that broke every RTX 5070/5080/5090
- Triton Plugins
- Fp8 runs ~100 tflops faster when the kernel name has "cutlass" in it
- Gluon: a GPU programming language based on the same compiler stack as Triton
It sounds like they share that goal. Gluon is a thing because the Triton team realized over the last few months that Blackwell is a significant departure from Hopper, and achieving >80% SoL kernels is becoming intractable as the Triton middle-end simply can't keep up.
Some more info in this issue: https://github.com/triton-lang/triton/issues/7392
- Nvidia is dominating the S&P 500 more than any company in at least 44 years
AMD already has [Composable Kernels](https://github.com/ROCm/composable_kernel?tab=readme-ov-file...) and supports, for example, [Triton](https://github.com/triton-lang/triton?tab=readme-ov-file#tri...). Then there is also [HIP](https://github.com/ROCm/HIP?tab=readme-ov-file#what-is-this-...), and there are tools to automatically convert from CUDA to HIP. But since CUDA is the de facto standard, there is always friction in using something else (unless you also need to support the AMD stack).
Making something merely CUDA-compatible is non-trivial, and since Nvidia decides CUDA's direction and new features, the alternatives will always lag behind. There are also currently major hardware differences between Nvidia and AMD, which may make highly optimized CUDA code inefficient or even buggy.
- FP8 is ~100 tflops faster when the kernel name has "cutlass" in it
- Fp8 runs faster when the kernel name has "cutlass" in it
- Show HN: Efficient `Torch.cdist` Using Triton
For my research, I needed to use `torch.cdist` https://docs.pytorch.org/docs/stable/generated/torch.cdist.h... with p != 2.
Turns out torch's implementation of cdist with p != 2 is prohibitively slow. I came up with an alternative implementation using Triton https://triton-lang.org/, which supports backprop and so can be used for training too.
It should be pretty plug-and-play. If you're using `torch.cdist` too, it'd be awesome if you could try it with your code and share some feedback.
Big thanks to both torch's and Triton's teams for making it easy to write and integrate custom ops. I'm new to this, and looking back, the whole process was streamlined very effectively.
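For readers unfamiliar with what `torch.cdist` computes when p != 2, here is a minimal pure-Python sketch of the pairwise Minkowski (p-norm) distance for a finite p > 0. It is only a reference for the math, not the poster's Triton kernel and not torch's implementation; the function name `minkowski_cdist` is hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def minkowski_cdist(
    x1: Sequence[Sequence[float]],
    x2: Sequence[Sequence[float]],
    p: float = 1.0,
) -> List[List[float]]:
    """Return the matrix d[i][j] = (sum_k |x1[i][k] - x2[j][k]|^p)^(1/p).

    Hypothetical reference helper: plain nested loops, no batching,
    no gradients, and only finite p > 0 is handled.
    """
    if p <= 0:
        raise ValueError("this sketch only handles finite p > 0")
    out: List[List[float]] = []
    for a in x1:
        row: List[float] = []
        for b in x2:
            if len(a) != len(b):
                raise ValueError("row vectors must have the same length")
            # Accumulate |a_k - b_k|^p over the feature dimension.
            total = sum(abs(ak - bk) ** p for ak, bk in zip(a, b))
            row.append(total ** (1.0 / p))
        out.append(row)
    return out


if __name__ == "__main__":
    points_a = [[0.0, 0.0], [1.0, 1.0]]
    points_b = [[3.0, 4.0]]
    print(minkowski_cdist(points_a, points_b, p=1.0))
    print(minkowski_cdist(points_a, points_b, p=3.0))
```

A loop like this is useful mainly as a small correctness check against a faster implementation on tiny inputs.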
Stats
triton-lang/triton is an open source project licensed under the MIT License, which is an OSI-approved license.
The primary programming language of triton is MLIR.