| | dfdx | triton |
| --- | --- | --- |
| Mentions | 22 | 42 |
| Stars | 1,826 | 16,109 |
| Growth | 0.6% | 2.4% |
| Activity | 5.8 | 9.9 |
| Last commit | 12 months ago | 7 days ago |
| Language | Rust | MLIR |
| License | GNU General Public License v3.0 or later | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
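The exact formula isn't given here, but the idea of "recent commits have higher weight" can be sketched as an exponential decay. This is a toy illustration only: the `activity_score` function, the half-life parameter, and the final percentile step are all made-up assumptions, not the site's actual metric.

```python
from datetime import datetime, timezone

def activity_score(commit_dates, half_life_days=30.0):
    """Toy recency-weighted score: each commit contributes a weight that
    halves every half_life_days. Hypothetical, not the site's formula."""
    now = datetime.now(timezone.utc)
    return sum(
        0.5 ** ((now - d).total_seconds() / 86400.0 / half_life_days)
        for d in commit_dates
    )

# The published number is relative: projects would then be ranked by this
# raw score and reported on a percentile-like scale (e.g. 9.0 ~ top 10%).
```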
dfdx
- Shape Typing in Python
- Candle: Torch Replacement in Rust
I keep checking the progress on dfdx for this reason. It does what I (and, I assume from context, you) want: it provides static checking of tensor shapes, which is fantastic. Not quite as much inference as I'd like, but I love getting compile-time errors when I've forgotten to transpose before a matmul.
It depends on the generic_const_exprs feature, which is still, to quote, "highly experimental":
https://github.com/rust-lang/rust/issues/76560
Definitely not for production use, but it gives a flavor for where things can head in the medium term, and it's .. it's nice. You could imagine future type support allowing even more inference for some intermediate shapes, of course, but even what it has now is really nice. Like this cute little convnet example:
https://github.com/coreylowman/dfdx/blob/main/examples/night...
- Dfdx: Shape Checked Deep Learning in Rust
- Are there some machine or deep learning crates on Rust?
- [Discussion] What crates would you like to see?
And for transformers it's really early days for dfdx, but it's a library that aims to sit at roughly the PyTorch level of abstraction; the difference is that it's not just written in Rust, it follows the Rust-y/functional-y philosophy of "if it compiles, it runs".
- rapl: Rank Polymorphic array library for Rust.
Wow, that is super interesting. I actually tried to use GATs at first to be generic over shapes, but I couldn't get it to work; I'm sure it will be possible in the future, though. There is this library, dfdx, that does something similar to what you mentioned, but it feels a little clumsy to me.
- Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions!
Awesome, I added an issue here https://github.com/coreylowman/dfdx/issues/597. We can discuss more there! The first step will just be adding the device and implementing tensor creation methods for it.
- In which circumstances is C++ better than Rust?
The next release of dfdx includes a CUDA device and implements many ops. The same dev created a new crate, cudarc, as a wrapper around the CUDA toolkit.
- This year I tried solving AoC using Rust, here are my impressions coming from Python!
- Deep Learning in Rust: Burn 0.4.0 released and plans for 2023
A question I have is: what are the philosophical/design differences from dfdx? As someone who's played around with dfdx and only skimmed the README of burn, it seems like dfdx leans into Rust's type system and type inference to check as much as possible at compile time. I wonder if you've had a chance to look at dfdx and would like to outline what you think the differences are. Thanks!
triton
- FP8 is ~100 tflops faster when the kernel name has "cutlass" in it
- Fp8 runs faster when the kernel name has "cutlass" in it
- Show HN: Efficient `Torch.cdist` Using Triton
For my research, I needed to use `torch.cdist` https://docs.pytorch.org/docs/stable/generated/torch.cdist.h... with p != 2.
Turns out torch's implementation of cdist with p != 2 is prohibitively slow. I came up with an alternative implementation using Triton https://triton-lang.org/, which supports backprop and so can be used for training too.
Should be pretty plug-and-play; if you're using `torch.cdist` too, it'd be awesome if you could try it with your code and share some feedback.
Big thanks to both torch's and Triton's teams for making it easy to write and integrate custom ops. I'm new to this, and looking back, the whole process was streamlined very effectively.
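The linked project handles backprop; just to give a flavor of the technique, here is a minimal forward-only Triton sketch for the p = 1 case. Everything in it (the kernel name, the one-program-per-output-element grid, the block size) is an illustrative assumption, not the linked implementation.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def l1_cdist_kernel(x_ptr, y_ptr, out_ptr, K, N, BLOCK_K: tl.constexpr):
    # out[i, j] = sum_k |x[i, k] - y[j, k]|; one program instance per (i, j).
    i = tl.program_id(0)
    j = tl.program_id(1)
    offs = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_K,), dtype=tl.float32)
    for k0 in range(0, K, BLOCK_K):  # walk the feature dimension in blocks
        mask = k0 + offs < K
        x = tl.load(x_ptr + i * K + k0 + offs, mask=mask, other=0.0)
        y = tl.load(y_ptr + j * K + k0 + offs, mask=mask, other=0.0)
        acc += tl.abs(x - y)
    tl.store(out_ptr + i * N + j, tl.sum(acc, axis=0))

def l1_cdist(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x: (M, K), y: (N, K); contiguous float32 tensors on the same GPU.
    M, K = x.shape
    N, _ = y.shape
    out = torch.empty((M, N), device=x.device, dtype=torch.float32)
    l1_cdist_kernel[(M, N)](x, y, out, K, N, BLOCK_K=256)
    return out
```

For CUDA inputs, `l1_cdist(x, y)` should agree with `torch.cdist(x, y, p=1)` up to floating-point error; a real kernel would tile over blocks of output elements rather than launching one program per element.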
- A Beautiful Technique for Some XOR Related Problems
No, but it will help you understand ML compilers :)
https://github.com/triton-lang/triton/blob/main/include/trit...
- Triton Support for Blackwell
- Automatic Warp Specialization in Triton
- Making AMD GPUs competitive for LLM inference
I want to double down on this statement and call attention to the competitive nature of it. Specifically, I have recently tried to set up Triton on ARM hardware. One might presume Nvidia would give attention to an architecture they develop, but the way forward is not easy. On some versions of Ubuntu you might have the correct version of Python (usually older than what's packaged), but the current LTS is out of luck for guidance or packages.
https://github.com/triton-lang/triton/issues/4978
- How to run llama 405b bf16 with gh200s
```sh
k=1 ssh_k  # ssh into first machine
pip install -U pip setuptools wheel ninja cmake setuptools_scm
git config --global feature.manyFiles true  # faster clones
git clone https://github.com/triton-lang/triton.git ~/shared/triton
cd ~/shared/triton/python
git checkout 755d4164  # <-- optional, tested versions
# Note that ninja already parallelizes everything to the extent possible,
# so no sense trying to change the cmake flags or anything.
python setup.py bdist_wheel
pip install --no-deps dist/*.whl  # good idea to download this too for later
python -c 'import triton; print("triton ok")'
```
- Triton Fork for Windows Support
Things might have changed since then, but I personally contributed a few Windows support-related changes back in 2021 as an independent contributor: https://github.com/triton-lang/triton/pulls?q=is%3Apr+is%3Ac...
- An Interview with AMD CEO Lisa Su About Solving Hard Problems
What are some alternatives?
burn - Burn is a next generation Deep Learning Framework that doesn't compromise on flexibility, efficiency and portability.
cutlass - CUDA Templates for Linear Algebra Subroutines
executorch - On-device AI across mobile, embedded and edge for PyTorch
cuda-python - CUDA Python: Performance meets Productivity
textsynth - An (unofficial) Rust wrapper for the TextSynth API.
Halide - a language for fast, portable data-parallel computation