triton
flexible-vectors
| | triton | flexible-vectors |
|---|---|---|
| Mentions | 30 | 4 |
| Stars | 10,981 | 43 |
| Growth | 7.9% | - |
| Activity | 9.9 | 2.8 |
| Latest commit | 3 days ago | 20 days ago |
| Language | C++ | WebAssembly |
| License | MIT License | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
triton
- OpenAI Triton: language and compiler for highly efficient Deep-Learning
-
Show HN: Ollama for Linux – Run LLMs on Linux with GPU Acceleration
There's a ton of cool opportunity in the runtime layer. I've been keeping my eye on the compiler-based approaches. From what I've gathered, many of the larger "production" inference tools use compilers:
- https://github.com/openai/triton
- Core Functionality for AMD #1983
- Project name easily confused with Nvidia triton
-
Nvidia's CUDA Monopoly
Does anyone have more inside knowledge from OpenAI or AMD on AMDGPU support for Triton?
I see this:
https://github.com/openai/triton/issues/1073
But it's not clear to me whether we will see AMD GPUs as first-class citizens for PyTorch in the future.
- @soumithchintala (Cofounded and lead @PyTorch at Meta) on Twitter: I'm fairly puzzled by $NVDA skyrocketing... (cont.)
-
The tiny corp raised $5.1M
I thought this was a good overview of the idea that Triton can circumvent the CUDA moat: https://www.semianalysis.com/p/nvidiaopenaitritonpytorch
It also looks like they added an MLIR backend to Triton, though I wonder if Mojo has advantages since it was built on MLIR? https://github.com/openai/triton/pull/1004
-
Anyone hosting a local LLM server
I'm pretty happy with the setup, because it allows me to keep all the AI stuff and its dozens of conda envs and repos etc. separate from my normal setup and "portable". It may have some performance impact (although I don't personally notice any significant difference to running it "natively" on Windows), and it may enable some extra functionality, such as access to OpenAI's Triton etc., but that's currently neither here nor there.
- Triton: Runtime for highly efficient custom Deep-Learning primitives
-
Mojo – a new programming language for all AI developers
Very cool development. There is too much busy work going from development to test to production, and this will help to unify everything. OpenAI Triton (https://github.com/openai/triton/) is going for a similar goal, but this is a more fundamental approach.
flexible-vectors
-
Mojo – a new programming language for all AI developers
Wonderful language. Only complaint (so far): SIMD should be named Vector and dispatched to whatever SIMD/vector pipeline the host offers, similar to the Flexible Vectors proposal in WASM: https://github.com/WebAssembly/flexible-vectors/blob/main/pr...
-
AVX 512 will be the future
Abstract vectorization instructions in Wasm will make life a lot easier.
https://github.com/WebAssembly/flexible-vectors/blob/main/pr... is a great proposal!
It would map to whatever hardware is available, as some sort of micro-library.
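A minimal sketch of what "mapping to whatever hardware is available" means in practice: length-agnostic code asks the machine how many lanes it can process per step instead of hard-coding a width, which is the core idea behind flexible-vectors and RVV. The `VLEN` value and helper here are illustrative stand-ins, not part of any real API.

```python
# Sketch of a length-agnostic ("strip-mined") vector loop. The code
# never hard-codes a lane count; VLEN stands in for whatever vector
# length the host's hardware reports at runtime.

VLEN = 8  # pretend the host's vector unit handles 8 lanes per step

def vec_add(a, b):
    """Add two equal-length sequences one hardware-sized chunk at a time."""
    assert len(a) == len(b)
    out = []
    i = 0
    while i < len(a):
        n = min(VLEN, len(a) - i)   # like RVV's vsetvl: take what fits
        out.extend(x + y for x, y in zip(a[i:i+n], b[i:i+n]))
        i += n
    return out

print(vec_add(list(range(10)), list(range(10))))  # each element doubled
```

The same source works unchanged whether `VLEN` is 4, 8, or 64, which is exactly the portability the proposal is after.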
-
Take More Screenshots
I think SIMD was a distraction to our conversation; most code doesn't use it, and in the future the length-agnostic flexible vectors (https://github.com/WebAssembly/flexible-vectors/blob/master/...) are a better solution. They are a lot like RVV (https://github.com/riscv/riscv-v-spec); research around vector processing is why RISC-V exists in the first place!
I was trying to find the smallest Rust Wasm interpreters I could. I should have read the source first; I only really use wasmtime, but this one looks very interesting: zero deps, zero unsafe.
16.5kloc of Rust https://github.com/rhysd/wain
The most complete Wasm env for small devices is wasm3.
20kloc of C https://github.com/wasm3/wasm3
I get what you are saying about being so small that there isn't a place for bugs to hide.
> “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.” (C.A.R. Hoare)
Even a 100 line program can't be guaranteed to be free of bugs. These programs need embedded tests to ensure that the layer below them is functioning as intended. They cannot and should not run open loop. Speaking of 300+ reimplementations, I am sure that RISC-V has already exceeded that. The smallest readable implementation is like 200 lines of code; https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/...
I don't think Wasm suffers from the base-extension issue you bring up. It will get larger, but 1.0 has the right algebraic properties to be useful forever.

Wasm does require an environment; for archival purposes that environment should be written in Wasm, with an API for instantiating more envs passed into the first env. There are two solutions to the problem of Wasm generating and calling Wasm. The first is a trampoline, where one returns Wasm from the first Wasm program, which is then re-instantiated by the outer env. The other is to pass in the API to create new Wasm envs over existing memory buffers.
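The trampoline solution described above can be sketched in miniature: a guest program cannot instantiate new modules itself, so it returns the next program's code to the outer environment, which re-instantiates and runs it. Real engines work on Wasm byte modules; here plain Python callables stand in for compiled modules, and all names are illustrative.

```python
# Hedged sketch of the trampoline pattern: the outer env keeps
# re-instantiating whatever "code" the guest hands back, until a
# guest returns a plain value instead of more code.

class Program:
    """Stand-in for a Wasm module: source that can be instantiated."""
    def __init__(self, fn):
        self.fn = fn
    def instantiate(self):
        return self.fn

def run_trampoline(program, env):
    """Instantiate and run programs until one yields a terminal value."""
    while True:
        result = program(env)
        if isinstance(result, Program):      # guest returned more "code"
            program = result.instantiate()   # outer env re-instantiates it
        else:
            return result                    # plain value: we're done

# A guest that *generates* a second-stage "module" instead of calling it:
def stage1(env):
    n = env["n"]
    return Program(lambda env: n * n)        # hand back code, not a call

print(run_trampoline(stage1, {"n": 7}))      # → 49
```

The key property is that only the outer environment ever instantiates anything, which matches the constraint that the guest has no instantiation API of its own.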
See, https://copy.sh/v86/
MS-DOS, NES or C64 are useful for archival purposes because they are dead, frozen in time along with a large corpus of software. But there is a ton of complexity in implementing those systems with enough fidelity to run software.
Lua, Typed Assembly; https://en.wikipedia.org/wiki/Typed_assembly_language and Sector Lisp; https://github.com/jart/sectorlisp seem to have the right minimalism and compactness for archival purposes. Maybe it is sectorlisp+rv32+wasm.
If there are directions you would like Wasm to go, I really recommend attending the Wasm CG meetings.
https://github.com/WebAssembly/meetings
When it comes to an archival system, I'd like it to be able to run anything from an era, not just specially crafted binaries. I think Wasm meets that goal.
https://gist.github.com/dabeaz/7d8838b54dba5006c58a40fc28da9...
-
Exploring SIMD performance improvements in WebAssembly
Thanks! Good points, I think in general the fixed-width "packed" SIMD ISAs have the downsides that you mentioned.
But it seems that WebAssembly doesn't have length-agnostic SIMD instructions yet. There is an open proposal to add this though: https://github.com/WebAssembly/flexible-vectors
What are some alternatives?
cuda-python - CUDA Python Low-level Bindings
wain - WebAssembly implementation from scratch in Safe Rust with zero dependencies
Halide - a language for fast, portable data-parallel computation
rust-wasm - A simple and spec-compliant WebAssembly interpreter
GPU-Puzzles - Solve puzzles. Learn CUDA.
wai - a Wasm interpreter written in Rust
dfdx - Deep learning in Rust, with shape checked tensors and neural networks
tropy - Research photo management
web-llm - Bringing large-language models and chat to web browsers. Everything runs inside the browser with no server support.
WasmCert-Isabelle - A mechanisation of Wasm in Isabelle.
cutlass - CUDA Templates for Linear Algebra Subroutines
simd-wasm-profiling - Exploring SIMD performance improvements in WebAssembly