open-gpu-kernel-modules
Pytorch
open-gpu-kernel-modules | Pytorch | |
---|---|---|
2 | 348 | |
759 | 79,328 | |
9.6% | 1.7% | |
7.3 | 10.0 | |
7 days ago | 3 days ago | |
C | Python | |
GNU General Public License v3.0 or later | BSD 1-Clause License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
open-gpu-kernel-modules
-
How Meta trains large language models at scale
Who is "they"?
RTX 4090s are terrible for this task. Off the top of my head:
- VRAM (obviously). Isn't that where the racks come in? Not really. Nvidia famously removed something as basic as NVLink between two cards from the 3090 to the 4090. When it comes to bandwidth between cards (crucial) even 16 lanes of PCIe 4 isn't fast enough. When you start talking about "racks" unless you're running on server grade CPUs (contributing to cost vs power vs density vs perf) you're not going to have nearly enough PCIe lanes to get very far. Even P2P over PCIe requires a hack geohot developed[0] and needless to say that's umm, less than confidence inspiring for what you would lay out ($$$) in terms of hardware, space, cooling, and power. The lack of ECC is a real issue as well.
- Form factor. Remember PCIe lanes, etc? The RTX 4090 is a ~three slot beast when using air cooling and needless to say rigging up something like the dual slot water cooled 4090s I have at scale is another challenge altogether... How are people going to wire this up? What do the enclosures/racks/etc look like? This isn't like crypto mining where cheap 1x PCIe 1x risers can be used without dramatically limiting performance.
- Performance. As grandparent comment noted 4090s are not designed for this workload. In typical usage for training I see them as 10-20% faster than an RTX 3090 at much higher cost. Compared to my H100 with SXM4 it's ridiculously slow.
- Market segmentation. Nvidia really knows what they're doing here... There are all kinds of limitations you run into with how the hardware is designed (like Tensor Core performance for inference especially).
- Issues at scale. Look at the Meta post - their biggest issues are things that are dramatically worse with consumer cards like the RTX 4090, especially when you're running with some kind of goofy PCIe cabling issue (like risers).
- Power. No matter what power limiting you employ an RTX 4090 is pretty bad for power/performance ratio. The card isn't fundamentally designed for these tasks - it's designed to run screaming for a few hours a day so gamers can push as many FPS at high res as possible. Training, inference, etc is a different beast and the performance vs power ratio for these tasks is terrible compared to A/H100. Now lets talk about the physical cabling, PSU, etc issues. Yes miners had hacks for this as well but it's yet another issue.
- Fan design. There isn't a single "blower" style RTX 4090 on the market. There was a dual-slot RTX 3090 at one point (I have a bunch of them) but Nvidia made Gigabyte pull them from the market because people were using them for this. Figuring out some kind of air-cooling setup with the fan and cooling design of the available RTX 4090 cards sounds like a complete nightmare...
- Licensing issues. Again, laying out the $$$ for this with a deployment that almost certainly violates the Nvidia EULA is a risky investment.
Three RTX 4090s (at 9 slots) to get "only" 72GB of VRAM, talking over PCIe, using 48 PCIe lanes, multi-node over sloooow ethernet (hitting CPU - slower and yet more power), using what likely ends up at ~900 watts (power limited) for significantly reduced throughput and less VRAM is ridiculous.
I'm all for creativity but deploying "racks" of 4090s for AI tasks is (frankly) flat-out stupid.
[0] - https://github.com/tinygrad/open-gpu-kernel-modules
- Tinygrad: Hacked 4090 driver to enable P2P
Pytorch
-
Top 17 Fast-Growing Github Repo of 2024
PyTorch
-
AMD's MI300X Outperforms Nvidia's H100 for LLM Inference
> their own custom stack to interact with GPUs
lol completely made up.
are you conflating CUDA the platform with the C/C++ like language that people write into files that end with .cu? because while some people are indeed not writing .cu files, absolutely no one is skipping the rest of the "stack".
source: i work at one of these "mega corps". hell if you don't believe me go look at how many CUDA kernels pytorch has https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/n....
> Everybody thinks it’s CUDA that makes Nvidia the dominant player.
it 100% does
-
Awesome List
PyTorch - An open source machine learning framework. PyTorch Tutorials - Tutorials and documentation.
-
Understanding GPT: How To Implement a Simple GPT Model with PyTorch
In this guide, we provided a comprehensive, step-by-step explanation of how to implement a simple GPT (Generative Pre-trained Transformer) model using PyTorch. We walked through the process of creating a custom dataset, building the GPT model, training it, and generating text. This hands-on implementation demonstrates the fundamental concepts behind the GPT architecture and serves as a foundation for more complex applications. By following this guide, you now have a basic understanding of how to create, train, and utilize a simple GPT model. This knowledge equips you to experiment with different configurations, larger datasets, and additional techniques to enhance the model's performance and capabilities. The principles and techniques covered here will help you apply transformer models to various NLP tasks, unlocking the potential of deep learning in natural language understanding and generation. The methodologies presented align with the advancements in transformer models introduced by Vaswani et al. (2017), emphasizing the power of self-attention mechanisms in processing sequences of data more effectively than traditional approaches (Vaswani et al., 2017). This understanding opens pathways to explore and innovate in the field of natural language processing using cutting-edge deep learning techniques (Kingma & Ba, 2015).
-
Building a Simple Chatbot using GPT model - part 2
PyTorch is a powerful and flexible deep learning framework that offers a rich set of features for building and training neural networks.
-
Clusters Are Cattle Until You Deploy Ingress
Oddly enough, sometimes, the best way to learn is by putting forth incorrect opinions or questions. Recently, while wrestling with AI project complexities, I pondered aloud whether all Docker images with AI models would inevitably be bulky due to PyTorch dependencies. To my surprise, this sparked many helpful responses, offering insights into optimizing image sizes. Being willing to be wrong opens up avenues for rapid learning.
-
Tinygrad 0.9.0
Tinygrad targets consumer hardware (to be precise, only Radeon 7900XTX and nothing else[1]), while ROCm does not actually provide good support for such hardware. For example, last release of hipBLASLt-6.1.1 library has deep integration with PyTorch[1], while working only on AMD Instinct hardware. And even for the professional hardware out there, the support period is ridiculous: AMD Instinct MI100 (2020) is not supported. Only 4 years and tens of thousands of dollars worth of hardware is going to the trash, yay!
And to be more precise, they still use some core libraries from ROCm stack[3], they just don't use all these fancy multi-gigabyte[4] hardware-limited rocBLAS/hipBLASlt/rocWMMA/rocRAND/etc. libraries.
[1] https://tinygrad.org/#tinybox
[2] https://github.com/pytorch/pytorch/issues/119081
[3] https://github.com/tinygrad/tinygrad/blob/v0.9.0/tinygrad/ru...
[4] https://repo.radeon.com/rocm/yum/6.1.1/main/
- PyTorch 2.3: User-Defined Triton Kernels, Tensor Parallelism in Distributed
-
Clasificador de imágenes con una red neuronal convolucional (CNN)
PyTorch (https://pytorch.org/)
-
AI enthusiasm #9 - A multilingual chatbot📣🈸
torch is a package to manage tensors and dynamic neural networks in python (GitHub)
What are some alternatives?
Flux.jl - Relax! Flux is the ML library that doesn't make you tensor
mediapipe - Cross-platform, customizable ML solutions for live and streaming media.
Apache Spark - Apache Spark - A unified analytics engine for large-scale data processing
flax - Flax is a neural network library for JAX that is designed for flexibility.
tinygrad - You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad]
Pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Deep Java Library (DJL) - An Engine-Agnostic Deep Learning Framework in Java
tensorflow - An Open Source Machine Learning Framework for Everyone
stable-baselines3 - PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
ROCm - AMD ROCm™ Software - GitHub Home [Moved to: https://github.com/ROCm/ROCm]
tesseract-ocr - Tesseract Open Source OCR Engine (main repository)
OpenCV - Open Source Computer Vision Library