Render neural network into CUDA/HIP code

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • AITemplate

    AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code, specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
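
    As a toy illustration of what "rendering a network into specialized source code" means (this is not AITemplate's actual API or templates, just a self-contained sketch of the code-generation idea): a small Python function that emits a CUDA-style C++ kernel with the layer's dimensions baked in as compile-time constants, so the generated source is specialized for one exact shape.

        # Toy code generator: emits CUDA-style C++ specialized for a fixed linear layer.
        # Purely illustrative; AITemplate's real templates, fusion, and scheduling are far richer.
        def render_linear_kernel(name: str, in_dim: int, out_dim: int) -> str:
            return f"""
            __global__ void {name}(const half* __restrict__ x,
                                   const half* __restrict__ w,
                                   half* __restrict__ y) {{
                // Dimensions baked in at render time: {in_dim} -> {out_dim}
                int row = blockIdx.x * blockDim.x + threadIdx.x;
                if (row >= {out_dim}) return;
                float acc = 0.f;
                for (int k = 0; k < {in_dim}; ++k)
                    acc += __half2float(x[k]) * __half2float(w[row * {in_dim} + k]);
                y[row] = __float2half(acc);
            }}
            """

        # Print the generated source for a hypothetical 768 -> 3072 projection
        print(render_linear_kernel("gemv_768_to_3072", 768, 3072))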

  • tinygrad

    Discontinued: You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad] (by geohot)

  • At first glance I thought it might be like tinygrad, but it looks like it has many more ops than tinygrad, and most of them map to underlying hardware-provided ops?

    I wonder how well tinygrad's approach will work out. Op fusion sounds easy: just walk the graph, pattern-match it, and lower to hardware-provided ops? (A minimal sketch of that idea follows below.)

    Anyway, if anyone wants to understand the philosophy behind tinygrad, this file is a great start: https://github.com/geohot/tinygrad/blob/master/docs/abstract...
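
    To make the fusion idea concrete, here is a minimal illustrative sketch of "walk the graph, pattern-match, lower" (the Node class and op names are invented for this example and are not tinygrad's real IR): a post-order walk that fuses a multiply feeding an add into a single fused multiply-add, the kind of op the hardware provides directly.

        # Minimal sketch of pattern-match-and-fuse over a tiny op graph.
        # Hypothetical IR for illustration; not tinygrad's actual representation.
        from dataclasses import dataclass, field

        @dataclass
        class Node:
            op: str                        # e.g. "LOAD", "MUL", "ADD", "FMA"
            inputs: list = field(default_factory=list)

        def fuse_mul_add(node: Node) -> Node:
            """Post-order walk: rewrite ADD(MUL(a, b), c) as the fused op FMA(a, b, c)."""
            node.inputs = [fuse_mul_add(i) for i in node.inputs]
            if node.op == "ADD" and len(node.inputs) == 2:
                for i, inp in enumerate(node.inputs):
                    if inp.op == "MUL":
                        other = node.inputs[1 - i]
                        return Node("FMA", [*inp.inputs, other])  # one hardware op instead of two
            return node

        # x*y + z lowers to a single FMA node
        g = Node("ADD", [Node("MUL", [Node("LOAD"), Node("LOAD")]), Node("LOAD")])
        print(fuse_mul_add(g).op)  # FMA

    The hard part in practice is presumably everything around this: deciding which patterns are worth fusing, handling shapes and memory layouts, and knowing which fused ops the target hardware actually provides.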

  • llama.cpp

    LLM inference in C/C++

  • lol

    LLaMA inference is memory-bound: the dominant factor in execution time is how fast the processor can transfer weights from RAM into registers. llama.cpp uses a misaligned memory access pattern that is painful for the RAM I/O interface and requires many extra CPU instructions to re-align. On Apple's highest-end chips, the CPU also only gets about half the memory bandwidth available to the GPU, which makes LLaMA *theoretically* 2x faster on the GPU without any algorithm changes.

    Funny how you announced it the exact day after I open-sourced a high-bandwidth 4-bit GEMV kernel for LLaMA. If anyone wants to see how to achieve 180+ GB/s on the M1/M2 Max, you can reproduce the methodology here:

    https://github.com/ggerganov/llama.cpp/pull/1642#issuecommen...

    I later explained the technique for writing highly optimized kernels for a specific precision. *It's very unlikely my code was copied verbatim*, but it was probably used as inspiration and/or a reference. I also disclosed a high-precision way of measuring bandwidth utilization, which is critical for writing such GPU kernels.
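
    To see why bandwidth is the figure of merit here, a back-of-the-envelope sketch (the numbers below are illustrative assumptions, not measurements): in a memory-bound decoder, every weight has to be streamed from memory once per generated token, so the token rate is roughly the achieved bandwidth divided by the bytes of weights read per token.

        # Rough memory-bound ceiling for token generation (illustrative numbers only).
        def tokens_per_second(achieved_bw_gb_s: float, params_billion: float,
                              bits_per_weight: float) -> float:
            # Every weight is read once per token in a GEMV-style decode step.
            bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
            return achieved_bw_gb_s * 1e9 / bytes_per_token

        # Hypothetical: a 7B model with 4-bit weights at 180 GB/s of achieved bandwidth
        print(tokens_per_second(180, 7, 4))  # ~51 tokens/s ceiling, ignoring activations and KV cache

    Measured the other way around, achieved bandwidth is simply the bytes actually moved divided by kernel wall-clock time, which is the utilization figure that tells you how close a kernel is to the hardware limit.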

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
