Render neural network into CUDA/HIP code

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • AITemplate

    AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code, specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
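
    As a toy illustration of what "rendering a network into specialized source code" means (this is not AITemplate's actual API or templates, just a self-contained sketch of the code-generation idea): a small Python function that emits a CUDA-style C++ kernel with the layer's dimensions baked in as compile-time constants, so the generated source is specialized for one exact shape.

        # Toy code generator: emits CUDA-style C++ specialized for a fixed linear layer.
        # Purely illustrative; AITemplate's real templates, fusion, and scheduling are far richer.
        def render_linear_kernel(name: str, in_dim: int, out_dim: int) -> str:
            return f"""
            __global__ void {name}(const half* __restrict__ x,
                                   const half* __restrict__ w,
                                   half* __restrict__ y) {{
                // Dimensions baked in at render time: {in_dim} -> {out_dim}
                int row = blockIdx.x * blockDim.x + threadIdx.x;
                if (row >= {out_dim}) return;
                float acc = 0.f;
                for (int k = 0; k < {in_dim}; ++k)
                    acc += __half2float(x[k]) * __half2float(w[row * {in_dim} + k]);
                y[row] = __float2half(acc);
            }}
            """

        # Print the generated source for a hypothetical 768 -> 3072 projection
        print(render_linear_kernel("gemv_768_to_3072", 768, 3072))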

  • tinygrad

    Discontinued: You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad] (by geohot)

  • At first glance I thought it might be like tinygrad, but it looks like it has many more ops than tinygrad, and most of them map to underlying hardware-provided ops?

    I wonder how well tinygrad's approach will work out. Op fusion sounds easy: just walk the graph, pattern-match it, and lower to hardware-provided ops? (A minimal sketch of that idea follows below.)

    Anyway, if anyone wants to understand the philosophy behind tinygrad, this file is a great start: https://github.com/geohot/tinygrad/blob/master/docs/abstract...
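
    To make the fusion idea concrete, here is a minimal illustrative sketch of "walk the graph, pattern-match, lower" (the Node class and op names are invented for this example and are not tinygrad's real IR): a post-order walk that fuses a multiply feeding an add into a single fused multiply-add, the kind of op the hardware provides directly.

        # Minimal sketch of pattern-match-and-fuse over a tiny op graph.
        # Hypothetical IR for illustration; not tinygrad's actual representation.
        from dataclasses import dataclass, field

        @dataclass
        class Node:
            op: str                        # e.g. "LOAD", "MUL", "ADD", "FMA"
            inputs: list = field(default_factory=list)

        def fuse_mul_add(node: Node) -> Node:
            """Post-order walk: rewrite ADD(MUL(a, b), c) as the fused op FMA(a, b, c)."""
            node.inputs = [fuse_mul_add(i) for i in node.inputs]
            if node.op == "ADD" and len(node.inputs) == 2:
                for i, inp in enumerate(node.inputs):
                    if inp.op == "MUL":
                        other = node.inputs[1 - i]
                        return Node("FMA", [*inp.inputs, other])  # one hardware op instead of two
            return node

        # x*y + z lowers to a single FMA node
        g = Node("ADD", [Node("MUL", [Node("LOAD"), Node("LOAD")]), Node("LOAD")])
        print(fuse_mul_add(g).op)  # FMA

    The hard part in practice is presumably everything around this: deciding which patterns are worth fusing, handling shapes and memory layouts, and knowing which fused ops the target hardware actually provides.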

  • llama.cpp

    LLM inference in C/C++

  • lol

    LLaMA inference is memory-bound: the dominant factor in execution time is how fast the processor can transfer weights from RAM into registers. llama.cpp uses a misaligned memory access pattern that is painful for the RAM I/O interface and requires many extra CPU instructions to re-align. On Apple's highest-end chips, the CPU also only gets about half the memory bandwidth available to the GPU, which makes LLaMA *theoretically* 2x faster on the GPU without any algorithm changes.

    Funny how you announced it the exact day after I open-sourced a high-bandwidth 4-bit GEMV kernel for LLaMA. If anyone wants to see how to achieve 180+ GB/s on the M1/M2 Max, you can reproduce the methodology here:

    https://github.com/ggerganov/llama.cpp/pull/1642#issuecommen...

    I later explained the technique for writing highly optimized kernels for a specific precision. *It's very unlikely my code was copied verbatim*, but it was probably used as inspiration and/or a reference. I also disclosed a high-precision way of measuring bandwidth utilization, which is critical for writing such GPU kernels.
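
    To see why bandwidth is the figure of merit here, a back-of-the-envelope sketch (the numbers below are illustrative assumptions, not measurements): in a memory-bound decoder, every weight has to be streamed from memory once per generated token, so the token rate is roughly the achieved bandwidth divided by the bytes of weights read per token.

        # Rough memory-bound ceiling for token generation (illustrative numbers only).
        def tokens_per_second(achieved_bw_gb_s: float, params_billion: float,
                              bits_per_weight: float) -> float:
            # Every weight is read once per token in a GEMV-style decode step.
            bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
            return achieved_bw_gb_s * 1e9 / bytes_per_token

        # Hypothetical: a 7B model with 4-bit weights at 180 GB/s of achieved bandwidth
        print(tokens_per_second(180, 7, 4))  # ~51 tokens/s ceiling, ignoring activations and KV cache

    Measured the other way around, achieved bandwidth is simply the bytes actually moved divided by kernel wall-clock time, which is the utilization figure that tells you how close a kernel is to the hardware limit.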

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
