This repo is an interesting example of writing a custom ML operator in CUDA. The author developed an RWKV language model that relies on a custom operator resembling a one-dimensional depthwise convolution. The model itself does not involve a large amount of computation, but it still runs slowly because PyTorch has no native support for this operator. So the author implemented the operator in CUDA and applied a set of optimization techniques, such as loop fusion and shared memory, reaching performance 20x better than the PyTorch version.
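To make the operator's shape concrete, here is a minimal pure-Python reference sketch of a causal one-dimensional depthwise convolution. It is not the repo's kernel: the function name `causal_depthwise_conv1d`, the data layout (one list per channel), and the indexing convention `out[c][t] = sum over u <= t of w[c][t - u] * x[c][u]` are all assumptions made for illustration.

```python
def causal_depthwise_conv1d(x, w):
    """Reference sketch (assumed formulation, not the repo's code).

    x: list of channels, each a list of T input values.
    w: list of channels, each a list of T kernel weights.
    Each channel is convolved only with its own kernel (depthwise),
    and position t only sees inputs at positions u <= t (causal).
    """
    out = []
    for xc, wc in zip(x, w):
        T = len(xc)
        yc = [0.0] * T
        for t in range(T):
            acc = 0.0
            # Weighted sum over the current and all earlier positions.
            for u in range(t + 1):
                acc += wc[t - u] * xc[u]
            yc[t] = acc
        out.append(yc)
    return out


# Two channels, sequence length 3 (illustrative values only).
x = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]
w = [[1.0, 0.5, 0.25], [2.0, 0.0, 0.0]]
y = causal_depthwise_conv1d(x, w)
print(y)
```

Each output element is an independent short sum, which is the kind of work a single fused GPU kernel can assign to one thread, with the per-channel weights and inputs staged in shared memory. That mapping is a plausible reading of the summary above, not a description of the author's actual kernel.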
Pure PyTorch padding