Hacker News
- How to optimize a CUDA matmul kernel for cuBLAS-like performance (2022) https://siboehm.com/articles/22/CUDA-MMM 33 comments
- How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog https://siboehm.com/articles/22/CUDA-MMM 16 comments
Linking pages
- Optimizing a WebGPU Matmul Kernel for 1TFLOP+ Performance https://zanussbaum.substack.com/p/optimizing-a-webgpu-matmul-kernel 80 comments
- How to make LLMs go fast https://vgel.me/posts/faster-inference/ 54 comments
- Fast LLM Inference From Scratch https://andrewkchan.dev/posts/yalm.html 28 comments
- GitHub - arekpaterek/Faster_SGEMM_CUDA: FP32 matrix multiplication of large square matrices in some cases faster than cuBLAS. https://github.com/arekpaterek/Faster_SGEMM_CUDA 5 comments
- Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken 3 comments
- Autotune for GPU Kernels: Ensuring Consistent Peak Performance https://burn.dev/blog/autotune-for-gpu-kernels 1 comment
- GitHub - clu0/unet.cu: UNet diffusion model in pure CUDA https://github.com/clu0/unet.cu 0 comments
- GitHub - AnswerDotAI/gpu.cpp: A lightweight library for portable low-level GPU computation using WebGPU. https://github.com/AnswerDotAI/gpu.cpp 0 comments
- How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores | Alex Armbruster https://alexarmbr.github.io/2024/08/10/How-To-Write-A-Fast-Matrix-Multiplication-From-Scratch-With-Tensor-Cores.html 0 comments
- CUDA Matrix Multiplication Optimization - Lei Mao's Log Book https://leimao.github.io/article/CUDA-Matrix-Multiplication-Optimization/ 0 comments
- Implementing a fast Tensor Core matmul on the Ada Architecture | spatters.ca https://www.spatters.ca/mma-matmul 0 comments
- Outperforming cuBLAS on H100: a Worklog https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog 0 comments
Linked pages
- https://godbolt.org 794 comments
- Excalidraw | Hand-drawn look & feel • Collaborative • Secure https://excalidraw.com/ 100 comments
- Computers can be understood - Made of Bugs https://blog.nelhage.com/post/computers-can-be-understood/ 83 comments
- Home \ Anthropic https://www.anthropic.com/ 48 comments
- [1804.06826] Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking https://arxiv.org/abs/1804.06826 32 comments
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf 27 comments
- PyTorch internals : ezyang’s blog http://blog.ezyang.com/2019/05/pytorch-internals/ 10 comments
- GitHub - openai/triton: Development repository for the Triton language and compiler https://github.com/openai/triton 5 comments
- GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines https://github.com/NVIDIA/cutlass 0 comments
Related searches:
Search whole site: site:siboehm.com
Search title: How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog