Outperforming cuBLAS on H100: a Worklog - discu.eu

Linking pages

'I paid for the whole GPU, I am going to use the whole GPU': A high-level guide to GPU utilization | Modal Blog https://modal.com/blog/gpu-utilization-guide 45 comments

Linked pages

GPUs Go Brrr · Hazy Research https://hazyresearch.stanford.edu/blog/2024-05-12-tk 267 comments
Hilbert curve - Wikipedia https://en.wikipedia.org/wiki/Hilbert_curve 66 comments
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog https://siboehm.com/articles/22/CUDA-MMM 49 comments
NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 20 comments
[2407.08608] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision https://arxiv.org/abs/2407.08608 6 comments
PTX ISA :: CUDA Toolkit Documentation https://docs.nvidia.com/cuda/parallel-thread-execution/index.html 4 comments
Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! [short] https://www.thonking.ai/p/strangely-matrix-multiplications 2 comments
bfloat16 floating-point format - Wikipedia https://en.wikipedia.org/wiki/Bfloat16_floating-point_format 1 comment
Dissecting the Ampere GPU Architecture through Microbenchmarking | GTC Digital April 2021 | NVIDIA On-Demand https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/ 0 comments

