PTX ISA :: CUDA Toolkit Documentation

Linking pages

DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead | Tom's Hardware https://www.tomshardware.com/tech-industry/artificial-intelligence/deepseeks-ai-breakthrough-bypasses-industry-standard-cuda-uses-assembly-like-ptx-programming-instead 441 comments
Rust CUDA project update | Rust GPU https://rust-gpu.github.io/blog/2025/03/18/rust-cuda-update/ 72 comments
GitHub - deepseek-ai/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling https://github.com/deepseek-ai/DeepGEMM 67 comments
Rebooting the Rust CUDA project | Rust GPU https://rust-gpu.github.io/blog/2025/01/27/rust-cuda-reboot 51 comments
Decorator JITs - Python as a DSL - Eli Bendersky's website https://eli.thegreenplace.net/2025/decorator-jits-python-as-a-dsl/ 44 comments
Dynamic Register Allocation on AMD's RDNA 4 GPU Architecture https://chipsandcheese.com/p/dynamic-register-allocation-on-amds 41 comments
On GPUs, ranges, latency, and superoptimisers · Paweł Dziepak https://pdziepak.github.io/2019/09/01/on-gpus-ranges-latency-and-superoptimisers/ 38 comments
How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores | Alex Armbruster https://alexarmbr.github.io/2024/08/10/How-To-Write-A-Fast-Matrix-Multiplication-From-Scratch-With-Tensor-Cores.html 17 comments
Nvidia GPU on bare metal NixOS Kubernetes cluster explained – Fang-Pen's coding note https://fangpenlin.com/posts/2025/03/01/nvidia-gpu-on-bare-metal-nixos-k8s-explained/ 15 comments
Overview - CUDA Python 12.0.0 documentation https://nvidia.github.io/cuda-python/overview.html 11 comments
Beating cuBLAS in Single-Precision General Matrix Multiplication https://salykova.github.io/sgemm-gpu 8 comments
TornadoVM: Accelerating Java with GPUs and FPGAs https://www.infoq.com/articles/tornadovm-java-gpu-fpga/ 5 comments
Benchmarking and Dissecting the Nvidia Hopper GPU Architecture https://arxiv.org/html/2402.13499v1 4 comments
The Longest Nvidia PTX Instruction | Ash's Blog https://ashvardanian.com/posts/longest-ptx-instruction/ 3 comments
CPP_from_1998_to_2020/Cpp-Technical-Note.md at main · burlachenkok/CPP_from_1998_to_2020 · GitHub https://github.com/burlachenkok/CPP_from_1998_to_2020/blob/main/Cpp-Technical-Note.pdf 2 comments
rNdN: Fast Query Compilation for NVIDIA GPUs | ACM Transactions on Architecture and Code Optimization https://dl.acm.org/doi/10.1145/3603503 1 comment
GitHub - gvilums/ptoxide: Virtual machine for executing CUDA PTX without a GPU https://github.com/gvilums/ptoxide 1 comment
Outperforming cuBLAS on H100: a Worklog https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog 1 comment
Modular: Democratizing AI Compute, Part 4: CUDA is the incumbent, but is it any good? https://www.modular.com/blog/democratizing-ai-compute-part-4-cuda-is-the-incumbent-but-is-it-any-good 1 comment
GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines https://github.com/NVIDIA/cutlass 0 comments