Linking pages
- DeepSeek's AI breakthrough bypasses Nvidia's industry-standard CUDA, uses assembly-like PTX programming instead | Tom's Hardware https://www.tomshardware.com/tech-industry/artificial-intelligence/deepseeks-ai-breakthrough-bypasses-industry-standard-cuda-uses-assembly-like-ptx-programming-instead 441 comments
- Rebooting the Rust CUDA project | Rust GPU https://rust-gpu.github.io/blog/2025/01/27/rust-cuda-reboot 51 comments
- Decorator JITs - Python as a DSL - Eli Bendersky's website https://eli.thegreenplace.net/2025/decorator-jits-python-as-a-dsl/ 42 comments
- On GPUs, ranges, latency, and superoptimisers · Paweł Dziepak https://pdziepak.github.io/2019/09/01/on-gpus-ranges-latency-and-superoptimisers/ 38 comments
- Overview - CUDA Python 12.0.0 documentation https://nvidia.github.io/cuda-python/overview.html 11 comments
- Beating cuBLAS in Single-Precision General Matrix Multiplication https://salykova.github.io/sgemm-gpu 8 comments
- TornadoVM: Accelerating Java with GPUs and FPGAs https://www.infoq.com/articles/tornadovm-java-gpu-fpga/ 5 comments
- Benchmarking and Dissecting the Nvidia Hopper GPU Architecture https://arxiv.org/html/2402.13499v1 4 comments
- The Longest Nvidia PTX Instruction | Ash's Blog https://ashvardanian.com/posts/longest-ptx-instruction/ 3 comments
- CPP_from_1998_to_2020/Cpp-Technical-Note.md at main · burlachenkok/CPP_from_1998_to_2020 · GitHub https://github.com/burlachenkok/CPP_from_1998_to_2020/blob/main/Cpp-Technical-Note.pdf 2 comments
- rNdN: Fast Query Compilation for NVIDIA GPUs | ACM Transactions on Architecture and Code Optimization https://dl.acm.org/doi/10.1145/3603503 1 comment
- GitHub - gvilums/ptoxide: Virtual machine for executing CUDA PTX without a GPU https://github.com/gvilums/ptoxide 1 comment
- Outperforming cuBLAS on H100: a Worklog https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog 1 comment
- Modular: Democratizing AI Compute, Part 4: CUDA is the incumbent, but is it any good? https://www.modular.com/blog/democratizing-ai-compute-part-4-cuda-is-the-incumbent-but-is-it-any-good 1 comment
- GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines https://github.com/NVIDIA/cutlass 0 comments
- Level up Your Java Performance with TornadoVM https://www.infoq.com/articles/java-performance-tornadovm/ 0 comments
- XLA: Optimizing Compiler for Machine Learning | TensorFlow https://www.tensorflow.org/xla 0 comments
- Mixed-input matrix multiplication performance optimizations – Google Research Blog https://blog.research.google/2024/01/mixed-input-matrix-multiplication.html 0 comments
- Electronics | Free Full-Text | Parallel Implementation of Lightweight Secure Hash Algorithm on CPU and GPU Environments https://www.mdpi.com/2079-9292/13/5/896 0 comments
- How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores | Alex Armbruster https://alexarmbr.github.io/2024/08/10/How-To-Write-A-Fast-Matrix-Multiplication-From-Scratch-With-Tensor-Cores.html 0 comments