GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines

Linking pages

GitHub - huggingface/candle: Minimalist ML framework for Rust https://github.com/huggingface/candle 205 comments
GitHub - ashvardanian/less_slow.cpp: Learning how to write "Less Slow" code in C++ 20, C 99, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO https://github.com/ashvardanian/less_slow.cpp 145 comments
Introducing Triton: Open-Source GPU Programming for Neural Networks https://openai.com/blog/triton/ 116 comments
GitHub - deepseek-ai/FlashMLA https://github.com/deepseek-ai/FlashMLA 108 comments
GitHub - facebookincubator/AITemplate: AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference. https://github.com/facebookincubator/AITemplate 71 comments
GitHub - deepseek-ai/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling https://github.com/deepseek-ai/DeepGEMM 67 comments
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog https://siboehm.com/articles/22/CUDA-MMM 49 comments
DeepSeek-V3 Technical Report https://arxiv.org/html/2412.19437v1 42 comments
How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores | Alex Armbruster https://alexarmbr.github.io/2024/08/10/How-To-Write-A-Fast-Matrix-Multiplication-From-Scratch-With-Tensor-Cores.html 17 comments
GitHub - mikeroyal/Unreal-Engine-Guide: Unreal Engine 5 Guide. Learn to develop games for Windows, Linux, macOS, iOS, Android, Xbox Series X|S, PlayStation 4 & 5, Nintendo Switch. https://github.com/mikeroyal/Unreal-Engine-Guide#linux-development 12 comments
GitHub - mikeroyal/Neuromorphic-Computing-Guide: Learn about the Neuromorphic engineering process of creating large-scale integration (VLSI) systems containing electronic analog circuits to mimic neuro-biological architectures. https://github.com/mikeroyal/Neuromorphic-Computing-Guide 7 comments
Implement Flash Attention Backend in SGLang - Basics and KV Cache · Biao's Blog https://hebiao064.github.io/fa3-attn-backend-basic 5 comments
The Longest Nvidia PTX Instruction | Ash's Blog https://ashvardanian.com/posts/longest-ptx-instruction/ 3 comments
GitHub - mikeroyal/Machine-Learning-Guide: Machine learning Guide. Learn all about Machine Learning Tools, Libraries, Frameworks, and Training Models. https://github.com/mikeroyal/Machine-Learning-Guide 2 comments
Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! [short] https://www.thonking.ai/p/strangely-matrix-multiplications 2 comments
GitHub - mikeroyal/Game-Console-Dev-Guide: Game Console Dev Guide. Learn to develop games for Xbox Series X|S, PlayStation 4 & 5, Nintendo Switch, Steam Deck, and Apple Silicon. https://github.com/mikeroyal/Game-Console-Dev-Guide 1 comment
Modular: Democratizing AI Compute, Part 4: CUDA is the incumbent, but is it any good? https://www.modular.com/blog/democratizing-ai-compute-part-4-cuda-is-the-incumbent-but-is-it-any-good 1 comment
Serve an interactive language model app with latency-optimized TensorRT-LLM (LLaMA 3 8B) | Modal Docs https://modal.com/docs/examples/trtllm_latency 1 comment
GitHub - NVlabs/tiny-cuda-nn: Lightning fast C++/CUDA neural network framework https://github.com/NVlabs/tiny-cuda-nn 0 comments
GitHub - mikeroyal/CUDA-Guide: CUDA Guide https://github.com/mikeroyal/CUDA-Guide 0 comments

Linked pages