Hacker News
- Using the Lamborghini of inference engines for serverless Llama 3 https://modal.com/docs/examples/trtllm_latency 1 comment
Linked pages
- Making Deep Learning go Brrrr From First Principles https://horace.io/brrr_intro.html 20 comments
- [2309.06180] Efficient Memory Management for Large Language Model Serving with PagedAttention https://arxiv.org/abs/2309.06180 16 comments
- Method Chaining is Awesome https://quanticdev.com/articles/method-chaining 1 comment
- Doherty’s Threshold Is a Lie | Flashover https://www.flashover.blog/posts/dohertys-threshold-is-a-lie 0 comments
- GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines https://github.com/NVIDIA/cutlass 0 comments
- Doherty Threshold | Laws of UX https://lawsofux.com/doherty-threshold/ 0 comments