Linking pages
- Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation | Hao AI Lab @ UCSD https://hao-ai-lab.github.io/blogs/distserve/
- Dissecting Batching Effects in GPT Inference https://le.qun.ch/en/blog/2023/05/13/transformer-batching/
- The Easiest Part of LLM Applications is the LLM https://generatingconversation.substack.com/p/the-easiest-part-of-llm-applications
- GitHub - AIoT-MLSys-Lab/Efficient-LLMs-Survey: Efficient Large Language Models: A Survey https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey
- [Paper Review] Efficient Memory Management for Large Language Model Serving with PagedAttention https://newsletter.micahlerner.com/p/paper-review-efficient-memory-management
- Efficient Memory Management for Large Language Model Serving with PagedAttention https://www.micahlerner.com/2024/01/11/efficient-memory-management-for-large-language-model-serving-with-pagedattention.html
- Accelerating AI Inference with Google Cloud TPUs and GPUs | Google Cloud Blog https://cloud.google.com/blog/products/compute/accelerating-ai-inference-with-google-cloud-tpus-and-gpus