Linking pages
- Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation | Hao AI Lab @ UCSD https://hao-ai-lab.github.io/blogs/distserve/
- Dissecting Batching Effects in GPT Inference https://le.qun.ch/en/blog/2023/05/13/transformer-batching/
- The Easiest Part of LLM Applications is the LLM https://generatingconversation.substack.com/p/the-easiest-part-of-llm-applications
- GitHub - AIoT-MLSys-Lab/Efficient-LLMs-Survey: Efficient Large Language Models: A Survey https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey
- [Paper Review] Efficient Memory Management for Large Language Model Serving with PagedAttention https://newsletter.micahlerner.com/p/paper-review-efficient-memory-management
- Efficient Memory Management for Large Language Model Serving with PagedAttention https://www.micahlerner.com/2024/01/11/efficient-memory-management-for-large-language-model-serving-with-pagedattention.html
- Accelerating AI Inference with Google Cloud TPUs and GPUs | Google Cloud Blog https://cloud.google.com/blog/products/compute/accelerating-ai-inference-with-google-cloud-tpus-and-gpus