[2309.06180] Efficient Memory Management for Large Language Model Serving with PagedAttention - discu.eu

Hacker News

Efficient Memory Management for Large Language Model Serving with PagedAttention https://arxiv.org/abs/2309.06180 16 comments 14/9/2023

Linking pages

How to make LLMs go fast https://vgel.me/posts/faster-inference/ 54 comments
Snowflake Arctic - LLM for Enterprise AI https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/ 6 comments
At the Intersection of LLMs and Kernels - Research Roundup https://charlesfrye.github.io/programming/2023/11/10/llms-systems.html 4 comments
LoRAX: The Open Source Framework for Serving 100s of Fine-Tuned LLMs in Production - Predibase https://predibase.com/blog/lorax-the-open-source-framework-for-serving-100s-of-fine-tuned-llms-in 3 comments
GitHub - HazyResearch/aisys-building-blocks: Building blocks for foundation models. https://github.com/HazyResearch/aisys-building-blocks 1 comment
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog https://blog.vllm.ai/2023/06/20/vllm.html 0 comments
Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding | FlashInfer https://flashinfer.ai/2024/02/02/cascade-inference.html 0 comments
Welcome to vLLM! — vLLM https://docs.vllm.ai/en/latest/ 0 comments
