Hacker News
- Continuous batching to increase LLM inference throughput and reduce p50 latency https://www.anyscale.com/blog/continuous-batching-llm-inference 20 comments
Linking pages
- Non-determinism in GPT-4 is caused by Sparse MoE - 152334H https://152334h.github.io/blog/non-determinism-in-gpt-4/ 181 comments
- Etched is Making the Biggest Bet in AI https://www.etched.com/announcing-etched 20 comments
- Knowing Enough About MoE to Explain Dropped Tokens in GPT-4 - 152334H https://152334h.github.io/blog/knowing-enough-about-moe/ 1 comment
- Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation | Hao AI Lab @ UCSD https://hao-ai-lab.github.io/blogs/distserve/ 1 comment
- What to Expect From Retrieval-Augmented Generation and Self-hosted LLMs | MyScale | Blog https://myscale.com/blog/what-to-expect-rag/ 0 comments
- Understanding how LLM inference works with llama.cpp https://www.omrimallis.com/posts/understanding-how-llm-inference-works-with-llama-cpp/ 0 comments
- Welcome to vLLM! — vLLM https://docs.vllm.ai/en/latest/ 0 comments