Linked pages
- Mistral 7B | Mistral AI | Open source models https://mistral.ai/news/announcing-mistral-7b/ 618 comments
- Mixtral of experts | Mistral AI | Open source models https://mistral.ai/news/mixtral-of-experts/ 300 comments
- [2401.04088] Mixtral of Experts https://arxiv.org/abs/2401.04088 151 comments
- [1701.06538] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer https://arxiv.org/abs/1701.06538 125 comments
- Understanding LSTM Networks -- colah's blog https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 64 comments
- But what is a convolution? - YouTube https://www.youtube.com/watch?v=KuXjwB4LzSA 24 comments
- What Every User Should Know About Mixed Precision Training in PyTorch | PyTorch https://pytorch.org/blog/what-every-user-should-know-about-mixed-precision-training-in-pytorch/ 24 comments
- Coefficient of variation - Wikipedia https://en.wikipedia.org/wiki/Coefficient_of_variation 21 comments
- Directed acyclic graph - Wikipedia https://en.wikipedia.org/wiki/Directed_acyclic_graph 12 comments
- [2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity https://arxiv.org/abs/2101.03961 4 comments
- Mixture of Experts Explained https://huggingface.co/blog/moe 2 comments
- Open Release of Grok-1 https://x.ai/blog/grok-os 2 comments
- Data Parallelism VS Model Parallelism in Distributed Deep Learning Training - Lei Mao's Log Book https://leimao.github.io/blog/Data-Parallelism-vs-Model-Paralelism/ 0 comments
- [2202.08906] ST-MoE: Designing Stable and Transferable Sparse Expert Models https://arxiv.org/abs/2202.08906 0 comments
- c4 · Datasets at Hugging Face https://huggingface.co/datasets/c4 0 comments
- Decoder-Only Transformers: The Workhorse of Generative LLMs https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse 0 comments