Hacker News
Linked pages
- [1701.06538] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer https://arxiv.org/abs/1701.06538 125 comments
- [1503.02531] Distilling the Knowledge in a Neural Network https://arxiv.org/abs/1503.02531 5 comments
- [2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity https://arxiv.org/abs/2101.03961 4 comments
- T5 https://huggingface.co/docs/transformers/model_doc/t5 3 comments
- [2308.00951] From Sparse to Soft Mixtures of Experts https://arxiv.org/abs/2308.00951 3 comments
- [2202.01169] Unified Scaling Laws for Routed Language Models https://arxiv.org/abs/2202.01169#deepmind 2 comments
- GitHub - stanford-futuredata/megablocks https://github.com/stanford-futuredata/megablocks 1 comment
- [1911.02150] Fast Transformer Decoding: One Write-Head is All You Need https://arxiv.org/abs/1911.02150 1 comment
- [2203.15556] Training Compute-Optimal Large Language Models https://arxiv.org/abs/2203.15556 0 comments
- [2001.08361] Scaling Laws for Neural Language Models https://arxiv.org/abs/2001.08361 0 comments
- [2010.11929] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale https://arxiv.org/abs/2010.11929 0 comments
- Sparse Expert Models (Switch Transformers, GLAM, and more... w/ the Authors) - YouTube https://youtu.be/ccBMRryxGog 0 comments
- How does GPT-3 spend its 175B parameters? - by Robert Huben https://aizi.substack.com/p/how-does-gpt-3-spend-its-175b-parameters 0 comments
- [2202.08906] ST-MoE: Designing Stable and Transferable Sparse Expert Models https://arxiv.org/abs/2202.08906 0 comments
- [2305.14705] Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models https://arxiv.org/abs/2305.14705 0 comments
- c4 · Datasets at Hugging Face https://huggingface.co/datasets/c4 0 comments