- [R] M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining https://arxiv.org/abs/2110.03888 3 comments machinelearning
Linking pages
- It Looks Like You’re Trying To Take Over The World · Gwern.net https://www.gwern.net/fiction/Clippy 33 comments
- GitHub - arpita8/Awesome-Mixture-of-Experts-Papers: Survey: A collection of AWESOME papers and resources on the latest research in Mixture of Experts. https://github.com/arpita8/Awesome-Mixture-of-Experts-Papers 17 comments