Linking pages
- SlimPajama: A 627B token cleaned and deduplicated version of RedPajama - Cerebras https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
- GitHub - NVIDIA/NeMo-Curator: Scalable data pre processing and curation toolkit for LLMs https://github.com/NVIDIA/NeMo-Curator