Hacker News
- The Pile: An 800GB dataset of diverse text for language modeling (2020) https://arxiv.org/abs/2101.00027 70 comments
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling https://arxiv.org/abs/2101.00027 5 comments
- Open source dataset for NLP https://arxiv.org/abs/2101.00027 5 comments languagetechnology
Linking pages
- Sarah Silverman is suing OpenAI and Meta for copyright infringement. - The Verge https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai 2226 comments
- Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI | WIRED https://www.wired.com/story/youtube-training-data-apple-nvidia-anthropic/ 135 comments
- GitHub - EleutherAI/gpt-neo: An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library. https://github.com/EleutherAI/gpt-neo/ 127 comments
- Announcing GPT-NeoX-20B | EleutherAI Blog https://blog.eleuther.ai/announcing-20b/ 70 comments
- GitHub - EleutherAI/gpt-neox: An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library. https://github.com/EleutherAI/gpt-neox 67 comments
- Understanding Large Language Models - by Sebastian Raschka https://magazine.sebastianraschka.com/p/understanding-large-language-models 53 comments
- Ahead of AI #11: New Foundation Models https://magazine.sebastianraschka.com/p/ahead-of-ai-11-new-foundation-models 34 comments
- Pile-T5 | EleutherAI Blog https://blog.eleuther.ai/pile-t5/ 15 comments
- GitHub - CodedotAl/gpt-code-clippy: Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57 https://github.com/CodedotAl/gpt-code-clippy 13 comments
- Fruit Of The Poisonous LLaMA? – Terence Eden’s Blog https://shkspr.mobi/blog/2023/07/fruit-of-the-poisonous-llama/ 9 comments
- Medical large language models are vulnerable to data-poisoning attacks | Nature Medicine https://www.nature.com/articles/s41591-024-03445-1 7 comments
- How to convert the SalesForce CodeGen models to GPT-J · GitHub https://gist.github.com/moyix/7896575befbe1b99162ccfec8d135566 3 comments
- GLM-130B: An Open Bilingual Pre-Trained Model | GLM-130B https://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/ 2 comments
- Text Data Augmentation for Deep Learning | Journal of Big Data | Full Text https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0 1 comment
- Rotary Embeddings: A Relative Revolution | EleutherAI Blog https://blog.eleuther.ai/rotary-embeddings/ 1 comment
- Foundation Models: The future (still) isn't happening fast enough https://www.madrona.com/foundation-models/ 1 comment
- How Good Are the Latest Open LLMs? And Is DPO Better Than PPO? https://magazine.sebastianraschka.com/p/how-good-are-the-latest-open-llms 1 comment
- GPT-3's free alternative GPT-Neo is something to be excited about | VentureBeat https://venturebeat.com/2021/05/15/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/ 0 comments
- Retrieving Real-World Email Addresses From Pretrained Natural Language Models - Unite.AI https://www.unite.ai/retrieving-real-world-email-addresses-from-pretrained-natural-language-models/ 0 comments
- Microsoft's Massive New Language AI Is Triple the Size of OpenAI’s GPT-3 https://singularityhub.com/2021/10/13/microsofts-massive-new-language-ai-is-triple-the-size-of-openais-gpt-3/ 0 comments