Linking pages
- LLMs Know More Than What They Say - by Ruby Pai https://arjunbansal.substack.com/p/llms-know-more-than-what-they-say 18 comments
- GitHub - SalvatoreRa/ML-news-of-the-week: A collection of the best ML and AI news every week (research, news, resources) https://github.com/SalvatoreRa/ML-news-of-the-week 8 comments
Linked pages
- [2401.05566] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training https://arxiv.org/abs/2401.05566 18 comments
- [2309.15840] How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions https://arxiv.org/abs/2309.15840 14 comments
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning https://transformer-circuits.pub/2023/monosemantic-features/index.html 5 comments
- [1906.01820] Risks from Learned Optimization in Advanced Machine Learning Systems https://arxiv.org/abs/1906.01820 1 comment
- [2112.00861] A General Language Assistant as a Laboratory for Alignment https://arxiv.org/abs/2112.00861 0 comments
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training \ Anthropic https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training 0 comments