Simple probes can catch sleeper agents \ Anthropic - discu.eu

Linking pages

LLMs Know More Than What They Say - by Ruby Pai https://arjunbansal.substack.com/p/llms-know-more-than-what-they-say 18 comments
GitHub - SalvatoreRa/ML-news-of-the-week: A collection of the best ML and AI news every week (research, news, resources) https://github.com/SalvatoreRa/ML-news-of-the-week 8 comments

Linked pages

[2401.05566] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training https://arxiv.org/abs/2401.05566 18 comments
[2309.15840] How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions https://arxiv.org/abs/2309.15840 14 comments
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning https://transformer-circuits.pub/2023/monosemantic-features/index.html 5 comments
[1906.01820] Risks from Learned Optimization in Advanced Machine Learning Systems https://arxiv.org/abs/1906.01820 1 comment
[2112.00861] A General Language Assistant as a Laboratory for Alignment https://arxiv.org/abs/2112.00861 0 comments
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training \ Anthropic https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training 0 comments
