Alignment faking in large language models \ Anthropic - discu.eu

Hacker News

Alignment faking in large language models https://www.anthropic.com/research/alignment-faking 353 comments 19/12/2024

Linking pages

Exclusive: New Research Shows AI Strategically Lying | TIME https://time.com/7202784/ai-research-strategic-lying/ 389 comments
How Does Claude 4 Think? — Sholto Douglas & Trenton Bricken https://www.dwarkesh.com/p/sholto-trenton-2 72 comments
10 AI Predictions For 2025 https://www.forbes.com/sites/robtoews/2024/12/22/10-ai-predictions-for-2025/ 20 comments
If Anthropic Succeeds, a Nation of Benevolent AI Geniuses Could Be Born | WIRED https://www.wired.com/story/anthropic-benevolent-artificial-intelligence/ 2 comments
2027 Intelligence Explosion: Month-by-Month Model — Scott Alexander & Daniel Kokotajlo https://www.dwarkesh.com/p/scott-daniel 1 comment
Dario Amodei — The Urgency of Interpretability https://www.darioamodei.com/post/the-urgency-of-interpretability 1 comment
AI #97: 4 - by Zvi Mowshowitz - Don't Worry About the Vase https://thezvi.substack.com/p/ai-97-4 0 comments
Six Thoughts On AI Safety – Windows On Theory https://windowsontheory.org/2025/01/24/six-thoughts-on-ai-safety/ 0 comments
A practical guide to coding securely with LLMs | sean goedecke https://www.seangoedecke.com/ai-security/ 0 comments
