Hacker News
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet https://transformer-circuits.pub/2024/scaling-monosemanticity/ 124 comments
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html 2 comments futurology
- Anthropic demonstrates the ability to extract features from a medium-sized model, some of which correlate to lying and power-seeking https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html 5 comments futurology
Linking pages
- World-first research dissects an AI's mind, and starts editing its thoughts https://newatlas.com/technology/ai-thinking-patterns/ 360 comments
- Golden Gate Claude \ Anthropic https://www.anthropic.com/news/golden-gate-claude 66 comments
- Mapping the Mind of a Large Language Model \ Anthropic https://www.anthropic.com/news/mapping-mind-language-model 2 comments
- Mapping the Mind of a Large Language Model \ Anthropic https://www.anthropic.com/research/mapping-mind-language-model 1 comment
- Prism: mapping interpretable concepts and features in a latent space of language | thesephist.com https://thesephist.com/posts/prism/ 1 comment
- Golden Gate Claude \ Anthropic https://www.anthropic.com/news/golden-gate-claude?p=2 0 comments
- I am the Golden Gate Bridge - by Zvi Mowshowitz https://thezvi.substack.com/p/i-am-the-golden-gate-bridge 0 comments
- Golden Gate Claude: What is it? - Claude101 https://claude101.com/golden-gate-claude/ 0 comments
- Some applied research problems in machine learning | thesephist.com https://thesephist.com/posts/applied-research-problems-2024/ 0 comments
- The engineering challenges of scaling interpretability \ Anthropic https://www.anthropic.com/research/engineering-challenges-interpretability 0 comments
- An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs | Adam Karvonen https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html 0 comments
- Moonshots, Malice, and Mitigations | Jesse’s Window https://blog.jessewalling.com/superintelligence/agi/ethics/whateverism/2024/06/26/moonshots-malice-and-mitigations.html 0 comments