Hacker News
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet https://transformer-circuits.pub/2024/scaling-monosemanticity/ 124 comments
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html 2 comments futurology
- Anthropic demonstrates the ability to extract features from a medium-sized model, some of which correlate to lying and power-seeking https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html 5 comments futurology
Linking pages
- World-first research dissects an AI's mind, and starts editing its thoughts https://newatlas.com/technology/ai-thinking-patterns/ 360 comments
- GitHub - PaulPauls/llama3_interpretability_sae: A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible. https://github.com/PaulPauls/llama3_interpretability_sae 98 comments
- Golden Gate Claude \ Anthropic https://www.anthropic.com/news/golden-gate-claude 66 comments
- Economics is a Field of Software Engineering https://www.maximum-progress.com/p/economics-is-a-field-of-software 8 comments
- What is AI interpretability? Artificial intelligence researchers are reverse-engineering ChatGPT, Claude, and Gemini. - Vox https://www.vox.com/future-perfect/362759/ai-interpretability-openai-claude-gemini-neuroscience 7 comments
- Gemma Scope: helping the safety community shed light on the inner workings of language models - Google DeepMind https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/ 4 comments
- An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability | Adam Karvonen https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html 3 comments
- Mapping the Mind of a Large Language Model \ Anthropic https://www.anthropic.com/news/mapping-mind-language-model 2 comments
- Prism: mapping interpretable concepts and features in a latent space of language | thesephist.com https://thesephist.com/posts/prism/ 1 comment
- A primer on sparse autoencoders - by Nick Jiang https://nickjiang.substack.com/p/a-primer-on-sparse-autoencoders 1 comment
- Return - by ITNAmatter - Engineering the Future https://jonathanpolitzki.substack.com/p/return 1 comment
- Scaling Automatic Neuron Description | Transluce AI https://transluce.org/neuron-descriptions 1 comment
- Structure, People, and Chaos - by ITNAmatter https://jonathanpolitzki.substack.com/p/structure-people-and-chaos 1 comment
- I am the Golden Gate Bridge - by Zvi Mowshowitz https://thezvi.substack.com/p/i-am-the-golden-gate-bridge 0 comments
- Golden Gate Claude: What is it? - Claude101 https://claude101.com/golden-gate-claude/ 0 comments
- Some applied research problems in machine learning | thesephist.com https://thesephist.com/posts/applied-research-problems-2024/ 0 comments
- The engineering challenges of scaling interpretability \ Anthropic https://www.anthropic.com/research/engineering-challenges-interpretability 0 comments
- Moonshots, Malice, and Mitigations | Jesse’s Window https://blog.jessewalling.com/superintelligence/agi/ethics/whateverism/2024/06/26/moonshots-malice-and-mitigations.html 0 comments
- "Mechanistic interpretability" for LLMs, explained https://seantrott.substack.com/p/mechanistic-interpretability-for 0 comments