Hacker News
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet https://transformer-circuits.pub/2024/scaling-monosemanticity/ 124 comments
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html 2 comments futurology
- Anthropic demonstrates the ability to extract features from a medium-sized model, some of which correlate to lying and power-seeking https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html 5 comments futurology
Linking pages
- World-first research dissects an AI's mind, and starts editing its thoughts https://newatlas.com/technology/ai-thinking-patterns/ 360 comments
- GitHub - PaulPauls/llama3_interpretability_sae: A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible. https://github.com/PaulPauls/llama3_interpretability_sae 98 comments
- Golden Gate Claude \ Anthropic https://www.anthropic.com/news/golden-gate-claude 66 comments
- Economics is a Field of Software Engineering https://www.maximum-progress.com/p/economics-is-a-field-of-software 8 comments
- What is AI interpretability? Artificial intelligence researchers are reverse-engineering ChatGPT, Claude, and Gemini. - Vox https://www.vox.com/future-perfect/362759/ai-interpretability-openai-claude-gemini-neuroscience 7 comments
- Gemma Scope: helping the safety community shed light on the inner workings of language models - Google DeepMind https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/ 4 comments
- An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability | Adam Karvonen https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html 3 comments
- Mapping the Mind of a Large Language Model \ Anthropic https://www.anthropic.com/news/mapping-mind-language-model 2 comments
- Prism: mapping interpretable concepts and features in a latent space of language | thesephist.com https://thesephist.com/posts/prism/ 1 comment
- A primer on sparse autoencoders - by Nick Jiang https://nickjiang.substack.com/p/a-primer-on-sparse-autoencoders 1 comment
- Return - by ITNAmatter - Engineering the Future https://jonathanpolitzki.substack.com/p/return 1 comment
- Scaling Automatic Neuron Description | Transluce AI https://transluce.org/neuron-descriptions 1 comment
- Structure, People, and Chaos - by ITNAmatter https://jonathanpolitzki.substack.com/p/structure-people-and-chaos 1 comment
- I am the Golden Gate Bridge - by Zvi Mowshowitz https://thezvi.substack.com/p/i-am-the-golden-gate-bridge 0 comments
- Golden Gate Claude: What is it? - Claude101 https://claude101.com/golden-gate-claude/ 0 comments
- Some applied research problems in machine learning | thesephist.com https://thesephist.com/posts/applied-research-problems-2024/ 0 comments
- The engineering challenges of scaling interpretability \ Anthropic https://www.anthropic.com/research/engineering-challenges-interpretability 0 comments
- Moonshots, Malice, and Mitigations | Jesse’s Window https://blog.jessewalling.com/superintelligence/agi/ethics/whateverism/2024/06/26/moonshots-malice-and-mitigations.html 0 comments
- "Mechanistic interpretability" for LLMs, explained https://seantrott.substack.com/p/mechanistic-interpretability-for 0 comments