Hacker News
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning https://transformer-circuits.pub/2023/monosemantic-features/index.html 5 comments
Linking pages
- God Help Us, Let's Try To Understand The Paper On AI Monosemanticity https://www.astralcodexten.com/p/god-help-us-lets-try-to-understand 205 comments
- GitHub - openai/transformer-debugger https://github.com/openai/transformer-debugger 120 comments
- GitHub - PaulPauls/llama3_interpretability_sae: A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible. https://github.com/PaulPauls/llama3_interpretability_sae 97 comments
- Representation Engineering Mistral-7B an Acid Trip https://vgel.me/posts/representation-engineering/ 75 comments
- Anthropic \ Decomposing Language Models Into Understandable Components https://www.anthropic.com/index/decomposing-language-models-into-understandable-components 62 comments
- AI Is a Black Box. Anthropic Figured Out a Way to Look Inside | WIRED https://www.wired.com/story/anthropic-black-box-ai-research-neurons-features/ 62 comments
- Manipulating Chess-GPT’s World Model | Adam Karvonen https://adamkarvonen.github.io/machine_learning/2024/03/20/chess-gpt-interventions.html 36 comments
- Monosemanticity at Home: My Attempt at Replicating Anthropic's Interpretability Research from Scratch https://jakeward.substack.com/p/monosemanticity-at-home-my-attempt 31 comments
- Gemma Scope: helping the safety community shed light on the inner workings of language models - Google DeepMind https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/ 4 comments
- Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken 3 comments
- GitHub - JShollaj/awesome-llm-interpretability: A curated list of Large Language Model (LLM) Interpretability resources. https://github.com/JShollaj/awesome-llm-interpretability 1 comment
- Prism: mapping interpretable concepts and features in a latent space of language | thesephist.com https://thesephist.com/posts/prism/ 1 comment
- A primer on sparse autoencoders - by Nick Jiang https://nickjiang.substack.com/p/a-primer-on-sparse-autoencoders 1 comment
- Unlocking the “black box” - by Alex Lindsay and Greg Dale https://aipoliticalpulse.substack.com/p/unlocking-the-black-box 0 comments
- Neuroscience is pre-paradigmatic. Consciousness is why https://www.theintrinsicperspective.com/p/neuroscience-is-pre-paradigmatic 0 comments
- The case for open source AI https://press.airstreet.com/p/the-case-for-open-source-ai 0 comments
- Dictionary Learning with Sparse AutoEncoders | Kola Ayonrinde https://www.kolaayonrinde.com/blog/2023/11/03/dictionary-learning.html 0 comments
- Simple probes can catch sleeper agents \ Anthropic https://www.anthropic.com/research/probes-catch-sleeper-agents 0 comments
- The engineering challenges of scaling interpretability \ Anthropic https://www.anthropic.com/research/engineering-challenges-interpretability 0 comments
- An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs | Adam Karvonen https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html 0 comments