Hacker News
- Refusal in language models is mediated by a single direction https://arxiv.org/abs/2406.11717 44 comments
Linking pages