[2209.13085] Defining and Characterizing Reward Hacking - discu.eu

Linking pages

Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken 3 comments
We Need to Control AI Agents Now - The Atlantic https://www.theatlantic.com/technology/archive/2024/07/ai-agents-safety-risks/678864/ 2 comments
Even Superhuman Go AIs Have Surprising Failure Modes | FAR AI https://far.ai/post/2023-07-superhuman-go-ais/ 0 comments

