Hacker News
- Reward Hacking in Reinforcement Learning https://lilianweng.github.io/posts/2024-11-28-reward-hacking/ 1 comment
Linked pages
- The world’s fastest framework for building websites | Hugo http://gohugo.io/ 396 comments
- Goodhart's law - Wikipedia http://en.wikipedia.org/wiki/Goodhart%27s_law 221 comments
- [2310.13548] Towards Understanding Sycophancy in Language Models https://arxiv.org/abs/2310.13548 72 comments
- Cyclomatic complexity - Wikipedia https://en.wikipedia.org/wiki/Cyclomatic_complexity 50 comments
- [1803.03453] The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities https://arxiv.org/abs/1803.03453 37 comments
- Faulty Reward Functions in the Wild https://openai.com/blog/faulty-reward-functions 17 comments
- [2105.14111] Goal Misgeneralization in Deep Reinforcement Learning https://arxiv.org/abs/2105.14111 13 comments
- GitHub - openai/procgen: Procgen Benchmark: Procedurally-Generated Game-Like Gym-Environments https://github.com/openai/procgen 8 comments
- A (Long) Peek into Reinforcement Learning | Lil'Log https://lilianweng.github.io/posts/2018-02-19-rl-overview/ 8 comments
- [1905.10615] Adversarial Policies: Attacking Deep Reinforcement Learning https://arxiv.org/abs/1905.10615 7 comments
- [2409.12822] Language Models Learn to Mislead Humans via RLHF https://arxiv.org/abs/2409.12822 7 comments
- [1606.06565] Concrete Problems in AI Safety https://arxiv.org/abs/1606.06565 3 comments
- Specification gaming examples in AI - master list - Google Drive https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml 2 comments
- Algorithms for Inverse Reinforcement Learning (Ng & Russell, ICML 2000) https://ai.stanford.edu/~ang/papers/icml00-irl.pdf 1 comment
- [1602.04938] "Why Should I Trust You?": Explaining the Predictions of Any Classifier http://arxiv.org/abs/1602.04938 1 comment
- [2004.07780] Shortcut Learning in Deep Neural Networks https://arxiv.org/abs/2004.07780 1 comment
- [2201.03544] The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models https://arxiv.org/abs/2201.03544 1 comment
- [1705.08417] Reinforcement Learning with a Corrupted Reward Channel https://arxiv.org/abs/1705.08417 0 comments
- [2210.10760] Scaling Laws for Reward Model Overoptimization https://arxiv.org/abs/2210.10760 0 comments
- Chatbot Arena Conversation Dataset Release | LMSYS Org https://lmsys.org/blog/2023-07-20-dataset/ 0 comments