Hacker News
- Reward Hacking in Reinforcement Learning https://lilianweng.github.io/posts/2024-11-28-reward-hacking/ 1 comment
Linked pages
- The world’s fastest framework for building websites | Hugo http://gohugo.io/ 396 comments
- Goodhart's law - Wikipedia http://en.wikipedia.org/wiki/Goodhart%27s_law 221 comments
- [2310.13548] Towards Understanding Sycophancy in Language Models https://arxiv.org/abs/2310.13548 72 comments
- Cyclomatic complexity - Wikipedia https://en.wikipedia.org/wiki/Cyclomatic_complexity 50 comments
- [1803.03453] The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities https://arxiv.org/abs/1803.03453 37 comments
- Faulty Reward Functions in the Wild https://openai.com/blog/faulty-reward-functions 17 comments
- [2105.14111] Goal Misgeneralization in Deep Reinforcement Learning https://arxiv.org/abs/2105.14111 13 comments
- GitHub - openai/procgen: Procgen Benchmark: Procedurally-Generated Game-Like Gym-Environments https://github.com/openai/procgen 8 comments
- A (Long) Peek into Reinforcement Learning | Lil'Log https://lilianweng.github.io/posts/2018-02-19-rl-overview/ 8 comments
- [1905.10615] Adversarial Policies: Attacking Deep Reinforcement Learning https://arxiv.org/abs/1905.10615 7 comments
- [2409.12822] Language Models Learn to Mislead Humans via RLHF https://arxiv.org/abs/2409.12822 7 comments
- [1606.06565] Concrete Problems in AI Safety https://arxiv.org/abs/1606.06565 3 comments
- Specification gaming examples in AI - master list - Google Drive https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml 2 comments
- Algorithms for Inverse Reinforcement Learning (Ng & Russell, ICML 2000) https://ai.stanford.edu/~ang/papers/icml00-irl.pdf 1 comment
- [1602.04938] "Why Should I Trust You?": Explaining the Predictions of Any Classifier http://arxiv.org/abs/1602.04938 1 comment
- [2004.07780] Shortcut Learning in Deep Neural Networks https://arxiv.org/abs/2004.07780 1 comment
- [2201.03544] The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models https://arxiv.org/abs/2201.03544 1 comment
- [1705.08417] Reinforcement Learning with a Corrupted Reward Channel https://arxiv.org/abs/1705.08417 0 comments
- [2210.10760] Scaling Laws for Reward Model Overoptimization https://arxiv.org/abs/2210.10760 0 comments
- Chatbot Arena Conversation Dataset Release | LMSYS Org https://lmsys.org/blog/2023-07-20-dataset/ 0 comments