Linked pages
- [2305.18290] Direct Preference Optimization: Your Language Model is Secretly a Reward Model https://arxiv.org/abs/2305.18290 8 comments
- LMSys Chatbot Arena Leaderboard - a Hugging Face Space by lmsys https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard 3 comments
- Specifying objectives in RLHF - by Nathan Lambert https://www.interconnects.ai/p/specifying-objectives-in-rlhf 0 comments
- [2310.12036] A General Theoretical Paradigm to Understand Learning from Human Preferences https://arxiv.org/abs/2310.12036#deepmind 0 comments
- lightonai/alfred-40b-1023 · Hugging Face https://huggingface.co/lightonai/alfred-40b-1023 0 comments
- kyutai: open science AI lab http://kyutai.org/ 0 comments
- allenai/tulu-2-dpo-70b · Hugging Face https://huggingface.co/allenai/tulu-2-dpo-70b 0 comments
- [2306.05685] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena https://arxiv.org/abs/2306.05685 0 comments