[2402.01781] When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards - discu.eu

Reddit

[R] Skeptical about LLM benchmarks telling the whole story? This paper shows how tiny tweaks to tests like MMLU can shuffle model rankings like a deck of cards. 🃏 https://arxiv.org/abs/2402.01781 12 comments 10/2/2024 machinelearning

Linking pages

GitHub - alopatenko/LLMEvaluation: A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods. https://github.com/alopatenko/LLMEvaluation 0 comments
