- [R] Skeptical about LLM benchmarks telling the whole story? This paper shows how tiny tweaks to tests like MMLU can shuffle model rankings like a deck of cards. 🃏 https://arxiv.org/abs/2402.01781 12 comments machinelearning
Linking pages
- GitHub - alopatenko/LLMEvaluation: A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods. https://github.com/alopatenko/LLMEvaluation 0 comments
Search title: [2402.01781] When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards