Hacker News
- HellaSwag: 36% of this popular large language model benchmark contains errors https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors 8 comments
- 36% of HellaSwag benchmark contains errors [D] https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors 6 comments (r/machinelearning)
Linked pages
- 30% of Google's Emotions Dataset is Mislabeled https://www.surgehq.ai/blog/30-percent-of-googles-reddit-emotions-dataset-is-mislabeled 280 comments
- Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM? https://www.surgehq.ai/blog/how-good-is-hugging-faces-bloom-a-real-world-human-evaluation-of-language-models 28 comments
- GitHub - google/BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models https://github.com/google/BIG-bench 0 comments
- How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems https://www.surgehq.ai/blog/how-we-built-it-openais-gsm8k-dataset-of-8500-math-problems 0 comments
- Stanford CRFM https://crfm.stanford.edu/2022/11/17/helm.html 0 comments