Linking pages
- The AI wars heat up with Claude 3, claimed to have “near-human” abilities | Ars Technica https://arstechnica.com/information-technology/2024/03/the-ai-wars-heat-up-with-claude-3-claimed-to-have-near-human-abilities/ 212 comments
- Will AI target your job next? - by Ash Jafari - AI Future https://aifuture.substack.com/p/will-ai-target-your-job-next 155 comments
- Our Humble Attempt at “How Much Data Do You Need to Fine-Tune” https://barryzhang.substack.com/p/our-humble-attempt-at-fine-tuning 37 comments
- The first GPT-4-class AI model anyone can download has arrived: Llama 405B | Ars Technica https://arstechnica.com/information-technology/2024/07/the-first-gpt-4-class-ai-model-anyone-can-download-has-arrived-llama-405b/ 3 comments
- Growing needs for accessing state-of-the-art reward models https://robotic.substack.com/p/open-rlhf-reward-models 0 comments
- Running LLMs in the Browser with Rust + WebGPU https://fleetwood.dev/posts/running-llms-in-the-browser 0 comments
- Google Might Have a Moat - by Will Seltzer - Intuitive AI https://intuitiveai.substack.com/p/google-might-have-a-moat 0 comments
- Outsider Thinking and the Age of AI - Quantable Analytics https://www.quantable.com/working/outsider-thinking-and-ai/ 0 comments
- GitHub - alopatenko/LLMEvaluation: A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods. https://github.com/alopatenko/LLMEvaluation 0 comments
Related searches:
- Search whole site: site:paperswithcode.com
- Search title: MMLU Benchmark (Multi-task Language Understanding) | Papers With Code