Hacker News
- AI leaderboards are no longer useful. It's time to switch to Pareto curves https://www.aisnakeoil.com/p/ai-leaderboards-are-no-longer-useful 14 comments
Linking pages
- AI scaling myths - by Arvind Narayanan and Sayash Kapoor https://www.aisnakeoil.com/p/ai-scaling-myths 21 comments
- New paper: AI agents that matter https://www.aisnakeoil.com/p/new-paper-ai-agents-that-matter 10 comments
- AI #63: Introducing Alpha Fold 3 - by Zvi Mowshowitz https://thezvi.substack.com/p/ai-63-introducing-alpha-fold-3 0 comments
Linked pages
- https://openai.com/blog/new-models-and-developer-products-announced-at-devday 568 comments
- Cheaper, Better, Faster, Stronger | Mistral AI | Frontier AI in your hands https://mistral.ai/news/mixtral-8x22b/ 243 comments
- [2402.05120] More Agents Is All You Need https://arxiv.org/abs/2402.05120 206 comments
- [2303.11366] Reflexion: an autonomous agent with dynamic memory and self-reflection https://arxiv.org/abs/2303.11366 189 comments
- [1807.03341] Troubling Trends in Machine Learning Scholarship https://arxiv.org/abs/1807.03341 62 comments
- Leakage and the Reproducibility Crisis in ML-based Science https://reproducible.cs.princeton.edu/ 37 comments
- SWE-bench http://www.swebench.com/ 6 comments
- Holistic Evaluation of Language Models (HELM) https://crfm.stanford.edu/helm/lite/latest/ 4 comments
- Pricing – Replicate https://replicate.com/pricing 3 comments
- [2303.08774] GPT-4 Technical Report https://arxiv.org/abs/2303.08774 1 comment
- The Shift from Models to Compound AI Systems – The Berkeley Artificial Intelligence Research Blog https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/ 1 comment
- Berkeley Function Calling Leaderboard (aka Berkeley Tool Calling Leaderboard) https://gorilla.cs.berkeley.edu/leaderboard.html 1 comment
- [2402.16906] LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step https://arxiv.org/abs/2402.16906 1 comment
- [2203.07814] Competition-Level Code Generation with AlphaCode https://arxiv.org/abs/2203.07814 0 comments
- Pricing — TOGETHER https://together.ai/pricing 0 comments
- HumanEval Benchmark (Code Generation) | Papers With Code https://paperswithcode.com/sota/code-generation-on-humaneval 0 comments
- Evaluating LLMs is a minefield https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/ 0 comments
- The end of the “best open LLM” - by Nathan Lambert https://www.interconnects.ai/p/compute-efficient-open-llms 0 comments
- https://twitter.com/OpenAIDevs/status/1779922566091522492 0 comments
- [2404.12272] Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences https://arxiv.org/abs/2404.12272 0 comments