Hacker News
- New secret math benchmark stumps AI models and PhDs alike https://arstechnica.com/ai/2024/11/new-secret-math-benchmark-stumps-ai-models-and-phds-alike/ 2 comments
- Researchers introduce FrontierMath, a benchmark of hundreds of original and unpublished mathematics problems crafted and vetted by expert mathematicians. Current state-of-the-art AI models can only solve under 2% of problems. This offers a rigorous test bed that can quantify progress of AI systems. https://arstechnica.com/ai/2024/11/new-secret-math-benchmark-stumps-ai-models-and-phds-alike/ 10 comments futurology
- Research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that leading AI models solve less than 2 percent of the time https://arstechnica.com/ai/2024/11/new-secret-math-benchmark-stumps-ai-models-and-phds-alike/ 2 comments technology
Linked pages
- For the second time this year, NASA’s JPL center cuts its workforce - Ars Technica https://arstechnica.com/space/2024/11/for-the-second-time-this-year-nasas-jpl-center-cuts-its-workforce/ 193 comments
- LLMs can’t perform “genuine logical reasoning,” Apple researchers suggest - Ars Technica https://arstechnica.com/ai/2024/10/llms-cant-perform-genuine-logical-reasoning-apple-researchers-suggest/ 162 comments
- FrontierMath: Evaluating Advanced Mathematical Reasoning in AI | Epoch AI https://epochai.org/frontiermath/the-benchmark 105 comments
- Google claims math breakthrough with proof-solving AI models | Ars Technica https://arstechnica.com/information-technology/2024/07/google-ai-earns-silver-medal-equivalent-at-international-mathematical-olympiad/ 17 comments
- OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini | Ars Technica https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ 0 comments