Linking pages
- Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels https://www.surgehq.ai/blog/tiktok-vs-instagram-reels-personalized-human-evaluation 263 comments
- Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM? https://www.surgehq.ai/blog/how-good-is-hugging-faces-bloom-a-real-world-human-evaluation-of-language-models 28 comments
- Evaluating ChatGPT vs. Google on 500 Search Queries https://www.surgehq.ai/blog/googles-existential-threat-chatgpt-matches-googles-performance-on-informational-search-queries-and-smashes-it-on-coding 21 comments
- HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors 14 comments
Linked pages
- Holy $#!t: Are popular toxicity models simply profanity detectors? https://www.surgehq.ai/blog/are-popular-toxicity-models-simply-profanity-detectors 298 comments
- Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance – Google AI Blog https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html 279 comments
- Solving (some) formal math olympiad problems https://openai.com/blog/formal-math/ 22 comments
- Solving Math Word Problems https://openai.com/blog/grade-school-math/ 20 comments
- [2201.11903] Chain of Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903 1 comment