Linked pages
- [2311.09247] Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks https://arxiv.org/abs/2311.09247 197 comments
- GitHub - vectara/hallucination-leaderboard: Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents https://github.com/vectara/hallucination-leaderboard 132 comments
- [2304.15004] Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004 130 comments
- GitHub - mosaicml/composer: Train neural networks up to 7x faster https://github.com/mosaicml/composer 84 comments
- [2403.18802] Long-form factuality in large language models https://arxiv.org/abs/2403.18802 76 comments
- Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4 https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard 51 comments
- [2212.09251] Discovering Language Model Behaviors with Model-Written Evaluations https://arxiv.org/abs/2212.09251 50 comments
- [2303.17564] BloombergGPT: A Large Language Model for Finance https://arxiv.org/abs/2303.17564 47 comments
- Rivet https://rivet.ironcladapp.com/ 30 comments
- [2401.05566] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training https://arxiv.org/abs/2401.05566 18 comments
- GitHub - openai/evals: A framework for evaluating large language models and an open-source registry of benchmarks https://github.com/openai/evals 16 comments
- GitHub - confident-ai/deepeval: The Evaluation Framework for LLMs https://github.com/confident-ai/deepeval 16 comments
- [2402.01781] When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards https://arxiv.org/abs/2402.01781 13 comments
- Chatbot Arena Leaderboard | LMSYS https://chat.lmsys.org/?leaderboard= 10 comments
- [2107.03374] Evaluating Large Language Models Trained on Code https://arxiv.org/abs/2107.03374 8 comments
- [2109.07958] TruthfulQA: Measuring How Models Mimic Human Falsehoods https://arxiv.org/abs/2109.07958 7 comments
- Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings | LMSYS Org https://lmsys.org/blog/2023-05-03-arena/ 7 comments
- GitHub - truera/trulens: Evaluation and Tracking for LLM Experiments https://github.com/truera/trulens 7 comments
- A.I. Has a Measurement Problem - The New York Times https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html 7 comments
- SWE-bench http://www.swebench.com/ 6 comments