Linked pages
- [2311.09247] Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks https://arxiv.org/abs/2311.09247 197 comments
- GitHub - vectara/hallucination-leaderboard: Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents https://github.com/vectara/hallucination-leaderboard 132 comments
- [2304.15004] Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004 130 comments
- GitHub - mosaicml/composer: Train neural networks up to 7x faster https://github.com/mosaicml/composer 84 comments
- [2403.18802] Long-form factuality in large language models https://arxiv.org/abs/2403.18802 76 comments
- Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4 https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard 51 comments
- [2212.09251] Discovering Language Model Behaviors with Model-Written Evaluations https://arxiv.org/abs/2212.09251 50 comments
- [2303.17564] BloombergGPT: A Large Language Model for Finance https://arxiv.org/abs/2303.17564 47 comments
- Rivet https://rivet.ironcladapp.com/ 30 comments
- [2401.05566] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training https://arxiv.org/abs/2401.05566 18 comments
- GitHub - openai/evals: A framework for evaluating large language models and an open-source registry of benchmarks https://github.com/openai/evals 16 comments
- GitHub - confident-ai/deepeval: The Evaluation Framework for LLMs https://github.com/confident-ai/deepeval 16 comments
- [2402.01781] When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards https://arxiv.org/abs/2402.01781 13 comments
- Chatbot Arena Leaderboard | LMSYS https://chat.lmsys.org/?leaderboard= 10 comments
- [2107.03374] Evaluating Large Language Models Trained on Code https://arxiv.org/abs/2107.03374 8 comments
- [2109.07958] TruthfulQA: Measuring How Models Mimic Human Falsehoods https://arxiv.org/abs/2109.07958 7 comments
- Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings | LMSYS Org https://lmsys.org/blog/2023-05-03-arena/ 7 comments
- GitHub - truera/trulens: Evaluation and Tracking for LLM Experiments https://github.com/truera/trulens 7 comments
- A.I. Has a Measurement Problem - The New York Times https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html 7 comments
- SWE-bench http://www.swebench.com/ 6 comments