CO₂ Emissions and Models Performance: Insights from the Open LLM Leaderboard Jan 9 • 18
Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard Dec 4, 2024 • 31
Letting Large Models Debate: The First Multilingual LLM Debate Competition Nov 20, 2024 • 30
BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks Jun 18, 2024 • 43
Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5000B tokens and 11 languages May 24, 2024 • 25
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models May 24, 2024 • 21
Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face May 3, 2024 • 13
The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare Apr 19, 2024 • 131
Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs Apr 16, 2024 • 15