Hugging Face's Open LLM Leaderboard v2 increases difficulty and delivers fairer scores
Hugging Face has launched Open LLM Leaderboard v2, featuring new benchmarks, normalized scoring, and community-driven features to provide more challenging and meaningful evaluations of large language models.
LLM performance is reaching a plateau and popular benchmarks are increasingly overused or saturated. As a result, evaluating and comparing different models has also become progressively more difficult. To address these challenges, Hugging Face has completely revamped its Open LLM Leaderboard, a resource that reliably evaluates different reference models using the same benchmarks in the same setup. As a testament to its popularity, the original version of the Open LLM Leaderboard has seen over 2 million unique visitors in the past 10 months.
To counter issues such as model contamination (cases in which models are trained on benchmark data or on data very similar to it) and benchmark saturation, the Open LLM Leaderboard v2 includes the following features:
- New challenging, error-free benchmarks that test capabilities of interest for real-world model performance:
  - MMLU-Pro (Massive Multitask Language Understanding - Pro): A higher-quality and more challenging version of the original MMLU. It was designed after the original was found to be noisy (some questions were unanswerable) and too easy (partly because of model contamination).
  - GPQA (Google-Proof Q&A): A highly challenging knowledge dataset created by domain experts, with questions that are difficult for non-experts to answer but easy for specialists. GPQA is only accessible through gating mechanisms to avoid data leaks and model contamination.
  - MuSR (Multistep Soft Reasoning): Tests multistep reasoning and long-range context parsing on texts of around 1K words, covering murder mysteries, object placement questions, and team allocation optimizations. Few models score better than random on this benchmark.
  - MATH Lv 5 (Mathematics Aptitude Test of Heuristics, Level 5 subset): Focuses on high-school-level competition problems, keeping only the hardest (Level 5) entries of the full MATH dataset, and accepts a generation as correct only if it fits a specific output format.
  - IFEval (Instruction Following Evaluation): A recent benchmark centered on models' instruction-following capabilities rather than the content they generate. It tests how well models can follow specific instructions such as "include keyword x" or "use format y".
  - BBH (Big Bench Hard): A curated subset of challenging problems from the BigBench dataset.
- Normalized Scoring: Scores are now normalized between the random baseline (0 points) and the maximum possible score (100 points), providing a fairer comparison across different benchmarks (a worked sketch of this rescaling follows this list).
- Updated Evaluation Suite: Developed in collaboration with EleutherAI, the underlying evaluation harness improves reproducibility and adds new features such as delta weights support and chat templates (see the usage sketch after this list).
- Maintainer's Highlight: A new category showcasing high-quality models selected by the community and the Hugging Face team, aimed at prioritizing the most useful models for evaluation.
- Community Voting: Users can now vote for models they want to see evaluated, helping prioritize the most anticipated submissions. If a model gets significant votes while the Hugging Face cluster is full, the team will consider running the model manually to accelerate its evaluation.
- Improved Interface: A faster, more user-friendly interface powered by a new Gradio leaderboard component.
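
The score normalization described above can be illustrated with a short sketch. The formula below is a plausible reading of the description (subtract the random baseline, then rescale to a 0-100 range); the baseline values used here are illustrative assumptions rather than the leaderboard's exact configuration.

```python
def normalize_score(raw_score: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Rescale a raw benchmark score so that random guessing maps to 0 and a perfect score maps to 100."""
    return 100.0 * (raw_score - random_baseline) / (max_score - random_baseline)

# Illustrative example (assumed values): a 4-option multiple-choice benchmark
# has a random baseline of 25%, so a raw accuracy of 40% normalizes to 20.
print(normalize_score(40.0, random_baseline=25.0))  # -> 20.0

# A generative benchmark where random guessing scores essentially nothing has
# a baseline of 0, so raw scores pass through unchanged.
print(normalize_score(40.0, random_baseline=0.0))   # -> 40.0
```

This is why raw scores reported elsewhere can look noticeably higher than the leaderboard's normalized averages: a model barely above chance on a multiple-choice task now contributes close to zero rather than the baseline accuracy.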
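
For reproducing individual scores locally, the updated evaluation suite relies on EleutherAI's lm-evaluation-harness. The snippet below is a minimal sketch, assuming a recent harness release that exposes the `simple_evaluate` Python entry point; the task name `leaderboard_ifeval`, the model identifier, and the batch size are placeholders/assumptions and may differ between harness versions.

```python
# Minimal sketch: running one leaderboard v2 benchmark locally with
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-72B-Instruct,dtype=bfloat16",  # placeholder model
    tasks=["leaderboard_ifeval"],                  # assumed task name for IFEval in v2
    batch_size=4,
)

# results["results"] holds per-task metrics; the exact keys depend on the task.
print(results["results"])
```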
Early testing of the new leaderboard consisted of adding and evaluating the models in the "maintainer's highlights" section. This yielded interesting results: some models, such as Qwen2-72B-Instruct, Meta's Llama-3-70B-Instruct, 01-ai's Yi-1.5-34B-Chat, Cohere's Command R+, and AbacusAI's Smaug-72B, retained a stable rating across versions of the leaderboard. Qwen2-72B-Instruct is the current leading model, with an average score of 43.02 and strong performance across a variety of tasks. Meta's Llama-3-70B-Instruct follows in second place although, interestingly, it shows a significant drop on the GPQA benchmark compared to its pre-trained counterpart (4.92 vs 19.67).
Moreover, the new benchmarks reveal some surprising trends. For instance, some chat-tuned models perform worse on mathematical tasks than their base versions, suggesting that certain fine-tuning procedures, such as making models more verbose, might impair specific capabilities. Another, more general, trend that emerged from analyzing data across all versions of the leaderboard is a move towards smaller, more efficient models achieving higher performance over time. As expected, the Open LLM Leaderboard v2 sets a lower score-wise starting point for model performance, which should provide ample room for tracking progress in the field over the coming months.