arxiv:2502.06703

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Published on Feb 10
· Submitted by akhaliq on Feb 11
#2 Paper of the day

Abstract

Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, all with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
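
To make the idea of spending extra inference compute concrete, here is a minimal sketch of best-of-N sampling, one common TTS strategy, and not necessarily the compute-optimal strategy the paper proposes. The names `best_of_n`, `toy_policy`, and `toy_prm` are hypothetical stand-ins: a real setup would call an actual policy model to generate candidates and an actual PRM to score them.

```python
import random
from typing import Callable


def best_of_n(
    question: str,
    sample: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Draw n candidate answers from the policy and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample(question, rng) for _ in range(n)]
    # Ties are broken in favour of the earliest candidate.
    return max(candidates, key=lambda c: score(question, c))


def toy_policy(question: str, rng: random.Random) -> str:
    # Hypothetical weak policy: often wrong on any single sample.
    return "4" if rng.random() < 0.3 else str(rng.randint(0, 9))


def toy_prm(question: str, answer: str) -> float:
    # Hypothetical reward model: an oracle for this toy question only.
    return 1.0 if answer == "4" else 0.0


if __name__ == "__main__":
    for n in (1, 4, 64):
        print(n, best_of_n("What is 2 + 2?", toy_policy, toy_prm, n))
```

Raising `n` is the "additional computation" knob here. In this toy setting the reward model is perfect, so more samples can only help; with a real PRM the scoring is noisy, which is one reason the best strategy can depend on the policy model, the PRM, and the problem difficulty.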

Community

TL;DR: This paper explores how to make weak models stronger using TTS approaches. It finds that the "compute-optimal" TTS strategy depends on the policy model, the PRM, and the problem difficulty, so there is no single right answer for every setup.

The paper also suggests further work on "weak-to-strong" supervision as an alternative to "strong-to-weak" supervision (what we currently do, where we have a 405B model train a 70B or 7B model), as doing so would allow for more efficient and autonomous improvements in AI...

Interesting read!
