Very small language models (SLMs) can outperform large language models (LLMs) on reasoning tasks, according to a new study by Shanghai AI Laboratory. The authors show that, with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can beat a 405B LLM on complicated math benchmarks.
The ability to deploy SLMs on complex reasoning tasks can be very useful as enterprises look for new ways to use these models in different environments and applications.
Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS," which means they are trained to "think" slowly by generating a long string of chain-of-thought (CoT) tokens.
An alternative approach is "external TTS," where model performance is enhanced with (as the name implies) outside help. External TTS is suitable for repurposing existing models for reasoning tasks without fine-tuning them further. An external TTS setup is usually composed of a "policy model," the main LLM that generates the answer, and a process reward model (PRM) that evaluates the policy model's answers. These two components are coupled together through a sampling or search method.
The simplest setup is "best-of-N," where the policy model generates multiple answers and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In "beam search," the model breaks the answer down into multiple steps. For each step, it samples multiple candidates and runs them through the PRM. It then chooses one or more suitable candidates and generates the next step of the answer. And in "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
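To make the setup concrete, here is a minimal sketch of how a policy model and a PRM can be coupled in best-of-N and beam search. This is not the authors' code: `policy`, `policy_step`, and `prm` are hypothetical stand-ins for calls to a policy LLM and a process reward model, and the sampling parameters are illustrative.

```python
from typing import Callable, List

# Hypothetical stand-ins: in a real system these would call a policy LLM
# and a process reward model (PRM), e.g. through an inference API.
PolicyFn = Callable[[str], str]          # prompt -> full answer or next step
RewardFn = Callable[[str, str], float]   # (prompt, answer so far) -> score


def best_of_n(prompt: str, policy: PolicyFn, prm: RewardFn, n: int = 8) -> str:
    """Sample n complete answers from the policy model and keep the one
    the PRM scores highest."""
    candidates = [policy(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: prm(prompt, ans))


def beam_search(prompt: str, policy_step: PolicyFn, prm: RewardFn,
                beam_width: int = 4, expansions_per_beam: int = 4,
                max_steps: int = 10) -> str:
    """Build the answer step by step: extend each partial answer with
    several candidate next steps, score the extensions with the PRM,
    and keep only the top beam_width partial answers."""
    beams: List[str] = [""]  # partial answers, empty at the start
    for _ in range(max_steps):
        extended = []
        for partial in beams:
            for _ in range(expansions_per_beam):
                next_step = policy_step(prompt + partial)
                extended.append(partial + next_step)
        extended.sort(key=lambda p: prm(prompt, p), reverse=True)
        beams = extended[:beam_width]
    return beams[0]
```

DVTS follows the same pattern but, roughly speaking, splits the search into independent subtrees so the candidate set stays diverse before the final answer is selected.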
Choosing the right TTS strategy depends on multiple factors. The study's authors carried out a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.
Their findings show that efficiency largely depends on the policy and PRM models. For example, for small policy models, search-based methods outperform best-of-N. However, for large policy models, best-of-N is more effective because the models have better reasoning capabilities and don't need a reward model to verify every step of their reasoning.
Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works best for easy problems, while beam search works better for harder problems. For policy models with between 7B and 32B parameters, diverse verifier tree search works well for easy and medium problems, and beam search works best for hard problems. But for large policy models (72B parameters and more), best-of-N is the optimal method across all difficulty levels.
Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM, and the problem difficulty to make the best use of their compute budget for solving reasoning problems.
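As a rough illustration, the sketch below encodes those rules of thumb as a strategy selector. The thresholds mirror the findings described above, but the function itself is a hypothetical example, not the paper's method, and the range between 32B and 72B is not covered by the reported results.

```python
def choose_tts_strategy(policy_params_billions: float, difficulty: str) -> str:
    """Pick an external TTS method from the rules of thumb reported in the study."""
    if policy_params_billions < 7:
        # Small policy models: best-of-N for easy problems, beam search otherwise.
        return "best_of_n" if difficulty == "easy" else "beam_search"
    if policy_params_billions <= 32:
        # Mid-sized models: DVTS for easy and medium problems, beam search for hard ones.
        return "dvts" if difficulty in ("easy", "medium") else "beam_search"
    # The reported results above this size are for 72B+ models:
    # best-of-N at every difficulty level.
    return "best_of_n"


print(choose_tts_strategy(3, "hard"))    # -> beam_search
print(choose_tts_strategy(72, "easy"))   # -> best_of_n
```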
For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complicated math benchmarks. This shows that an SLM can beat a model that is 135 times larger when the compute-optimal TTS strategy is used.
In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.
When accounting for both training and inference compute budgets, the results show that with compute-optimal scaling strategies, SLMs can outperform larger models while using 100-1000X fewer FLOPS.
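To get a feel for where a gap of that magnitude can come from, here is a hedged back-of-envelope comparison using the common approximations of about 2N FLOPs per generated token for inference and 6ND FLOPs for pretraining a model with N parameters on D tokens. The token and sample counts are illustrative assumptions, not figures from the paper.

```python
def training_flops(params: float, training_tokens: float) -> float:
    """Approximate pretraining cost: ~6 * N * D FLOPs."""
    return 6 * params * training_tokens


def inference_flops(params: float, generated_tokens: int, samples: int = 1) -> float:
    """Approximate inference cost: ~2 * N FLOPs per generated token, per sample."""
    return 2 * params * generated_tokens * samples


# Illustrative assumptions (not figures from the paper): both models are
# pretrained on ~15T tokens and answer a 2,000-token problem; the small
# model samples 64 candidates for TTS, the large model generates one answer.
small, large = 1e9, 405e9
small_total = training_flops(small, 15e12) + inference_flops(small, 2000, samples=64)
large_total = training_flops(large, 15e12) + inference_flops(large, 2000, samples=1)

print(f"large/small total-FLOPs ratio: {large_total / small_total:.0f}x")
# With these assumptions the ratio is dominated by training compute and lands
# in the hundreds, consistent with the 100-1000X range cited above.
```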
The researchers' results show that compute-optimal TTS significantly improves the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually decreases.
"This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model," the researchers write. "Specifically, for models with weak reasoning abilities, scaling test-time compute leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited."
The study validates that SLMs can perform better than larger models when compute-optimal test-time scaling methods are applied. While this study focuses on math benchmarks, the researchers plan to expand their work to other reasoning tasks, such as coding and chemistry.