In a new case study, Hugging Face researchers have demonstrated how small language models (SLMs) can be configured to outperform much larger models. Their findings show that a 3B-parameter Llama 3 model can outperform the 70B version of the same family on complex math problems.
Hugging Face has fully documented the entire process and provides a roadmap for companies that want to create their own custom reasoning models.
The work is inspired by OpenAI o1, which uses additional “thinking” to solve complex math, coding, and reasoning problems.
The key idea behind models like o1 is to scale “test-time compute,” which effectively means using more compute cycles during inference to test and verify different answers and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model.
Since o1 is a private model and OpenAI has remained quiet about its internal workings, researchers have been speculating about how it works and trying to reverse engineer the process. There are already several open alternatives to o1.
Hugging Face’s work builds on a DeepMind study published in August, which investigates the trade-offs between inference-time and pre-training compute. The study provides comprehensive guidelines on how to balance training and inference compute to get the best results on a fixed budget.
Beyond the extra inference-time compute, the success of the technique depends on two key components: a reward model that evaluates the SLM’s answers and a search algorithm that optimizes the path it takes to refine those answers.
The simplest way to use test-time scaling is “majority voting,” where the same prompt is sent to the model multiple times and the most common answer is chosen. On simple problems, majority voting can be useful, but its gains quickly plateau on complex reasoning problems or on tasks where errors are consistent across generations.
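To make the idea concrete, here is a minimal Python sketch of majority voting; `generate_answer` is a hypothetical wrapper around whatever sampling call your model stack exposes, not part of any particular library.

```python
# Minimal sketch of majority voting: sample N answers, keep the most common one.
from collections import Counter

def majority_vote(prompt: str, generate_answer, n: int = 16) -> str:
    # generate_answer(prompt) is assumed to return a final answer string.
    answers = [generate_answer(prompt) for _ in range(n)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```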
A more advanced reasoning method is “best-of-N.” In this technique, the SLM generates multiple answers, but instead of majority voting, a reward model is used to evaluate the answers and choose the best one. “Weighted best-of-N,” a more nuanced version of this method, takes consistency into account, favoring answers that both score highly and occur more frequently than others.
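The difference between the two variants can be sketched in a few lines; `generate_answer` and `score_answer` below are hypothetical wrappers around the policy model and the reward model, respectively.

```python
# Minimal sketch of best-of-N and weighted best-of-N selection.
from collections import defaultdict

def best_of_n(prompt, generate_answer, score_answer, n=16):
    # Plain best-of-N: keep the single highest-scoring candidate.
    candidates = [generate_answer(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score_answer(prompt, answer))

def weighted_best_of_n(prompt, generate_answer, score_answer, n=16):
    # Weighted best-of-N: sum reward scores over identical answers, so an answer
    # that is both frequent and well rated beats a one-off high scorer.
    candidates = [generate_answer(prompt) for _ in range(n)]
    totals = defaultdict(float)
    for answer in candidates:
        totals[answer] += score_answer(prompt, answer)
    return max(totals, key=totals.get)
```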
The researchers used a “process reward model” (PRM) that scores the SLM’s response not only on the final answer but also on the intermediate steps it takes to reach it. Their experiments showed that weighted best-of-N combined with a PRM brought Llama-3.2 1B close to the level of Llama-3.1 8B on the difficult MATH-500 benchmark.
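In practice, a PRM scores each partial solution rather than only the finished one. The sketch below assumes a hypothetical `score_step` helper that returns the PRM’s score for a solution prefix; how the per-step scores are aggregated (last step, minimum, or product) is a design choice left open here.

```python
# Minimal sketch of scoring a multi-step solution with a process reward model (PRM).
def prm_score(question: str, steps: list[str], score_step) -> float:
    # score_step(question, partial_solution) is assumed to return the PRM's
    # score for everything generated so far.
    prefix_scores = []
    for i in range(1, len(steps) + 1):
        partial_solution = "\n".join(steps[:i])
        prefix_scores.append(score_step(question, partial_solution))
    # Aggregation is a design choice: the last-step score is shown here;
    # min(prefix_scores) or the product of the scores are common alternatives.
    return prefix_scores[-1]
```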
To further improve the model’s performance, the researchers added search algorithms to the model’s reasoning process. Instead of generating the answer in a single pass, they used “beam search,” an algorithm that guides the model’s answer-generation process step by step.
At each step, the SLM generates multiple partial responses. The search algorithm uses the reward model to evaluate them and chooses a subset worth exploring further. The process is repeated until the model exhausts its inference budget or arrives at the correct answer. In this way, the inference budget is concentrated on the most promising reasoning paths.
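A step-wise beam search guided by a PRM can be sketched as follows; `generate_step`, `score_partial`, and `is_complete` are hypothetical helpers, and the beam width, samples per beam, and step limit stand in for the inference budget.

```python
# Minimal sketch of PRM-guided beam search over reasoning steps.
def beam_search(question, generate_step, score_partial, is_complete,
                beam_width=4, samples_per_beam=4, max_steps=10):
    beams = [""]  # partial solutions, starting from an empty one
    for _ in range(max_steps):
        # Expand: sample several candidate next steps for every current beam.
        candidates = []
        for partial in beams:
            for _ in range(samples_per_beam):
                candidates.append(partial + generate_step(question, partial))
        # Prune: keep only the highest-scoring partial solutions.
        candidates.sort(key=lambda c: score_partial(question, c), reverse=True)
        beams = candidates[:beam_width]
        # Stop early once the best beam is a finished solution.
        if is_complete(beams[0]):
            break
    return beams[0]
```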
The researchers found that while beam search improves model performance on complex problems, it tends to underperform other techniques on simple problems. To address this challenge, they added two more elements to their inference strategy.
The first was diverse verifier tree search (DVTS), a variant of beam search that ensures the SLM does not get stuck on false reasoning paths and diversifies its response branches. Second, they developed a “compute-optimal scaling strategy,” as suggested in the DeepMind paper, which dynamically chooses the best test-time scaling strategy based on the difficulty of the input problem.
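The compute-optimal part can be pictured as a simple dispatch policy: estimate how hard a prompt is, then apply whichever strategy and budget performed best at that difficulty in offline experiments. The `estimate_difficulty` helper and the strategy table below are hypothetical placeholders; the actual mapping is derived empirically in the study.

```python
# Heavily simplified sketch of a compute-optimal test-time scaling policy.
def compute_optimal_solve(question, estimate_difficulty, strategies, best_config_for):
    # estimate_difficulty(question) is assumed to return a bucket such as
    # "easy", "medium", or "hard".
    difficulty = estimate_difficulty(question)
    # best_config_for maps each difficulty bucket to the (strategy, budget) pair
    # that won in offline sweeps, e.g. {"easy": ("weighted_best_of_n", 16), ...}.
    strategy_name, budget = best_config_for[difficulty]
    return strategies[strategy_name](question, budget)
```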
The combination of these techniques allowed Llama-3.2 1B to punch above its weight and outperform the 8B model by a significant margin. The researchers also found that the strategy was scalable: when applied to Llama-3.2 3B, it outperformed the much larger 70B model.
Scaling test-time compute changes the cost dynamics of deploying models. Companies now have the ability to choose where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.
However, test-time scaling also has its limitations. For example, in the experiments carried out by Hugging Face, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (even if it is much more resource-efficient than the 70B model). The researchers acknowledge that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own answers rather than relying on an external verifier. This remains an open area of research.
The test-time scaling technique presented in this study is also limited to problems where the answer can be clearly evaluated, such as coding and mathematics. Creating reward models and verifiers for subjective tasks such as creative writing and product design requires more research.
What is clear is that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to emerge in the coming months. Companies will do well to keep an eye on how the landscape evolves.