Less is more: UC Berkeley and Google unlock the potential of LLMs through simple sampling




A new paper by researchers at Google Research and the University of California, Berkeley, shows that a surprisingly simple test-time scaling approach can boost the reasoning abilities of large language models (LLMs). The key? Scaling up sampling-based search, a technique that relies on generating multiple responses and using the model itself to verify them.

The core finding is that even a minimalist implementation of sampling-based search, using random sampling and self-verification, can lift the reasoning performance of models such as Gemini 1.5 Pro beyond that of o1-preview on popular benchmarks. The findings can have important implications for enterprise applications and challenge the assumption that highly specialized training or complex architectures are always necessary to achieve top-tier performance.

The limits of current test-time scaling

The current popular method for test-time scaling in LLMs is to train the model through reinforcement learning to generate longer responses with chain-of-thought (CoT) traces. This approach is used in models such as OpenAI o1 and DeepSeek-R1. While beneficial, these methods usually require substantial investment in the training phase.

Another test-time scaling method is “self-consistency,” where the model generates multiple responses to the query and chooses the answer that appears most often. Self-consistency hits its limits on complex problems, since in those cases the most repeated answer is not necessarily the correct one.
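Mechanically, self-consistency is just a majority vote over sampled answers. Here is a minimal sketch, where `generate` is a hypothetical stand-in for any LLM call that returns a final answer string:

```python
from collections import Counter

def self_consistency(generate, prompt, n=20):
    """Return the most frequent answer among n samples (majority vote)."""
    # `generate` is assumed to sample at a nonzero temperature, so repeated
    # calls with the same prompt can yield different answers.
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```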

Sampling-based search offers a simpler and highly scalable alternative for test-time scaling: let the model generate multiple responses and select the best one through a verification mechanism. Sampling-based search can complement other test-time compute scaling strategies and, as the researchers write in their paper, it “also has the unique advantage of being embarrassingly parallel and allowing arbitrary scaling: simply sample more responses.”

More importantly, sampling-based search can be applied to any LLM, including those that have not been explicitly trained for reasoning.

How sampling-based search works

The researchers focus on a minimalist implementation of sampling-based search, using a language model both to generate candidate responses and to verify them. This is a “self-verification” process, where the model evaluates its own outputs without relying on external ground-truth answers or symbolic verification systems.

[Figure: Sampling-based search. Credit: VentureBeat]

The algorithm works in a few simple steps:

1 – The algorithm begins by generating a set of candidate solutions to the problem using a language model. This is done by giving the model the same prompt multiple times at a nonzero temperature setting to create a diverse set of responses.

2 – Each candidate response goes through a verification process in which the LLM is prompted multiple times to determine whether the answer is correct. The verification results are then averaged to produce a final verification score for the response.

3 – The algorithm selects the highest-scoring answer as the final response. If multiple candidates score within a narrow margin of one another, the LLM is prompted to compare them pairwise and choose the best one. The answer that wins the most pairwise comparisons becomes the final response.

The researchers considered two key axes for test-time scaling, both visible in the sketch after this list:

Sampling: the number of responses the model generates for each input problem.

Verification: the number of verification scores computed for each generated solution.
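Putting the steps together, here is a minimal sketch of the procedure described above. The `generate`, `verify`, and `compare` callables are hypothetical stand-ins for LLM calls (sampling a candidate, returning a 0/1 correctness judgment, and returning the preferred of two answers, respectively), and the tie margin is an illustrative choice, not a value from the paper:

```python
from statistics import mean

def sampling_based_search(generate, verify, compare, prompt,
                          n_samples=200, n_verifications=50, tie_margin=0.05):
    # Step 1 / sampling axis: draw many candidate answers for the same prompt.
    candidates = [generate(prompt) for _ in range(n_samples)]

    # Step 2 / verification axis: score each distinct candidate by averaging
    # repeated self-verification judgments (1.0 = judged correct, 0.0 = not).
    scores = {c: mean(verify(prompt, c) for _ in range(n_verifications))
              for c in set(candidates)}

    # Step 3: take the top-scoring answer; if rivals score within a narrow
    # margin, break the tie with pairwise comparisons and keep the answer
    # that wins the most of them. `compare` must return one of its two
    # candidate arguments.
    best = max(scores, key=scores.get)
    rivals = [c for c in scores
              if c != best and scores[best] - scores[c] <= tie_margin]
    if rivals:
        contenders = [best] + rivals
        wins = {c: 0 for c in contenders}
        for i, a in enumerate(contenders):
            for b in contenders[i + 1:]:
                wins[compare(prompt, a, b)] += 1
        best = max(wins, key=wins.get)
    return best
```

Because every `generate` and `verify` call is independent, the whole procedure parallelizes trivially, which is the “embarrassingly parallel” property the authors highlight.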

How sampling-based search compares to other techniques

The study found that reasoning performance continues to improve with sampling-based search even when test-time compute is scaled far beyond the point where self-consistency saturates.

At sufficient scale, this minimalist implementation significantly boosts reasoning accuracy on benchmarks such as AIME and MATH. For example, Gemini 1.5 Pro’s performance surpassed that of o1-preview, which has been explicitly trained on reasoning problems, and Gemini 1.5 Flash surpassed Gemini 1.5 Pro.

“This not only highlights the importance of sampling-based search for scaling capability, but also suggests the utility of sampling-based search as a simple baseline on which to compare other test-time compute scaling strategies and measure genuine improvements in models’ search capabilities,” the researchers write.

It is worth noting that while the results of sampling-based search are impressive, the costs can also become prohibitive. For example, with 200 samples and 50 verification steps per sample, a query from AIME generates around 130 million tokens, which costs $650 with Gemini 1.5 Pro. However, this is a very minimalist approach to sampling-based search, and it is compatible with the optimization techniques proposed in other studies. With smarter sampling and verification methods, inference costs can be reduced considerably by using smaller models and generating fewer tokens. For example, using Gemini 1.5 Flash to perform the verification brings the cost down to $12 per question.
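As a back-of-envelope check on those figures (a sketch only; the per-call token count and the per-token price are assumptions chosen to reproduce the article’s numbers, not official rates):

```python
# 200 candidates x 50 verification calls each = 10,000 LLM calls per question.
n_samples = 200
verifications_per_sample = 50
tokens_per_call = 13_000      # assumed average tokens per verification call
total_tokens = n_samples * verifications_per_sample * tokens_per_call

price_per_million = 5.00      # assumed blended price in $ per 1M tokens
print(f"{total_tokens:,} tokens ~= ${total_tokens / 1e6 * price_per_million:,.0f}")
# -> 130,000,000 tokens ~= $650
```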

Effective self-verification strategies

There is an ongoing debate over whether LLMs can verify their own answers. The researchers identified two key strategies for improving self-verification using test-time compute (a sketch combining both follows the list):

Directly comparing response candidates: Disagreements between candidate solutions are a strong signal of possible errors. By providing the verifier with multiple responses to compare, the model can better identify mistakes and hallucinations, addressing a core weakness of LLMs. The researchers describe this as an instance of “implicit scaling.”

Task-specific rewriting: The researchers propose that the optimal output style of an LLM depends on the task. Chain of thought is effective for solving reasoning tasks, but responses are easier to verify when written in a more formal, mathematically conventional style. Verifiers can rewrite candidate responses into a more structured format (e.g., lemma-theorem-proof) before evaluating them.
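Here is a sketch of how a verifier might combine the two strategies; the `llm` callable and the prompt wording are illustrative assumptions, not the paper’s exact templates:

```python
def self_verify(llm, problem, candidate, rivals):
    # Strategy 1 (implicit scaling): show the verifier rival candidates so
    # that disagreements between them surface likely errors.
    context = "\n\n".join(f"Candidate {i + 1}:\n{r}"
                          for i, r in enumerate(rivals))

    # Strategy 2 (task-specific rewriting): restate the candidate in a more
    # formal, easier-to-check style before asking for a verdict.
    rewritten = llm(
        "Rewrite this solution in a formal lemma-theorem-proof style, "
        f"keeping the final answer unchanged:\n{candidate}")

    verdict = llm(
        f"Problem:\n{problem}\n\nOther candidate solutions:\n{context}\n\n"
        f"Solution to check:\n{rewritten}\n\n"
        "Is the solution to check correct? Answer yes or no.")
    return verdict.strip().lower().startswith("yes")
```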

“We anticipate model self-verification capabilities to rapidly improve in the short term, as models learn to leverage the principles of implicit scaling and output style suitability, and drive improved scaling rates for sampling-based search,” the researchers write.

Implications for real-world applications

The study demonstrates that a relatively simple technique can achieve impressive results, potentially reducing the need for complex and expensive architectures or training regimes.

It is also a scalable technique: companies can increase performance by allocating more compute resources to sampling and verification, and developers can push frontier language models beyond their limits on complex tasks.

“Given that it complements other test-time compute scaling strategies, is parallelizable and allows arbitrary scaling, and admits simple implementations that are demonstrably effective, we expect sampling-based search to play a crucial role as language models are tasked with solving increasingly complex problems with increasingly large compute budgets,” the researchers write.

