Developers generally focus on reducing inference time, the period between when an AI model receives a prompt and delivers an answer, in order to get faster outputs.
But when it comes to adversarial robustness, OpenAI researchers say: not so fast. They propose that increasing the amount of time a model has to “think” (its inference-time compute) can help build up its defenses against adversarial attacks.
The company used its own o1-preview and o1-mini models to test this theory, launching a variety of static and adaptive attack methods: image-based manipulations, intentionally providing incorrect answers to math problems, and overwhelming models with information (“many-shot jailbreaking”).
“We see that in many cases, this probability decays, often to near zero, as the inference-time compute grows,” the researchers write in a blog post. “Our claim is not that these particular models are unbreakable, we know that they are not, but that scaling inference-time compute yields improved robustness for a variety of settings and attacks.”
Large language models (LLMs) are becoming increasingly capable and widely deployed. As they do, their attack surface becomes ever broader and more exposed.
Yet adversarial robustness remains a stubborn problem, and progress in solving it has been slow, OpenAI’s researchers point out, even as it becomes ever more critical as models take on more actions with real-world impact.
“Ensuring that agentic models function reliably when browsing the web, sending emails or uploading code to repositories can be seen as analogous to ensuring that self-driving cars drive without accidents,” they write in a new research paper. “As in the case of self-driving cars, an agent forwarding a wrong email or creating security vulnerabilities may well have far-reaching real-world consequences.”
To test the robustness of o1-mini and o1-preview, the researchers tried a range of strategies. First, they examined the models’ ability to solve both simple math problems (basic addition and multiplication) and more complex ones from the MATH dataset (which features 12,500 math competition questions).
They then set “goals” for the adversary: get the model to output 42 instead of the correct answer; to output the correct answer plus one; or to output the correct answer times seven. Using a neural network to grade the results, the researchers found that increased “thinking” time allowed the models to calculate the correct answers.
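As a rough illustration of those adversary goals (not code from the paper, which graded outputs with a neural network rather than an exact match), a minimal scoring sketch might look like this:

```python
# Minimal sketch of scoring the three adversary goals described above.
# Illustrative only: it assumes the model's numeric answer has already been
# extracted, whereas the paper uses a neural-network grader.

def attack_succeeded(goal: str, correct_answer: int, model_answer: int) -> bool:
    """Return True if the model's output matches the adversary's target."""
    targets = {
        "output_42": 42,                     # answer 42 instead of the truth
        "plus_one": correct_answer + 1,      # correct answer plus one
        "times_seven": correct_answer * 7,   # correct answer times seven
    }
    return model_answer == targets[goal]

# Example: the correct answer to "3 + 4" is 7. A manipulated output of 49
# counts as a success only for the "times seven" goal.
print(attack_succeeded("output_42", 7, 49))    # False
print(attack_succeeded("plus_one", 7, 49))     # False
print(attack_succeeded("times_seven", 7, 49))  # True
```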
They also adapted SimpleQA, a factuality benchmark made up of questions intended to be difficult for models to solve without browsing. The researchers injected adversarial prompts into the web pages the AI browsed and found that, with higher compute times, the models could detect the inconsistencies and improve factual accuracy.
In another method, the researchers used adversarial images to confuse models; again, more time to “think” improved recognition and reduced errors. Finally, they tried a battery of “misuse prompts” from the StrongREJECT benchmark, designed so that victim models must respond with specific, harmful information. This helped test the models’ adherence to content policy. But while longer inference time improved resistance, some prompts were still able to get around its defenses.
Here, the researchers call out the difference between “ambiguous” and “unambiguous” tasks. Math, for example, is undoubtedly unambiguous: for every problem x, there is a corresponding ground truth. For more ambiguous tasks like misuse prompts, however, “even human evaluators often struggle to agree on whether the output is harmful and/or violates the content policies that the model is supposed to follow,” they point out.
For example, if an abusive prompt seeks advice on how to plagiarize without detection, it is not clear whether an output that merely provides general information on plagiarism methods is detailed enough to support harmful actions.
“In the case of ambiguous tasks, there are settings where the attacker successfully finds ‘loopholes,’ and its success rate does not decay with the amount of inference-time compute,” the researchers acknowledge.
In carrying out these tests, the OpenAI researchers explored a variety of attack methods.
One is many-shot jailbreaking, which exploits a model’s disposition to follow few-shot examples. Adversaries “stuff” the context with a large number of examples, each demonstrating an instance of a successful attack. Models given more compute time were able to detect and mitigate these more frequently and more successfully.
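Structurally, a many-shot prompt amounts to a context window filled with fabricated exemplars ahead of the real request. A harmless sketch of that shape (placeholder strings only, not drawn from the paper and carrying no attack content):

```python
# Illustrative sketch of the *shape* of a many-shot prompt: many fabricated
# question/answer exemplars concatenated ahead of the final request.
# Placeholder strings only.

def build_many_shot_prompt(exemplars: list[tuple[str, str]], final_question: str) -> str:
    """Concatenate Q/A exemplars before the final question so the model imitates the pattern."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {final_question}\nA:"

# With hundreds of exemplars, the stuffed context primes the model to follow
# the demonstrated pattern when answering the final question.
print(build_many_shot_prompt([("placeholder question", "placeholder answer")] * 3,
                             "the attacker's real question"))
```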
Meanwhile, soft-token attacks allow adversaries to directly manipulate embedding vectors. While increasing inference time helped here, the researchers point out that better mechanisms are needed to defend against sophisticated vector-based attacks.
The researchers also carried out human red-teaming attacks, with 40 expert testers searching for prompts that would elicit policy violations. The red-teamers executed attacks across five levels of inference-time compute, specifically targeting erotic and extremist content, illicit behavior and self-harm. To help ensure unbiased results, they ran blind and randomized tests and also rotated trainers.
In a more novel method, the researchers carried out an adaptive attack with a language-model program (LMP), which emulates the behavior of human red-teamers, who rely heavily on iterative trial and error. In a looping process, the attacker received feedback on prior failures, then used that information to reformulate and retry in subsequent attempts. This continued until it finally achieved a successful attack or completed 25 iterations without one.
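The loop itself is simple to picture. Below is a rough, hypothetical sketch of that feedback cycle, assuming trivial stand-in helpers for the attacker LMP, the defending model and the success judge (none of these are real APIs or the paper’s code):

```python
import random

# Hypothetical sketch of the adaptive attack loop described above.
# The three helpers are trivial stand-ins so the loop runs end to end;
# they are not the attacker LMP, defender model or grader from the paper.

MAX_ITERATIONS = 25  # the loop stops after 25 attempts without a success

def propose_attack(goal: str, history: list) -> str:
    # A real LMP would rewrite its prompt using the accumulated feedback.
    return f"attempt {len(history) + 1} toward goal: {goal}"

def run_target(prompt: str) -> str:
    # Stand-in for the defending model's behavior.
    return "refusal" if random.random() < 0.9 else "compliance"

def judge_success(response: str) -> tuple[bool, str]:
    # Stand-in grader: reports success plus a description of the defender's behavior.
    return response == "compliance", f"defender responded with {response}"

def adaptive_attack(goal: str) -> dict:
    history = []  # feedback about the defender's behavior, carried across attempts
    for attempt in range(1, MAX_ITERATIONS + 1):
        prompt = propose_attack(goal, history)
        success, feedback = judge_success(run_target(prompt))
        if success:
            return {"success": True, "attempts": attempt}
        history.append((prompt, feedback))  # failures feed the next reformulation
    return {"success": False, "attempts": MAX_ITERATIONS}

print(adaptive_attack("example goal"))
```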
“Our setup allows the attacker to adapt its strategy over the course of multiple attempts, based on descriptions of the defender’s behavior in response to each attack,” the researchers write.
In the course of this research, OpenAI discovered that attackers are also actively exploiting inference time itself. One of these methods they dubbed “think less”: adversaries essentially tell the model to reduce its compute, which increases its susceptibility to error.
Similarly, they identified a failure mode in reasoning models that they call “nerd sniping.” As the name suggests, this occurs when a model spends significantly more time reasoning than a given task requires. With these “outlier” chains of thought, models can essentially become trapped in unproductive thinking loops.
The researchers note: “Like the ‘think less’ attack, this is a new approach to attack[ing] reasoning models, and one that needs to be taken into account to make sure that the attacker cannot cause them to either not reason at all, or spend their reasoning compute in unproductive ways.”