Researchers at Sakana AI, an artificial intelligence research lab that focuses on nature-inspired algorithms, have developed a self-adaptive language model that can learn new tasks without the need for fine-tuning. Called Transformer² (Transformer-squared), the model uses mathematical tricks to align its weights with user requests during inference.
This is the latest in a series of techniques that aim to improve the capabilities of large language models (LLMs) at inference time, making them increasingly useful for everyday applications in different domains.
Typically, configuring LLMs for new tasks requires an expensive fine-tuning process, during which the model is exposed to new examples and its parameters are adjusted. A more cost-effective approach is low-rank adaptation (LoRA), in which a small subset of the model's parameters relevant to the target task is identified and modified during fine-tuning.
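As a rough, hypothetical illustration of the LoRA idea (not Sakana's code and not any particular library's API), a low-rank update adds a small trainable correction on top of a frozen weight matrix:

```python
import torch

# Illustrative sketch of a LoRA-style update on a toy weight matrix.
d_out, d_in, rank = 512, 512, 8

W = torch.randn(d_out, d_in)            # frozen pretrained weights
A = torch.randn(rank, d_in) * 0.01      # trainable low-rank factor
A.requires_grad_()
B = torch.zeros(d_out, rank, requires_grad=True)  # zero init so the update starts at zero

def adapted_forward(x):
    # Only A and B receive gradients during fine-tuning; W stays frozen.
    return x @ (W + B @ A).T

y = adapted_forward(torch.randn(4, d_in))
print(y.shape)  # torch.Size([4, 512])
```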
After training and fine-tuning, the model's parameters remain frozen, and the only way to repurpose it for new tasks is through techniques such as few-shot and many-shot learning.
Unlike classic fine-tuning, Transformer-squared uses a two-step approach to dynamically adjust its parameters during inference. It first analyzes the incoming request to understand the task and its requirements, then applies task-specific adjustments to the model’s weights to optimize its performance for that specific request.
“By selectively adjusting critical components of the model weights, our framework enables LLMs to dynamically adapt to new tasks in real time,” the researchers write in a blog post published on the company's website.
The core capability of Transformer-squared is to dynamically adjust the critical components of its weights at inference time.
To do this, it must first identify the key components that can be modified during inference. Transformer-squared does this via singular value decomposition (SVD), a linear algebra technique that breaks a matrix down into three other matrices that reveal its internal structure and geometry. SVD is often used to compress data or to simplify machine learning models.
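As a minimal sketch of this decomposition step, assuming PyTorch and a stand-in weight matrix rather than a real LLM layer, SVD splits the matrix into the three factors described above:

```python
import torch

# Illustrative only: decompose a toy weight matrix with SVD, the same
# linear-algebra step Transformer-squared applies to the LLM's weight matrices.
W = torch.randn(512, 512)                          # stand-in for one layer's weights
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# Each component U[:, i] * S[i] * Vh[i, :] is an independent direction in
# weight space; the researchers interpret groups of them as rough "skills".
W_rebuilt = U @ torch.diag(S) @ Vh
print(torch.allclose(W, W_rebuilt, atol=1e-3))     # True, up to float32 numerical error
```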
When applied to an LLM's weight matrices, SVD yields a set of components that roughly represent the model's different skills, such as mathematics, language comprehension or coding. In their experiments, the researchers found that these components could be tweaked to adjust the model's abilities on specific tasks.
To leverage these findings systematically, they developed a process called Singular Value Fine-tuning (SVF). At training time, SVF learns a set of vectors from the SVD components of the model. These vectors, called z-vectors, are compact representations of individual skills and can be used as knobs to amplify or dampen the model's ability on specific tasks.
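A hypothetical sketch of the SVF idea (not the released code) might look like the following: a single trainable vector rescales the singular values of a frozen weight matrix.

```python
import torch

# Sketch of the SVF concept on a toy matrix: a learned z vector rescales the
# singular values of a frozen weight matrix to amplify or dampen "skills".
W = torch.randn(512, 512)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

z = torch.ones_like(S, requires_grad=True)   # one trainable entry per singular value

def svf_weight():
    # Only z is trained on the downstream task; U, S and Vh stay fixed,
    # so very few parameters are learned compared with full fine-tuning.
    return U @ torch.diag(S * z) @ Vh

W_task = svf_weight()                        # task-adapted weight matrix
print(W_task.shape)                          # torch.Size([512, 512])
```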
At inference time, Transformer-squared uses a two-pass mechanism to adapt the LLM to unseen tasks. First, it examines the prompt to determine the skills needed to address the problem (the researchers propose three different techniques for determining the required skills). In the second pass, Transformer-squared configures the z-vectors corresponding to the request and runs the prompt through the model with the updated weights. This allows the model to provide a response tailored to each prompt.
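In pseudocode terms, the two-pass loop might look like the sketch below; the function and method names are illustrative assumptions, not Sakana's actual API.

```python
# Hypothetical sketch of the two-pass inference loop.
def transformer_squared_generate(prompt, model, z_vectors_by_task, classify_task):
    task = classify_task(prompt)                     # pass 1: identify the required skill(s)
    model.apply_z_vectors(z_vectors_by_task[task])   # rescale singular values for that task
    return model.generate(prompt)                    # pass 2: answer with the adapted weights
```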
The researchers applied Transformer-squared to Llama-3 and Mistral LLMs and compared it against LoRA on various tasks, including mathematics, coding, reasoning, and visual question answering. Transformer-squared outperformed LoRA on all benchmarks while using fewer parameters. It is also notable that, unlike Transformer-squared, LoRA models cannot adapt their weights at inference time, which makes them less flexible.
Another intriguing finding is that knowledge extracted from one model can be transferred to another. For example, z-vectors obtained from Llama models could be applied to Mistral models. The results were not on par with z-vectors created from scratch for the target model, and the transfer was only possible because the two models have similar architectures. But it suggests the possibility of learning generalized z-vectors that can be applied to a wide range of models.
“The way forward is to build models that dynamically adapt and collaborate with other systems, combining specialized capabilities to solve complex, multi-domain problems,” the researchers write. “Self-adaptive systems like Transformer² bridge the gap between static AI and living intelligence, paving the way for efficient, personalized and fully integrated AI tools that drive progress across industries and in our daily lives.”
Sakana AI has published the code to train Transformer-squared components on GitHub.
As companies explore different applications of LLMs, the past year has seen a notable shift toward the development of inference-time techniques. Transformer-squared is one of several approaches that allow developers to customize LLMs for new tasks at inference time without the need to retrain or fine-tune them.
Titans, an architecture developed by Google researchers, approaches the problem from a different angle, giving language models the ability to learn and memorize new information at inference time. Other techniques focus on letting frontier LLMs take advantage of their increasingly long context windows to learn new tasks without retraining.
Since companies possess the data and knowledge specific to their applications, advances in inference-time customization techniques will make LLMs much more useful.