
New LLM optimization technique reduces memory costs by up to 75%




Researchers at Tokyo-based startup Sakana AI have developed a new technique that allows language models to use memory more efficiently, helping companies reduce the costs of building applications on top of large language models (LLMs) and other Transformer-based models.

The technique, called “universal transformer memory”, uses special neural networks to optimize LLMs so they preserve important bits of information and discard redundant details from their context.

Transformer memory optimization

The responses of Transformer models, the backbone of LLMs, depend on the content of their “context window”, that is, what they receive as input from users.

The context window can be considered the working memory of the model. Modifying the contents of the context window can have a tremendous impact on model performance, which has given rise to an entire field of “prompt engineering.”

Current models support very long context windows with hundreds of thousands, or even millions, of tokens (the numerical representations an LLM uses for the words, word parts, phrases, concepts, and numbers that users enter in their prompts).

This allows users to include more information in their prompts. However, longer prompts can result in higher processing costs and slower performance. Optimizing prompts to eliminate unnecessary tokens while preserving important information can reduce costs and increase speed.
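To get a rough sense of the stakes, the back-of-envelope sketch below estimates how key-value (KV) cache memory grows with context length, and how much could be saved by discarding most tokens. The model dimensions (32 layers, 8 KV heads, head size 128, 16-bit values) are illustrative assumptions, not measurements of any particular model.

```python
# Back-of-envelope KV-cache sizing for a hypothetical 8B-class Transformer.
# All dimensions are illustrative assumptions, not figures for any specific model.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Memory for keys + values across all layers, one sequence, 16-bit precision."""
    per_token = num_layers * num_kv_heads * head_dim * 2 * bytes_per_value  # K and V
    return num_tokens * per_token

for context_len in (8_000, 128_000, 1_000_000):
    full = kv_cache_bytes(context_len)
    pruned = kv_cache_bytes(int(context_len * 0.25))  # keeping only 25% of tokens
    print(f"{context_len:>9,} tokens: {full / 1e9:6.2f} GB full cache, "
          f"{pruned / 1e9:6.2f} GB if 75% of tokens are discarded")
```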

Current prompt optimization techniques are resource-intensive or require users to manually test different configurations to reduce the size of their prompts.

Neural attention memory modules

Universal transformer memory optimizes prompts using neural attention memory models (NAMMs), simple neural networks that decide whether to “remember” or “forget” each token stored in the LLM’s memory.
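Conceptually, a NAMM acts as a learned gate over the tokens held in memory. The minimal sketch below (not Sakana AI’s released code) shows the shape of the idea: a tiny network scores each cached token from some per-token features, and tokens scoring below a threshold are forgotten. The class name, feature choice, and threshold are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class TokenMemoryGate(nn.Module):
    """Illustrative keep/forget scorer: a tiny MLP over per-token features.
    A simplified stand-in for a NAMM, not Sakana AI's implementation."""

    def __init__(self, num_features: int = 4, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, token_features: torch.Tensor) -> torch.Tensor:
        # token_features: (num_cached_tokens, num_features), e.g. summary
        # statistics of how much recent queries attend to each cached token.
        scores = self.net(token_features).squeeze(-1)
        return scores > 0.0  # True = remember the token, False = forget it
```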

“This new capability allows Transformers to discard useless or redundant details and focus on the most critical information, something we consider crucial for tasks that require prolonged context reasoning,” the researchers write.

Universal transformer memory (source: Sakana AI)

NAMMs are trained separately from the LLM and combined with the pre-trained model at inference time, which makes them flexible and easy to deploy. However, they need access to the model’s internal activations, which means they can only be applied to open-source models.

Like other techniques developed by Sakana AI, NAMMs are trained with evolutionary algorithms rather than gradient-based optimization methods. By iteratively mutating and selecting the best-performing models through trial and error, evolutionary algorithms optimize NAMMs for efficiency and performance. This is especially important because NAMMs pursue a non-differentiable goal: keeping or discarding tokens.
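As a rough illustration of how a black-box, non-differentiable objective can still be optimized, the sketch below runs a generic mutate-and-select loop over a flat parameter vector. The fitness function, population size, and selection rule here are placeholders, not Sakana AI’s actual training recipe.

```python
import numpy as np

def evolve(fitness, dim: int, population: int = 32, generations: int = 100,
           sigma: float = 0.1, seed: int = 0) -> np.ndarray:
    """Toy mutate-and-select search over a flat parameter vector.
    `fitness` can be any black-box score (e.g. downstream task accuracy with a
    candidate memory module plugged in), so no gradients are required."""
    rng = np.random.default_rng(seed)
    best = rng.normal(size=dim) * sigma
    for _ in range(generations):
        candidates = best + rng.normal(size=(population, dim)) * sigma  # mutate
        scores = np.array([fitness(c) for c in candidates])             # evaluate
        elite = candidates[np.argsort(scores)[-population // 4:]]       # keep top 25%
        best = elite.mean(axis=0)                                       # recombine elites
    return best

# Placeholder black-box objective for demonstration only.
result = evolve(lambda p: -np.abs(p).sum(), dim=8)
```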

NAMMs operate on the attention layers of LLMs, one of the key components of the Transformer architecture that determines the relationships and importance of each token in the model’s context window. Based on the attention values, NAMMs determine which tokens should be kept and which can be discarded from the LLM’s context window. This attention-based mechanism makes it possible to use a trained NAMM across different models without additional modifications. For example, a NAMM trained on text-only data can be applied to vision or multimodal models without additional training.
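To make the attention-based selection concrete, here is an illustrative sketch that reduces an attention matrix to one score per cached token and drops the least-attended entries from a key-value cache. The fixed averaging-and-top-k rule is a stand-in assumption; an actual NAMM learns this decision from attention values rather than applying a hand-written heuristic.

```python
import torch

def prune_kv_cache(attn_weights: torch.Tensor,
                   keys: torch.Tensor,
                   values: torch.Tensor,
                   keep_ratio: float = 0.25):
    """Illustrative pruning: keep the cached tokens that recent queries attend to most.
    attn_weights: (num_heads, num_queries, num_cached_tokens)
    keys, values: per-token cache tensors with num_cached_tokens as the first dim.
    A real NAMM replaces this fixed heuristic with a small learned network."""
    # Average attention each cached token receives, across heads and recent queries.
    token_scores = attn_weights.mean(dim=(0, 1))              # (num_cached_tokens,)
    keep = max(1, int(keep_ratio * token_scores.numel()))
    kept_idx = token_scores.topk(keep).indices.sort().values  # preserve token order
    return keys[kept_idx], values[kept_idx], kept_idx
```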

Neural attention memory models (NAMMs) examine attention layers to determine which tokens should be retained or discarded from the context window (source: Sakana AI)

Universal memory in action

To test the universal transformer memory concept in action, the researchers trained a NAMM on top of an open-source Meta Llama 3-8B model. Their experiments show that with NAMMs, Transformer-based models perform better on natural language and coding problems over very long sequences. Meanwhile, by discarding unnecessary tokens, the NAMM allowed the LLM to save up to 75% of its cache memory while performing these tasks.

“In all of our benchmarks, NAMMs provide clear improvements in the performance of the Llama 3-8B transformer,” the researchers write. “In addition, our memory systems generate notable secondary benefits by reducing the context size of each layer, without ever explicitly optimizing memory efficiency.”

NAMMs compete with leading prompt optimization techniques while improving model performance (source: Sakana AI)

They also tested the NAMM on the 70B version of Llama, as well as on Transformer models designed for other modalities and tasks, such as Llava (computer vision) and Decision Transformer (reinforcement learning).

“Even in these out-of-distribution environments, NAMMs retain their benefits by discarding tokens such as redundant video frames and suboptimal actions, allowing their new base models to focus on the most relevant information to improve performance,” the researchers write.

Task-dependent behavior

Another interesting finding is that NAMMs automatically adjust their behavior depending on the task.

For example, for coding tasks, the model discards contiguous fragments of tokens that correspond to comments and whitespace that do not affect code execution.

On the other hand, in natural language tasks, the model discards tokens that represent grammatical redundancies and do not affect the meaning of the sequence.

The researchers have published the code for creating your own NAMMs. Techniques such as universal transformer memory can be very useful for enterprise applications that process millions of tokens and can benefit from speed gains and cost reductions. The reusability of a trained NAMM also makes it a versatile tool across different applications in an enterprise.

Looking ahead, the researchers suggest more advanced techniques, such as using NAMMs during LLM training to further expand their memory capabilities.

“This work has only begun to realize the potential of our new class of memory models, which we anticipate could offer many new opportunities to advance future generations of transformers,” the researchers write.


