One of the most interesting things about generative AI models (both large language models (LLMs) and diffusion-based image generators) is that they are "non-deterministic." That is, despite their reputation among some critics as "elegant autocorrect," generative AI models actually produce their results by sampling from a distribution of the most likely next tokens (units of information) to complete a response.
Ask an LLM "What is the capital of France?" and it will sample its probability distribution over tokens related to France, capitals, cities, and so on to arrive at the answer "Paris." But that answer could come in the form "The capital of France is Paris," simply "Paris," or even "Paris, although at one point it was Versailles."
Still, those of us who use these models day to day will notice that their answers can seem annoyingly repetitive or similar. The same coffee joke gets recycled across generations of queries. Story prompts produce similar arcs. Even tasks that should yield many plausible answers (such as naming the states of the United States) tend to collapse to just a few. This phenomenon, known as mode collapse, arises during post-training alignment and limits the usefulness of otherwise powerful models.
Especially when we use LLMs to generate new creative work in writing, communications, strategy, or illustration, we want their outputs to be far more varied than they currently are.
Now a team of researchers from Northeastern University, Stanford University, and West Virginia University has devised an ingeniously simple method for making language and image models generate a wider variety of answers to almost any user question: adding a single, simple sentence to the prompt: "Generate 5 responses with their corresponding probabilities, sampled from the full distribution."
The method, called Verbalized Sampling (VS), helps models such as GPT-4, Claude, and Gemini produce more diverse and human-like results, without retraining or access to internal parameters. It is described in a paper published on the open-access preprint server arXiv.org in early October 2025.
When prompted this way, the model no longer defaults to its safest, most typical output. Instead, it verbalizes its internal distribution over possible completions and samples across a broader spectrum of possibilities. This one-line change leads to substantial gains in output diversity across multiple domains.
As Weiyan Shi, an assistant professor at Northeastern University and co-author of the paper, wrote on X: "The potential of LLMs is not fully unlocked yet! As shown in our paper, prompt optimization can be guided by thinking about how LLMs are trained and aligned, and can be proven theoretically."
According to the research team, the root cause of mode collapse lies not only in algorithms such as reinforcement learning from human feedback (RLHF), but in the structure of human preferences. People tend to rate more familiar or typical responses as better, pushing LLMs toward "safe" rather than diverse outputs during fine-tuning.
However, this bias does not erase the model's underlying knowledge; it simply suppresses it. VS works by undoing this suppression. Instead of asking for the single most likely answer, it invites the model to reveal a set of plausible answers along with their relative probabilities. This distribution-level cue restores access to the richer diversity present in the underlying pre-trained model.
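As a concrete illustration, here is a minimal sketch of the technique applied through an ordinary chat API. It assumes the OpenAI Python SDK and a generic chat model; the model name, the coffee-joke task, and the exact prompt wording are illustrative choices, not prescribed by the paper.

from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()

# The only change from a normal request is the extra sentence appended to the prompt.
prompt = (
    "Tell me a joke about coffee. "
    "Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative; any capable chat model should work
    messages=[{"role": "user", "content": prompt}],
)

# The model replies with a list of candidate jokes, each annotated with a
# verbalized probability; one candidate can then be picked at random (or by
# weight) instead of always accepting the single most typical completion.
print(response.choices[0].message.content)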
The research team tested verbalized sampling in several common use cases:
Creative writing: In story generation, VS increased diversity scores by up to 2.1 times compared with standard prompting while maintaining quality. One story prompt, "No Goodbye," produced formulaic breakup scenes under direct prompting, but yielded narratives involving cosmic events, silent emails, and music that stopped mid-dance when prompted via VS.
Dialogue simulation: In persuasive dialogue tasks, VS enabled models to simulate human-like patterns such as hesitation, resistance, and changes of opinion. Distributions of donation behavior under VS aligned more closely with real human data than those from baseline methods.
Open-ended QA: When asked to list valid answers (e.g., naming U.S. states), models using VS generated responses that more closely matched the diversity of real-world data. They covered a broader set of answers without sacrificing factual accuracy.
Synthetic data generation: When used to generate math problems for model training, VS produced more varied datasets. These, in turn, improved downstream performance on competitive math benchmarks, outperforming synthetic data generated through direct prompting.
A notable advantage of VS is its tunability. Users can set a probability threshold in the prompt to sample from the lower-probability "tails" of the model's distribution. Lower thresholds correspond to greater diversity. This adjustment is made through the prompt text alone, without changing any decoding settings such as temperature or top-p.
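Here is a sketch of how that threshold can be expressed purely in prompt text. The helper function and its exact wording are illustrative, not the paper's canonical template.

# Hypothetical helper that builds a verbalized-sampling prompt with a tunable threshold.
def vs_prompt(task: str, k: int = 5, threshold: float = 0.10) -> str:
    return (
        f"{task}\n"
        f"Generate {k} responses with their corresponding probabilities, "
        f"sampled from the full distribution. "
        f"Each response should have a probability below {threshold}."
    )

# Lower thresholds push sampling further into the distribution's low-probability tail.
print(vs_prompt("Write an opening line for a story titled 'No Goodbye'.", threshold=0.001))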
In a test using the Gemini-2.5-Flash model, diversity in story writing increased steadily as the probability threshold fell from 1 to 0.001. The chart accompanying the study showed that VS outperformed both direct and sequence-based prompting at every threshold.
Interestingly, the method scales well with model size. Larger models such as GPT-4.1 and Claude-4 showed even larger gains from VS than smaller ones. While smaller models still benefited, the improvement in diversity was roughly 1.5 to 2 times greater in their larger counterparts, suggesting that VS helps unlock more latent capability in advanced models.
The verbalized sampling method is now available as a Python package:
pip install verbalized-sampling
The package includes integration with LangChain and supports a simple interface for sampling from the verbalized distribution. Users can also adjust parameters such as k (the number of responses), the probability threshold, and temperature to suit their applications.
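The package's own interface is documented in its repository; as a rough sketch of the same idea inside a LangChain pipeline (without relying on the package's specific API, which is not detailed here), assuming langchain-openai is installed and an OpenAI key is configured:

from langchain_openai import ChatOpenAI  # assumes the langchain-openai package

llm = ChatOpenAI(model="gpt-4.1", temperature=0.9)  # model choice is illustrative

vs_suffix = (
    " Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

# Appending the verbalized-sampling sentence to any task works with a plain
# LangChain chat model; see the project README for the package's own helpers.
result = llm.invoke("Suggest a tagline for a coffee shop." + vs_suffix)
print(result.content)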
A live Colab notebook and documentation are available under an enterprise-friendly Apache 2.0 license on GitHub at: https://github.com/CHATS-lab/verbalized-sampling
While the method works with all major LLMs, some users may initially encounter refusals or errors. In those cases, the authors suggest using the system-prompt version of the template or consulting the alternative formats listed on the GitHub page. Some models interpret complex instructions as jailbreak attempts and refuse to comply unless the structure is clearer. For example, issuing the request through a system-level prompt like this improves reliability:
You are a helpful assistant. For each query, generate five responses in separate tags, each with a probability below 0.10.
This small change usually resolves the issue.
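Put concretely, the same wording can be supplied through the system role of a standard chat call. A minimal sketch with the OpenAI Python SDK follows; the model name and the user query are illustrative.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. For each query, generate five "
                "responses in separate tags, each with a probability below 0.10."
            ),
        },
        {"role": "user", "content": "Name a U.S. state."},
    ],
)
print(response.choices[0].message.content)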
Verbalized sampling represents a practical inference-time solution to a deep limitation in the behavior of modern language models. It requires no retraining or internal access. It is not tied to any particular model family. And it improves not only the diversity of results but also their quality, as judged by both human evaluation and benchmark scores.
With growing interest in tools that enhance model creativity, VS is likely to see rapid adoption in domains such as writing, design, simulation, education, and synthetic data generation.
For users and developers frustrated by the uniformity of LLM answers, the solution may be as simple as changing the question.