Verbalized Sampling
Verbalized Sampling (VS) is a training-free prompting strategy that improves the diversity of large language model outputs by asking the model to generate multiple responses with self-assigned probabilities. Introduced by researchers at Stanford University, Northeastern University, and West Virginia University in October 2025, VS addresses mode collapse — the tendency of aligned language models to produce narrow, repetitive outputs — by counteracting typicality bias embedded in human preference data[^c1].
The technique requires no fine-tuning, model modifications, or access to internal logits. Instead, a prompt instructs the model to produce several candidate responses, each with a numerical probability, which lets outputs be drawn from the model's full distribution instead of collapsing to a single mode[^c2]. Empirical evaluations show that VS improves creative writing diversity by 1.6–2.1x over standard prompting and recovers approximately 66.8% of the model's pre-alignment creativity[^c3][^c22]. Research on mode collapse points to multiple contributing factors: typicality bias in human annotations (α = 0.57–0.65)[^c4], KL-regularization bias in RLHF[^c9], self-recognition effects in which models suppress entropy when reading their own outputs[^c10][^c21], and fundamental token-level miscalibration in the model's probability distributions[^c16].
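The prompting loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt wording, the JSON reply format, and the names `build_vs_prompt`, `parse_candidates`, and `sample_response` are assumptions made for this example, and `example_reply` stands in for a real model response.

```python
import json
import random


def build_vs_prompt(task: str, k: int = 5) -> str:
    """Wrap a task in a VS-style instruction (hypothetical wording)."""
    return (
        f"{task}\n\n"
        f"Generate {k} different responses. Return a JSON list of objects, "
        'each with a "text" field and a "probability" field giving your '
        "estimate of how likely that response is."
    )


def parse_candidates(raw: str) -> list[tuple[str, float]]:
    """Parse the JSON reply and renormalize the verbalized probabilities."""
    items = json.loads(raw)
    if not items:
        raise ValueError("model returned no candidates")
    # Clamp negative values; verbalized probabilities are not guaranteed valid.
    pairs = [(str(it["text"]), max(float(it["probability"]), 0.0)) for it in items]
    total = sum(p for _, p in pairs)
    if total <= 0:
        # Fall back to a uniform distribution if every probability is zero.
        return [(text, 1.0 / len(pairs)) for text, _ in pairs]
    return [(text, p / total) for text, p in pairs]


def sample_response(candidates: list[tuple[str, float]], rng=None) -> str:
    """Draw one response in proportion to its normalized probability."""
    rng = rng or random.Random()
    texts = [text for text, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(texts, weights=weights, k=1)[0]


# Stand-in for a model reply; a real system would send the prompt to an LLM.
example_reply = json.dumps([
    {"text": "A joke about coffee", "probability": 0.5},
    {"text": "A joke about cats", "probability": 0.3},
    {"text": "A joke about tax forms", "probability": 0.4},
])
candidates = parse_candidates(example_reply)
choice = sample_response(candidates, random.Random(0))
```

Renormalizing is a design choice in this sketch: self-assigned probabilities need not sum to one, so the parser rescales them before sampling.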
The VS paper identifies typicality bias in the human preference data used for reinforcement learning from human feedback (RLHF) as the root cause of mode collapse. Human annotators systematically favor familiar, fluent, and predictable text because of cognitive biases including the mere-exposure effect, processing fluency, and schema congruity. The paper quantifies this bias as a weight of α = 0.57–0.65 in the reward model[^c4].
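One way to see why a positive typicality weight narrows outputs is to insert it into the standard KL-regularized RLHF objective. The derivation below is an illustrative sketch under stated assumptions, and its notation may differ from the paper's: $r_{\text{true}}$ is a hypothetical bias-free reward, $\pi_{\text{ref}}$ is the pre-alignment reference model, $\beta > 0$ is the KL coefficient, and $\epsilon(x)$ is a noise term that depends only on the prompt.

```latex
% Assumed reward: true utility plus a typicality term weighted by alpha
r(x, y) = r_{\text{true}}(x, y) + \alpha \log \pi_{\text{ref}}(y \mid x) + \epsilon(x)

% Optimum of  E[r] - \beta \, \mathrm{KL}(\pi \,\|\, \pi_{\text{ref}});
% \epsilon(x) is absorbed into the normalizing constant
\pi^{*}(y \mid x)
  \propto \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{r(x, y)}{\beta}\right)
  = \pi_{\text{ref}}(y \mid x)^{\,1 + \alpha/\beta}
    \exp\!\left(\frac{r_{\text{true}}(x, y)}{\beta}\right)
```

Because the exponent $1 + \alpha/\beta$ exceeds 1 whenever $\alpha > 0$, the aligned policy is a sharpened version of the reference distribution. When many responses have similar true reward, as in open-ended creative tasks, probability mass concentrates on the responses the reference model already considered most typical.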
Verbalized Sampling builds on a lineage of related research. Self-consistency decoding, introduced in 2022, samples multiple reasoning paths and aggregates them via majority voting[^c5]. Research on confidence elicitation has shown that hybrid approaches combining verbalized confidence with consistency-based methods achieve the best calibration[^c6], although comparisons between the two are highly sensitive to the choice of measurement protocol, and verbalized confidence may partly reflect answer plausibility rather than correctness alone[^c13]. Later work on difficulty-adaptive self-consistency addresses the computational costs of multi-path sampling[^c7], while ConfTuner proposes fine-tuning methods that improve verbalized confidence calibration by up to 54.7%[^c8]. More recent alignment frameworks such as ORCE and CogConf report improvements in expected calibration error of 42–52% or more[^c14][^c15].
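The majority-voting step of self-consistency, and the general idea of pairing it with a verbalized confidence, can be illustrated as follows. This is a sketch under assumptions: the sampled answers are hard-coded rather than produced by a model, and `hybrid_confidence` is a hypothetical linear blend chosen for illustration, not the combination rule used in the cited work.

```python
from collections import Counter


def self_consistency_vote(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of paths that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize whitespace and case so trivially different strings match.
    counts = Counter(a.strip().lower() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


def hybrid_confidence(agreement: float, verbalized: float, weight: float = 0.5) -> float:
    """Hypothetical blend of consistency-based and verbalized confidence."""
    return weight * agreement + (1.0 - weight) * verbalized


# Final answers from five independently sampled reasoning paths (made up).
sampled = ["42", "42", "41", "42 ", "40"]
answer, agreement = self_consistency_vote(sampled)
confidence = hybrid_confidence(agreement, verbalized=0.9)
```

The agreement rate serves as a consistency-based confidence signal. Each extra path costs another model call, which is the expense that difficulty-adaptive variants aim to reduce.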
Broader research on diversity loss in alignment has identified multiple independent causes. Preference collapse in RLHF results from KL-based regularization, where minority preferences are systematically disregarded[^c9]; this is now understood as a statistical inevitability under reward-based alignment when preferences contain Condorcet cycles[^c19], while non-reward methods such as Nash Learning from Human Feedback can preserve preference diversity through mixed strategies[^c18]. Token-level entropy imbalance in RLVR algorithms such as GRPO causes premature determinism during training[^c17], and word-level lexical reachability analysis through the Word Coverage Score reveals that 22–57% of mid-frequency vocabulary is unreachable under default sampling parameters[^c20]. Together, these findings indicate that diversity collapse has multiple contributing factors requiring complementary mitigations at the data, algorithm, and decoding levels.
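A toy example makes the Condorcet-cycle point concrete. The sketch below is not the construction from the cited work: the three annotator rankings are invented, and `pairwise_majority` and `condorcet_winner` are hypothetical helper names. It shows a population whose majority preferences run A over B, B over C, and C over A, so no single scalar reward, which always induces a transitive ordering, can agree with every majority comparison.

```python
from itertools import combinations


def pairwise_majority(rankings: list[list[str]]) -> dict[tuple[str, str], bool]:
    """Map each ordered pair (a, b) to True when a strict majority ranks a above b."""
    options = rankings[0]
    n = len(rankings)
    prefers = {}
    for a, b in combinations(options, 2):
        a_wins = sum(r.index(a) < r.index(b) for r in rankings)
        prefers[(a, b)] = a_wins * 2 > n
        prefers[(b, a)] = (n - a_wins) * 2 > n
    return prefers


def condorcet_winner(rankings: list[list[str]]):
    """Return the option that beats every other by majority, or None if none exists."""
    prefers = pairwise_majority(rankings)
    options = rankings[0]
    for a in options:
        if all(prefers[(a, b)] for b in options if b != a):
            return a
    return None


# Three annotators, each ranking the responses from best to worst (made-up data).
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
majority = pairwise_majority(cyclic)
winner = condorcet_winner(cyclic)  # None: the majority preferences form a cycle
```

A mixed strategy that spreads probability across A, B, and C is one way to respect such preferences, which is the intuition behind the non-reward methods mentioned above.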