Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Vector Policy Optimization (VPO) trains diverse policies to improve test-time search, achieving over 20% gains on best@k metrics across multiple tasks.
Key Findings
Methodology
This paper introduces Vector Policy Optimization (VPO), a reinforcement learning algorithm designed for post-training large language models (LLMs). VPO explicitly trains policies to anticipate diverse downstream reward functions by leveraging vector-valued rewards common in tasks like code generation or multi-hop reasoning. It combines multi-answer generation within a single autoregressive rollout and stochastic scalarization of reward vectors sampled from a Dirichlet distribution. This approach encourages the model to produce a set of candidate solutions that cover the Pareto frontier of the reward space, rather than collapsing onto a single scalar optimum. VPO replaces the traditional GRPO advantage estimator, optimizing a set-level objective that maximizes the expected best scalarized reward over sampled weightings, thereby enhancing diversity and competence of generated solutions.
Key Results
- Across four diverse tasks (Maze navigation, MuSiQue multi-hop QA, EUREQA logical reasoning, and ToolRL tool use), VPO matches or beats the strongest scalar RL baselines on best@k metrics. For instance, on MuSiQue, VPO achieves a best@30 score of 0.832, an improvement of more than 10% over GRPO, with the performance gap widening as the search budget increases.
- On the LiveCodeBench code generation benchmark, a VPO-trained Qwen2.5-Coder-7B-Instruct model improves both pass@k and best@k over a matched-compute GRPO checkpoint. Moreover, when integrated with the OpenEvolve evolutionary search loop, VPO unlocks problem instances unsolvable by GRPO, demonstrating superior synergy with complex test-time search.
- Ablation studies reveal that neither multi-answer generation alone (Multi-RLVR) nor random reward weighting scalarization alone suffices to achieve VPO’s gains. The combination of both is critical to maintain reward-space diversity and improve test-time search outcomes.
Significance
This work addresses a critical gap in LLM post-training by aligning training objectives with the realities of test-time search, which relies on diverse candidate solutions rather than a single optimum. By explicitly optimizing for policy diversity across multiple reward dimensions, VPO enhances the model’s ability to generalize and adapt in complex, multi-objective tasks. This has significant implications for both academia and industry, as it improves the effectiveness of search-augmented inference methods and provides a principled framework for multi-objective RL in large-scale language models.
Technical Contribution
Technically, VPO integrates multi-answer autoregressive generation with stochastic scalarization of vector-valued rewards to form a stable set-level optimization objective. Unlike traditional GRPO, which optimizes a fixed scalar reward, VPO trains the policy to cover the Pareto frontier of multiple objectives, which mitigates policy collapse and enriches the candidate solution space. The approach separates exploration (handled during training) from exploitation (deferred to test-time search), a design with both theoretical and practical value for multi-objective RL in language models.
Novelty
VPO is the first method to combine multi-answer generation with randomized reward scalarization to explicitly optimize for diverse policy sets in LLM post-training. Unlike prior work focusing on single scalar objectives or goal-conditioned policies, VPO’s fundamental innovation lies in training a single policy to produce a diverse set of solutions that span different trade-offs in the reward vector space, tailored for test-time search scenarios.
Limitations
- VPO’s benefits depend on the non-collinearity of reward vector components; when reward dimensions are highly correlated or effectively scalar, performance gains diminish or reverse.
- The computational overhead of multi-answer generation and multiple reward weight samplings increases training costs compared to traditional scalar reward methods, potentially limiting scalability.
- Current evaluations are limited to four tasks and specific model architectures; generalization to larger models or multimodal tasks requires further validation.
Future Work
Future research directions include developing more efficient multi-answer generation strategies to reduce computational costs, adaptive reward weight sampling for improved training stability, and extending VPO to multimodal and larger-scale language models. Additionally, deeper integration of VPO with advanced test-time search algorithms could further enhance performance on complex tasks.
AI Executive Summary
Large language models (LLMs) have revolutionized natural language processing and related fields, yet their deployment increasingly relies on complex test-time search procedures, such as evolutionary algorithms like AlphaEvolve, which select among multiple candidate solutions based on task-specific reward functions. Traditional post-training methods optimize a fixed scalar reward, which often leads to low-entropy, narrowly focused policies that generate near-duplicate outputs, limiting the effectiveness of downstream search.
To address this, the authors propose Vector Policy Optimization (VPO), a novel reinforcement learning algorithm that explicitly trains policies to produce diverse sets of solutions by leveraging vector-valued rewards inherent in many tasks. VPO combines multi-answer autoregressive generation, where multiple candidate solutions are generated sequentially within a single rollout, with stochastic scalarization, sampling reward weight vectors from a Dirichlet distribution. This encourages the policy to cover the Pareto frontier of the reward space, maintaining a rich and diverse candidate pool for test-time search.
The core technical principle of VPO is to separate exploration and exploitation: training focuses on generating a diverse set of competent solutions, while exploitation is deferred to the test-time search procedure. By optimizing the expected best scalarized reward over sampled weightings, VPO prevents premature policy collapse and fosters specialization of individual solutions to different reward trade-offs.
Empirical evaluation across four distinct tasks—Maze navigation, MuSiQue multi-hop question answering, EUREQA logical reasoning, and ToolRL tool use—demonstrates that VPO consistently outperforms state-of-the-art scalar RL baselines on best@k metrics, with improvements growing as the search budget increases. Notably, on the LiveCodeBench code generation benchmark, VPO-trained models achieve higher pass@k and best@k scores than GRPO-trained counterparts and unlock problem instances unsolvable by GRPO when combined with OpenEvolve evolutionary search.
This work has broad implications for both academic research and practical deployment of LLMs. By aligning training objectives with the realities of test-time search, VPO enhances model adaptability and generalization in multi-objective settings. However, its benefits depend on the structure of the reward space and come with increased computational costs. Future work will focus on scaling VPO to larger models, multimodal tasks, and integrating it more deeply with advanced search algorithms, potentially establishing diversity optimization as a standard post-training objective for LLMs.
Deep Analysis
Background
Large language models (LLMs) have rapidly evolved, achieving remarkable performance across natural language understanding, generation, and reasoning tasks. Traditional training paradigms often optimize a single scalar reward, such as accuracy or human feedback scores, to guide model learning. However, real-world applications increasingly embed LLMs within complex inference pipelines that perform test-time search or sampling to select the best output among many candidates. Examples include rejection sampling with verifiers and evolutionary search methods like AlphaEvolve. In these contexts, the diversity of candidate solutions is crucial for effective search, as it allows exploration of different trade-offs and strategies. Prior reinforcement learning (RL) approaches, such as GRPO, optimize scalar rewards and tend to produce low-entropy policies that collapse to a narrow set of similar outputs, limiting search effectiveness. Multi-objective RL and lexicase selection in evolutionary computation have explored maintaining diverse solutions optimal under different objectives, but their application to LLM post-training remains underexplored. This paper builds on these insights to address the gap between training objectives and test-time search needs.
Core Problem
The core challenge addressed is the mismatch between traditional scalar reward optimization during LLM post-training and the requirements of test-time search procedures that benefit from diverse candidate solutions. Specifically, training with a fixed scalar reward causes the policy to concentrate probability mass on a single dominant mode, leading to candidate sets with low diversity and redundancy. This premature convergence reduces the search space's richness, hindering the discovery of superior solutions during inference. Moreover, many practical tasks naturally decompose rewards into multiple components (e.g., per-test-case correctness, multi-hop reasoning steps), which scalarization collapses, losing valuable structure. The problem is to design a training objective that encourages the policy to maintain a diverse set of competent solutions spanning different trade-offs in the reward vector space, thereby enhancing test-time search efficacy.
Innovation
The paper introduces several key innovations:
1) Vector Policy Optimization (VPO): a novel RL algorithm that trains policies to output sets of solutions covering the Pareto frontier of multi-dimensional reward spaces, rather than optimizing a single scalar reward.
2) Multi-answer autoregressive generation: generating multiple candidate answers sequentially within a single rollout, allowing later answers to condition on earlier ones to explicitly encourage diversity.
3) Stochastic scalarization: sampling reward weight vectors from a Dirichlet distribution to define multiple scalar objectives during training, incentivizing the policy to specialize candidates along different reward trade-offs.
4) Set-level optimization objective: maximizing the expected best scalarized reward over sampled weightings, directly rewarding coverage of the reward space and preventing policy collapse.
5) Comprehensive ablations: demonstrating the necessity of combining multi-answer generation with stochastic scalarization to achieve improved test-time search performance.
These innovations collectively shift the training paradigm from single-solution optimization to diverse solution set optimization, tailored for search-augmented inference.
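The set-level objective in item 4 can be written out as follows. This is a reconstruction from the description in this summary, not a formula quoted from the paper; the symbol R(x, S) and the two-candidate worked example are illustrative assumptions.

```latex
% Assumed notation: r(x, y) \in \mathbb{R}^d is the reward vector of answer y to prompt x,
% S = \{y_1, \dots, y_m\} is the candidate set, and w lies on the simplex \Delta^{d-1}.
\[
R(x, S) \;=\; \mathbb{E}_{w \sim \mathrm{Dir}(\alpha)}
  \Big[ \max_{y \in S} \, w^\top r(x, y) \Big]
\;\approx\; \frac{1}{K} \sum_{k=1}^{K} \max_{y \in S} \, {w^{(k)}}^{\top} r(x, y).
\]

% Worked example with d = 2 and \alpha = (1, 1), so w = (u, 1 - u) with u \sim \mathrm{Uniform}[0, 1]:
\[
S_{\text{diverse}}:\; r \in \{(1, 0),\, (0, 1)\}
  \;\Rightarrow\; R = \mathbb{E}[\max(u,\, 1 - u)] = \tfrac{3}{4},
\qquad
S_{\text{duplicate}}:\; r \in \{(1, 0),\, (1, 0)\}
  \;\Rightarrow\; R = \mathbb{E}[u] = \tfrac{1}{2}.
\]
```

In this toy case the two sets contain answers of equal individual quality, yet the set whose answers specialize to different reward components scores higher, which is the pressure toward coverage that item 4 describes.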
Methodology
- Multi-answer generation: Following Puri et al. (2026), the model generates m candidate completions sequentially within a single autoregressive rollout, separated by delimiter tokens. Each subsequent answer conditions on previous ones, enabling explicit in-context exploration.
- Reward vector decomposition: Tasks provide vector-valued rewards r(x,y) = [r1, r2, ..., rd], capturing multiple quality aspects (e.g., per-test-case correctness).
- Stochastic scalarization: For each rollout, sample K reward weight vectors w^(k) from a Dirichlet(α) distribution over the simplex Δ^{d-1}, inducing scalar objectives w^(k)ᵀ r(x,y).
- Set-level reward: For candidate set S = {y1,...,ym}, define the set reward as the average over K samples of max_{y ∈ S} w^(k)ᵀ r(x,y), encouraging coverage of different reward trade-offs.
- Advantage estimation: Replace GRPO advantage estimator with VPO’s set-level reward to compute policy gradients, applying the same advantage uniformly to all tokens in the rollout.
- Training procedure: For each prompt x, sample G rollouts of m candidates, evaluate with K scalarizations, compute Monte Carlo estimates of set rewards, and update policy parameters via gradient ascent.
- Evaluation metrics: Use best@k and pass@k to assess test-time search performance, measuring the quality of the best candidate among k samples.
This methodology explicitly separates exploration (training diverse candidate sets) from exploitation (test-time search), optimizing the policy to produce a rich, diverse solution space.
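The reward computation in the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the toy reward vectors are made up, and the mean/standard-deviation normalization across the group is the usual GRPO-style choice, which this summary does not spell out for VPO.

```python
import random
from statistics import mean, pstdev


def sample_dirichlet(alpha, rng):
    """Draw one weight vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def set_reward(reward_vectors, weights):
    """Monte Carlo set reward: average, over weight samples, of the best
    scalarized reward w^T r achieved by any candidate in the set."""
    best_per_weight = [
        max(sum(w_i * r_i for w_i, r_i in zip(w, r)) for r in reward_vectors)
        for w in weights
    ]
    return mean(best_per_weight)


def group_advantages(set_rewards, eps=1e-6):
    """ASSUMPTION: GRPO-style normalization of set rewards within a group of
    G rollouts. Each value would be applied to every token of its rollout."""
    mu = mean(set_rewards)
    sigma = pstdev(set_rewards)
    return [(r - mu) / (sigma + eps) for r in set_rewards]


rng = random.Random(0)
alpha = [1.0, 1.0]  # hypothetical Dirichlet concentration, d = 2
K = 8               # weight samples per rollout

# G = 3 toy rollouts, each holding m = 2 candidate reward vectors.
rollouts = [
    [[1.0, 0.0], [0.0, 1.0]],  # candidates specialize to different components
    [[1.0, 0.0], [1.0, 0.0]],  # near-duplicate candidates
    [[0.5, 0.5], [0.5, 0.5]],  # balanced but identical candidates
]

# Following the description above, weights are sampled separately per rollout.
rewards = [
    set_reward(candidates, [sample_dirichlet(alpha, rng) for _ in range(K)])
    for candidates in rollouts
]
advantages = group_advantages(rewards)
```

For any fixed set of weights, the first rollout can never score below the second, because its per-weight maximum always includes the second rollout's value; that is the mechanism that favors sets spanning different trade-offs.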
Experiments
Experiments span four domains chosen to represent diverse multi-objective structures:
1) Maze navigation: A 9×9 grid task where the model outputs text sequences of moves to collect gold and diamonds, avoid lava, and reach the exit. Rewards include one binary completion component and three clipped item/safety terms. A Qwen3-4B model is trained, then evaluated on 100 held-out mazes.
2) MuSiQue multi-hop QA: The model selects supporting paragraphs from 20 candidates and produces a final answer. Rewards consist of four binary citation indicators and a continuous answer F1 score, with the answer score weighted three times. A Qwen3-1.7B model is trained, then evaluated on 300 stratified questions.
3) EUREQA logical reasoning: The model chains through five relations to identify masked entities, with binary per-entity rewards. A Qwen3-8B model is trained, then evaluated on a hard split, with results averaged over 4 seeds.
4) ToolRL tool use: Rewards include one binary structural-format component and three continuous F1 dimensions (tool-name, arg-key, arg-value). A Qwen3-1.7B model is trained, then evaluated on 80 prompts, with results averaged over 4 seeds.
Baselines include GRPO, Multi-RLVR (multi-answer with fixed scalar reward), random-weighting GRPO, Max-at-k training, MaxRL, and goal-conditioned GRPO. Metrics focus on best@k and reward-space diversity. Training details and hyperparameters are documented in the appendix.
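As a rough guide to how these metrics are typically computed, the sketch below gives the standard unbiased pass@k estimator from the Codex paper (Chen et al., listed in the references) and one plausible reading of best@k as the expected maximum score over a random size-k subset of n samples. The exact best@k definition used in the paper is not stated in this summary, so `best_at_k` is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(scores, k):
    """ASSUMED definition: expected maximum score over a uniformly random
    size-k subset (without replacement) of the n sampled scores."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(scores)")
    ordered = sorted(scores)
    # The score at 0-based rank i is the maximum of exactly C(i, k - 1) subsets:
    # itself plus k - 1 elements chosen from the i lower-ranked ones.
    return sum(s * comb(i, k - 1) for i, s in enumerate(ordered)) / comb(n, k)
```

For example, `best_at_k([0.2, 0.8, 0.5], 2)` averages the maxima of the three possible pairs (0.8, 0.5, 0.8), and `pass_at_k(4, 1, 2)` is the chance that a random pair out of four samples contains the single correct one.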
Results
VPO matches or beats the strongest baselines across tasks:
- MuSiQue: VPO achieves best@30 of 0.832, surpassing GRPO by over 10%, with performance gains increasing with k.
- Maze: VPO reaches best@30 of 0.671, significantly higher than GRPO’s 0.432, demonstrating effective coverage of reward trade-offs.
- EUREQA and ToolRL: VPO maintains superior best@k scores, with ablations confirming the necessity of both multi-answer generation and stochastic scalarization.
- LiveCodeBench: VPO-trained Qwen2.5-Coder-7B-Instruct improves pass@k and best@k over GRPO, and when integrated with OpenEvolve evolutionary search, solves previously intractable problems.
- Ablations show that multi-answer generation alone or random reward weighting alone fail to maintain reward-space diversity or improve test-time search, highlighting the synergy in VPO’s design.
These results validate VPO’s effectiveness in enhancing policy diversity and downstream search performance.
Applications
VPO is applicable to any scenario where LLMs are embedded within test-time search frameworks requiring diverse candidate solutions. This includes code generation tasks where per-test-case correctness varies, multi-hop question answering requiring diverse reasoning paths, complex logical reasoning, and tool use scenarios with multiple evaluation criteria. Industrial deployments can integrate VPO into post-training pipelines to improve model robustness and adaptability in multi-objective environments. Academically, VPO provides a new paradigm for multi-objective RL in language models, fostering research into diversity maintenance and search-aware training. Future extensions could adapt VPO to multimodal tasks and larger-scale models, broadening its impact.
Limitations & Outlook
VPO’s effectiveness hinges on the non-collinearity of reward vector components; in tasks where reward dimensions are highly correlated or effectively scalar, VPO’s advantage diminishes or reverses. The computational cost is higher due to multi-answer generation and multiple reward weight samplings, posing challenges for scaling to very large models or datasets. Additionally, current experiments are limited to four tasks and specific model architectures, necessitating further validation for generalization. Finally, the design of the reward weight sampling distribution impacts training stability and performance, requiring careful tuning.
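The collinearity point can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not an experiment from the paper: the function names and reward vectors are hypothetical. It counts which candidate wins under each sampled weighting. When every reward vector is a multiple of the same direction, a single candidate wins under all weightings, so the set-level objective has nothing extra to reward.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one weight vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def winners(reward_vectors, weights):
    """Set of candidate indices that achieve the best scalarized reward
    under at least one of the given weight vectors."""
    found = set()
    for w in weights:
        scores = [sum(w_i * r_i for w_i, r_i in zip(w, r)) for r in reward_vectors]
        found.add(scores.index(max(scores)))
    return found


rng = random.Random(0)
weights = [sample_dirichlet([1.0, 1.0], rng) for _ in range(200)]

# Hypothetical candidates with genuine trade-offs between two reward components.
tradeoff = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
# Hypothetical candidates whose rewards are all multiples of (1, 2): collinear.
collinear = [[0.2, 0.4], [0.5, 1.0], [0.3, 0.6]]

distinct_tradeoff = winners(tradeoff, weights)
distinct_collinear = winners(collinear, weights)  # only index 1 can ever win
```

In the collinear case the ranking of candidates is the same for every weight vector on the simplex, so stochastic scalarization reduces to optimizing a single scalar reward.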
Abstract
Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
References (20)
Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models
Zhipeng Chen, Xiaobo Qin, Youbin Wu et al.
ToolRL: Reward is All Tool Learning Needs
Cheng Qian, Emre Can Acikgoz, Qi He et al.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu et al.
Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
Isha Puri, Mehul Damani, Idan Shenfeld et al.
InfAlign: Inference-aware language model alignment
Ananth Balashankar, Ziteng Sun, Jonathan Berant et al.
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye et al.
Curiosity-driven Red-teaming for Large Language Models
Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang et al.
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Yinlam Chow, Guy Tennenholtz, Izzeddin Gur et al.
A practical guide to multi-objective reinforcement learning and planning
Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi et al.
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Rajagopal Setlur, Matthew Y. R. Yang, C. Snell et al.
Training Verifiers to Solve Math Word Problems
K. Cobbe, Vineet Kosaraju, Mo Bavarian et al.
Mathematical discoveries from program search with large language models
B. Romera-Paredes, M. Barekatain, Alexander Novikov et al.
Random Latent Exploration for Deep Reinforcement Learning
Srinath Mahankali, Zhang-Wei Hong, Ayush Sekhari et al.
Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.
The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation
Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva et al.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu et al.
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Haoxiang Wang, Wei Xiong, Tengyang Xie et al.
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun et al.
A Survey of Multi-Objective Sequential Decision-Making
D. Roijers, P. Vamplew, Shimon Whiteson et al.
Exploration in Deep Reinforcement Learning: A Survey
Pawel Ladosz, Lilian Weng, Minwoo Kim et al.