OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

TL;DR

OmniVerifier-M1 employs symbolic bounding boxes and decoupled reinforcement learning to enhance visual verification accuracy, achieving 0.68 on ViVerBench.

cs.CL | Advanced | 2026-05-28

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang


multimodal verification, symbolic reasoning, reinforcement learning, meta-verification, error localization

Key Findings

Methodology

This work introduces a multimodal meta-verification framework based on symbolic outputs, specifically bounding boxes, to serve as rationales for feedback. By replacing textual explanations, it avoids reliance on auxiliary judge models and reduces reward hacking risks. The approach employs rule-based rewards derived from explicit spatial cues and separates the reinforcement learning objectives for binary judgment and meta-verification, addressing the conflicting dynamics observed in joint training. The training pipeline involves generating symbolic rationales, evaluating them with rule-based metrics like IoU, and optimizing the verifier via decoupled RL, which enhances stability and interpretability. Multi-round self-correction mechanisms further refine the verification process, enabling precise error localization and correction.
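To make the rule-based reward concrete, here is a minimal sketch of an IoU reward for a single predicted box. The `(x_min, y_min, x_max, y_max)` box format, the function names, and the thresholded reward shape are assumptions made for illustration; the summary does not state how the authors map IoU to a reward.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel or normalized units.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_box = (
        max(pred[0], target[0]),
        max(pred[1], target[1]),
        min(pred[2], target[2]),
        min(pred[3], target[3]),
    )
    inter = box_area(inter_box)
    union = box_area(pred) + box_area(target) - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred: Box, target: Box, threshold: float = 0.5) -> float:
    """Hypothetical reward: the IoU itself if it clears the threshold, else 0."""
    score = iou(pred, target)
    return score if score >= threshold else 0.0
```

Because the reward is computed from coordinates alone, no auxiliary judge model is called, which is the property the summary credits with reducing reward hacking.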

Key Results

On ViVerBench, OmniVerifier-M1 scored 0.68, ahead of the joint training baseline at 0.66. The symbolic bounding box approach reduced GPU memory usage by approximately 20% and per-step training time by 15% while maintaining inference efficiency. Error localization accuracy improved by 12% in complex scenarios, indicating stronger fine-grained verification. In multimodal tasks such as visual question answering and image captioning, performance gains of 3-4% were observed, supporting the method's robustness and generalization. The rule-based IoU reward mitigated reward hacking, leading to more stable training and more reliable validation results.
The experiments validated that symbolic outputs serve as effective and efficient supervision signals, matching or exceeding textual explanations in verification performance. The decoupled reinforcement learning paradigm significantly improved training stability and accuracy, especially in error localization and region-specific correction. Comparative analysis showed that rule-based rewards based on bounding boxes outperform model-based rewards in terms of computational overhead and robustness, making the approach scalable for large-scale deployment. The results collectively demonstrate that integrating symbolic reasoning with decoupled RL enhances both interpretability and performance in multimodal verification tasks.
Additional ablation studies confirmed that the combination of symbolic rationales and decoupled training yields the best trade-off between accuracy, efficiency, and robustness. The approach generalizes well across different datasets and tasks, including visual grounding and complex image generation, indicating its broad applicability. The framework’s ability to perform multi-round region-level self-correction further boosts its utility in real-world applications requiring high reliability and interpretability.

Significance

This research advances multimodal verification by integrating symbolic spatial cues with reinforcement learning, addressing longstanding issues of reward hacking and interpretability. Its core innovation, the use of explicit, rule-based spatial rationales, enables more precise error localization and correction, which are critical for deploying trustworthy AI systems in safety-critical domains like autonomous driving, medical diagnostics, and industrial automation. The decoupled RL training paradigm stabilizes learning and also improves the model's ability to generalize across diverse scenarios, supporting scalable, interpretable, and reliable multimodal AI. The approach aligns with industry needs for explainability and safety, offering a practical framework for building more transparent AI systems that can self-assess and improve iteratively.

Technical Contribution

The paper introduces a novel symbolic meta-verification framework that replaces textual explanations with spatial boundary cues, enabling rule-based rewards such as IoU. It innovatively employs decoupled reinforcement learning objectives for binary judgment and meta-verification, overcoming the conflicting dynamics of joint training. The design of a multi-round, region-specific self-correction mechanism allows the verifier to localize errors accurately and guide targeted refinements. Theoretically, the authors analyze the gradient dynamics, showing that decoupling reduces variance and enhances training stability. Empirically, the approach demonstrates superior performance on ViVerBench and visual grounding tasks, establishing a new standard for interpretable, efficient, and robust multimodal verification.

Novelty

This work is the first to systematically leverage symbolic spatial cues as the core rationales for meta-verification in multimodal models, moving beyond textual explanations. The key novelty lies in the decoupling of reinforcement learning objectives for binary judgment and meta-verification, which addresses the gradient sparsity and conflicting signals that plague joint training. The integration of rule-based IoU rewards with symbolic rationales provides a transparent, efficient supervision signal, significantly improving error localization and correction capabilities. Compared to prior methods like RewardDance or UnifiedReward, this approach offers a more interpretable, scalable, and robust solution for fine-grained visual verification across diverse scenarios.

Limitations

The reliance on accurate symbolic boundary annotations may limit performance in highly cluttered or occluded scenes, where error regions are ambiguous or hard to define precisely.
The rule-based reward system, while effective, may require task-specific tuning and may not generalize seamlessly to all types of visual errors or domains without adaptation.
The additional annotation cost for bounding boxes and symbolic cues increases data preparation complexity, and automatic generation of such cues remains an open challenge. Future work should explore weakly supervised or self-supervised methods to reduce labeling overhead.

Future Work

Future directions include integrating deep reasoning modules to enrich symbolic representations, enabling the system to handle more complex error types. Exploring automatic, self-supervised generation of boundary cues could reduce annotation costs and improve scalability. Extending the framework to real-time applications, such as autonomous driving or robotic manipulation, is another promising avenue. Additionally, combining this approach with large foundation models for end-to-end training could further enhance robustness and generalization, ultimately leading to more trustworthy and interpretable AI systems capable of self-assessment and iterative improvement in diverse real-world scenarios.

AI Executive Summary

In the rapidly evolving landscape of multimodal large language models, the ability to reliably verify visual outputs is crucial for ensuring safety, interpretability, and trustworthiness. Traditional verification methods, often limited to binary judgments or coarse textual explanations, fall short in providing the fine-grained error localization necessary for advanced applications such as autonomous driving, medical diagnostics, and industrial inspection. These limitations stem from the lack of explicit, spatially grounded feedback mechanisms that can guide models toward precise self-correction.

Addressing this challenge, the present work introduces OmniVerifier-M1, a multimodal verifier that uses symbolic bounding boxes as explicit rationales for meta-verification. This approach replaces verbose textual explanations with spatial cues, enabling rule-based rewards like IoU to serve as supervision signals. Such symbolic cues are more structured, more interpretable, and less susceptible to reward hacking, which makes training more stable and efficient. The core innovation lies in decoupling the reinforcement learning objectives for binary judgment and meta-verification, which are entangled under joint training and lead to conflicting gradients. By treating these as separate tasks, each with its own reward signal, the authors improve training stability and verification accuracy.

Experimental results on ViVerBench demonstrate that OmniVerifier-M1 achieves a score of 0.68, surpassing the joint training baseline of 0.66. The symbolic bounding box rewards reduce training overhead, cut GPU memory usage by 20%, and enhance error localization accuracy by 12%. The framework also generalizes effectively across tasks like visual grounding and complex image generation, confirming its robustness and versatility. These advancements not only improve the interpretability and reliability of multimodal verification but also open pathways for dynamic, region-level self-correction mechanisms.

Beyond technical innovations, this work has profound implications for deploying AI in safety-critical domains. By providing transparent, actionable feedback, OmniVerifier-M1 enhances model accountability and facilitates regulatory compliance. Its scalable, rule-based framework offers a practical blueprint for integrating verification into large-scale multimodal systems, fostering safer and more controllable AI applications. Looking ahead, future research will focus on automating symbolic cue generation, extending real-time capabilities, and integrating deep reasoning modules, aiming to realize fully autonomous, interpretable, and trustworthy multimodal AI systems that can self-assess and improve iteratively across diverse real-world scenarios.

Deep Analysis

Background

The evolution of large language models and their multimodal counterparts (MLLMs), including OpenAI's GPT-4, Google's PaLM-E, and Meta's Llama-2, has significantly advanced AI's reasoning and generative capabilities across text, images, and other modalities. Multimodal models have demonstrated strong performance in tasks like visual question answering, image captioning, and cross-modal retrieval. However, as their deployment expands into critical sectors like autonomous driving, healthcare, and industrial automation, ensuring their outputs are accurate, reliable, and interpretable becomes paramount. Traditional verification approaches, including reward models like RewardDance and UnifiedReward, primarily focus on coarse, binary judgments, which lack the granularity needed for precise error localization and correction. Moreover, textual explanations used as rationales often depend on auxiliary judge models, increasing complexity and susceptibility to reward hacking. Recent efforts like OmniVerifier have taken initial steps toward general visual verification using binary judgments, but their coarse feedback limits practical utility in complex, real-world scenarios. Consequently, there is a pressing need for verification frameworks that provide fine-grained, spatially grounded feedback, enabling models to self-assess and correct errors at the region level, thereby improving safety, interpretability, and user trust.

Core Problem

The core challenge in multimodal verification lies in balancing the need for detailed, actionable feedback with the constraints of training stability and computational efficiency. Binary judgments alone are insufficient for guiding models toward precise corrections, especially in complex scenes with multiple objects and subtle errors. Textual explanations, while informative, introduce dependencies on auxiliary judge models and are prone to hallucination or superficial reasoning, leading to reward hacking. Existing methods struggle with error localization, often providing only coarse signals that cannot support targeted self-correction. Additionally, joint training of binary judgment and meta-verification objectives often results in conflicting gradients, causing unstable learning dynamics and suboptimal performance. Addressing these issues requires a paradigm shift toward explicit, spatially grounded rationales and a training strategy that decouples heterogeneous objectives, ensuring stable, interpretable, and scalable verification.
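A toy illustration of why mixing the two rewards can send conflicting signals. This sketch assumes group-relative advantage normalization (subtracting the group mean and dividing by the group standard deviation) and an equal-weight mix for the joint reward; neither detail is confirmed by the summary, and the reward values are made up for the example.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def normalize(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: center on the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_advantages(r_judge: List[float], r_box: List[float], w: float = 0.5) -> List[float]:
    """Joint optimization: one scalar reward per rollout, mixed before normalization."""
    mixed = [w * j + (1.0 - w) * b for j, b in zip(r_judge, r_box)]
    return normalize(mixed)


def decoupled_advantages(r_judge: List[float], r_box: List[float]) -> Tuple[List[float], List[float]]:
    """Decoupled optimization: each objective is normalized on its own."""
    return normalize(r_judge), normalize(r_box)


# Made-up rewards for four rollouts: binary judgment correctness and box IoU.
r_judge = [1.0, 1.0, 0.0, 0.0]
r_box = [0.1, 0.9, 0.8, 0.2]

joint = joint_advantages(r_judge, r_box)
adv_judge, adv_box = decoupled_advantages(r_judge, r_box)
```

In this example the third rollout localizes the error well (IoU 0.8) but gets the binary judgment wrong. Under the joint reward its advantage is negative, so the good localization is discouraged; under the decoupled view the box objective still receives a positive advantage.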

Innovation

This work introduces three key innovations. First, it adopts symbolic bounding boxes as the core rationales for meta-verification, using their spatial and structural properties to enable rule-based rewards like IoU. Because the predicted boxes can be compared directly against annotated boxes, this representation simplifies error localization and improves interpretability. Second, it employs a decoupled reinforcement learning paradigm, optimizing binary judgment and meta-verification under separate objectives, which alleviates conflicting gradients and improves training stability. Third, the framework incorporates multi-round, region-specific self-correction, allowing the verifier to iteratively refine its judgments and guide targeted image edits. Together, these innovations bridge the gap between coarse binary judgments and detailed error analysis in fine-grained, interpretable, and efficient multimodal verification.
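The multi-round, region-specific self-correction idea can be sketched as a simple verify-then-edit loop. The `Verdict` type, the `verify` and `edit` callables, and the `max_rounds` budget are all hypothetical names introduced here; the summary does not describe the actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]


@dataclass
class Verdict:
    """Hypothetical verifier output: a binary judgment plus error regions."""
    ok: bool
    error_boxes: List[Box] = field(default_factory=list)


def self_correct(
    image: Any,
    prompt: str,
    verify: Callable[[Any, str], Verdict],
    edit: Callable[[Any, str, List[Box]], Any],
    max_rounds: int = 3,
) -> Tuple[Any, int]:
    """Alternate verification and region-level edits until the verifier
    accepts the image or the round budget runs out.

    Returns the final image and the number of edit rounds performed.
    """
    for round_idx in range(max_rounds):
        verdict = verify(image, prompt)
        if verdict.ok:
            return image, round_idx
        image = edit(image, prompt, verdict.error_boxes)
    return image, max_rounds


# Toy stand-ins: an "image" is just the set of regions that still contain errors.
def toy_verify(image: frozenset, prompt: str) -> Verdict:
    defects = sorted(image)
    return Verdict(ok=not defects, error_boxes=defects[:1])


def toy_edit(image: frozenset, prompt: str, boxes: List[Box]) -> frozenset:
    return image - set(boxes)


final_image, rounds = self_correct(
    frozenset({(0, 0, 1, 1), (2, 2, 3, 3)}), "a red cube", toy_verify, toy_edit
)
```

The toy verifier reports one flawed region per round, so two edits are needed before the third verification pass accepts the result.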

Methodology

• Data Preparation: Collect datasets with images, prompts, binary labels, textual rationales, and symbolic bounding boxes. Use these annotations to supervise model training.
• Symbolic Rationales: Generate bounding boxes as explicit spatial cues indicating error regions, ensuring they are spatially aligned with visual content.
• Model Architecture: Build on Qwen3-VL-8B, integrating modules for symbolic rationale generation, rule-based reward evaluation, and decoupled RL optimization.
• Reward Design: Implement IoU-based rule rewards for symbolic bounding boxes, and keep the reward signal for binary judgment accuracy separate from the reward signal for meta-verification rationale correctness.
• Training Process:
  • Sample: Draw batches from the dataset, generate model outputs and rationales.
  • Evaluate: Compute rule-based IoU rewards for bounding boxes, and score binary judgments with their own reward signal.
  • Optimize: Update the verifier using decoupled RL objectives, alternating between binary judgment and meta-verification, to stabilize training.
• Error Localization & Correction: Use the symbolic bounding boxes to identify error regions, then generate targeted image edits through structured commands.
• Multi-round Refinement: Repeat the generate-evaluate-correct cycle to progressively improve verification accuracy and error localization.
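The alternation between the two objectives in the training process above can be sketched as a round-robin schedule. This is an assumption-heavy outline: the objective names, the `steps_per_phase` parameter, and the `update_fns` callables are hypothetical placeholders for whatever optimizer steps the authors actually use.

```python
from itertools import cycle
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def alternating_schedule(objectives: Sequence[str], steps_per_phase: int = 1) -> Iterator[str]:
    """Yield objective names in round-robin phases, indefinitely."""
    for name in cycle(objectives):
        for _ in range(steps_per_phase):
            yield name


def train(
    num_steps: int,
    update_fns: Dict[str, Callable[[int], float]],
    steps_per_phase: int = 1,
) -> List[Tuple[str, float]]:
    """Run `num_steps` updates, each on exactly one objective.

    `update_fns` maps an objective name to a function that performs one
    optimizer step for that objective and returns its loss.
    """
    history = []
    schedule = alternating_schedule(list(update_fns), steps_per_phase)
    for step, name in zip(range(num_steps), schedule):
        history.append((name, update_fns[name](step)))
    return history


# Dummy update functions standing in for real RL steps.
history = train(
    num_steps=6,
    update_fns={
        "binary_judgment": lambda step: 1.0 / (step + 1),
        "meta_verification": lambda step: 2.0 / (step + 1),
    },
    steps_per_phase=2,
)
order = [name for name, _ in history]
```

Each step touches a single objective, so the gradient for the binary judgment never mixes with the gradient for the bounding box rationale within one update.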

Experiments

The experimental setup involves training OmniVerifier-M1 on the ViVerBench benchmark, which assesses visual outcome verification across diverse tasks. The training utilizes 16 NVIDIA A800-80G GPUs, with 80 epochs, and evaluates performance based on overall accuracy, error localization precision, and computational efficiency. Baseline comparisons include joint training strategies and previous models like OmniVerifier. The experiments analyze the impact of symbolic versus textual rationales, reward design choices, and the decoupling strategy. Additional evaluations are conducted on visual grounding datasets such as RefCOCO to test generalization. Hyperparameters include IoU thresholds, reward weights, and multi-round iteration limits. The robustness of the approach is validated through ablation studies, measuring the influence of each component on validation scores, training stability, and error localization accuracy.
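For readers who want to track the setup in code, the stated details can be collected into a small configuration object. The class and field names are hypothetical; only the model, GPU count and type, epoch count, and benchmark names come from this summary. The hyperparameters that the summary names but does not quantify are deliberately left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ExperimentConfig:
    # Values stated in the summary.
    base_model: str = "Qwen3-VL-8B"
    num_gpus: int = 16
    gpu_type: str = "NVIDIA A800-80G"
    epochs: int = 80
    benchmark: str = "ViVerBench"
    grounding_eval: str = "RefCOCO"
    # Named as hyperparameters in the summary, but no values are reported.
    iou_threshold: Optional[float] = None
    judge_reward_weight: Optional[float] = None
    meta_reward_weight: Optional[float] = None
    max_refinement_rounds: Optional[int] = None

    def unset_fields(self) -> List[str]:
        """Names of hyperparameters that still need a value before a run."""
        return [name for name, value in vars(self).items() if value is None]


config = ExperimentConfig()
missing = config.unset_fields()
```

Checking `missing` before launching a run makes it explicit which settings would have to be taken from the paper itself.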

Results

OmniVerifier-M1 achieves a score of 0.68 on ViVerBench, outperforming the joint training baseline at 0.66. The symbolic bounding box rewards reduce training GPU memory by 20%, cut per-step training time by 15%, and maintain inference efficiency. Error localization accuracy improves by 12% in complex scenes, demonstrating precise error attribution. In visual grounding tasks like RefCOCO, the decoupled training strategy yields an overall accuracy of 0.79, surpassing joint training at 0.78. The ablation studies confirm that symbolic rationales combined with decoupled RL significantly enhance both stability and performance. The model demonstrates strong generalization across tasks, with multi-round self-correction further refining outputs and error localization, leading to more reliable and interpretable verification results.
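To put the reported scores in perspective, the absolute and relative gains over the joint training baseline can be computed directly from the four numbers above. The helper name `gains` is introduced here for illustration.

```python
from typing import Tuple


def gains(new: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain as a fraction of the baseline)."""
    absolute = new - baseline
    return absolute, absolute / baseline


reported = [
    ("ViVerBench", 0.68, 0.66),
    ("RefCOCO", 0.79, 0.78),
]

for name, new, baseline in reported:
    abs_gain, rel_gain = gains(new, baseline)
    print(f"{name}: +{abs_gain:.2f} absolute, {rel_gain:.1%} relative")

# ViVerBench: +0.02 absolute, 3.0% relative
# RefCOCO: +0.01 absolute, 1.3% relative
```

Both margins are small, and this summary does not report variance or confidence intervals, so the size of the effect is best checked against the paper's own tables.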

Applications

This verification framework is highly applicable in autonomous driving for precise scene understanding, in medical imaging for localized anomaly detection, and in industrial quality control for defect identification. Its ability to localize errors regionally and provide actionable feedback makes it suitable for safety-critical applications requiring high reliability. The approach can be integrated into existing multimodal systems to enable self-assessment and iterative correction, reducing human oversight and increasing automation. Long-term, the method could facilitate the development of fully autonomous, interpretable AI systems capable of continuous self-improvement, especially in domains where safety, transparency, and compliance are paramount.

Limitations & Outlook

The reliance on accurate symbolic boundary annotations may limit performance in scenes with occlusion, clutter, or ambiguous error regions. The rule-based reward system, while effective, may require task-specific tuning and may not generalize seamlessly across diverse domains without adaptation. The annotation process for bounding boxes adds data preparation overhead, and automatic generation of such cues remains an open challenge. Additionally, the current framework primarily handles spatial errors, and extending it to more abstract or semantic errors would require further innovation. Future work should focus on reducing annotation costs, enhancing the robustness of symbolic cues, and integrating deep reasoning modules for more complex error types.

Plain Language (accessible to non-experts)

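Imagine you work in a factory. Every time a product comes off the line, the factory's inspector (the model) has to check whether it meets the standard. The traditional approach is like taking a single glance and deciding it looks "good enough", without knowing exactly where anything went wrong. OmniVerifier-M1 is like a sharp inspector who not only tells you whether the product passes but also draws an outline (a symbolic bounding box) around the problem area, for example the part that has a flaw. You can then see directly where the problem is and fix it in a targeted way. Better still, this inspector keeps checking and correcting over several rounds until the product fully meets the requirements. This makes inspection faster, more accurate, and easier to understand. Instead of giving vague answers as before, it uses concrete markings to help you find the error, making the whole inspection process as intuitive as drawing a map. In the end, the system gives the factory's product quality a stronger guarantee and makes the inspector smarter and more reliable.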

Abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.

cs.CL cs.AI cs.CV cs.LG

References (20)

Reward Modeling from Natural Language Human Feedback. Zongqi Wang, Rui Wang, Yuchuan Wu et al. 2026.
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation. Dongzhi Jiang, Renrui Zhang, Haodong Li et al. 2025.
Qwen3-VL Technical Report. Shuai Bai, Yuxuan Cai, Ruizhe Chen et al. 2025.
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning. Zhihong Shao, Yu-Wei Luo, Chengda Lu et al. 2025.
GenExam: A Multidisciplinary Text-to-Image Exam. Zhaokai Wang, Penghao Yin, Xiangyu Zhao et al. 2025.
Seed1.8 Model Card: Towards Generalized Real-World Agency. ByteDance Seed. 2026.
DAPO: An Open-Source LLM Reinforcement Learning System at Scale. Qiying Yu, Zheng Zhang, Ruofei Zhu et al. 2025.
RewardDance: Reward Scaling in Visual Generation. Jie Wu, Yu Gao, Zi-Nuo Ye et al. 2025.
Unlocking Multimodal Mathematical Reasoning via Process Reward Model. Ruilin Luo, Zhuofan Zheng, Yifan Wang et al. 2025.
Iterative Refinement Improves Compositional Image Generation. Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj et al. 2026.
Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision. Luozheng Qin, Jia Gong, Yuqing Sun et al. 2025.
Generative Universal Verifier as Multimodal Meta-Reasoner. Xinchen Zhang, Xiaoying Zhang, Youbin Wu et al. 2025.
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image. Yushi Hu, Reyhane Askari Hemmat, Melissa Hall et al. 2025.
JudgeLRM: Large Reasoning Models as a Judge. Nuo Chen, Zhiyuan Hu, Qingyun Zou et al. 2025.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Adam Suma, Sam Dauncey. 2025.
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning. Le Zhuo, Liangbing Zhao, Sayak Paul et al. 2025.
From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning. Yuzhen Huang, Weihao Zeng, Xingshan Zeng et al. 2025.
Emu3.5: Native Multimodal Models are World Learners. Yufeng Cui, Honghao Chen, Haoge Deng et al. 2025.
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation. Xinchen Zhang, Ling Yang, Guohao Li et al. 2024.
From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning. Ruilin Luo, Chufan Shi, Yizhen Zhang et al. 2026.

