Visual-ERM: Reward Modeling for Visual Equivalence
Visual-ERM enhances vision-to-code tasks with fine-grained visual rewards, significantly outperforming existing models.
Key Findings
Methodology
Visual-ERM is a multimodal generative reward model that evaluates vision-to-code quality directly in the rendered visual space. It integrates modeling of global structure and local visual details, providing fine-grained, interpretable, and task-agnostic feedback. Integrated into RL, Visual-ERM significantly improves Qwen3-VL-8B-Instruct's performance on chart-to-code tasks and yields consistent gains on table and SVG parsing. The model further enhances test-time scaling via reflection and revision.
Key Results
- Visual-ERM improved Qwen3-VL-8B-Instruct's performance by 8.4 points on chart-to-code tasks, offering more precise visual detail evaluation compared to DINO-based rewards.
- On table and SVG parsing tasks, Visual-ERM achieved improvements of 2.7 and 4.1 points respectively, demonstrating its broad applicability across various vision-to-code tasks.
- In the VisualCritic-RewardBench benchmark, Visual-ERM at 8B decisively outperformed Qwen3-VL-235B-Instruct and approached leading closed-source models.
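For context on the DINO-based baseline mentioned above, embedding-similarity rewards typically score a rendered output by cosine similarity between frozen-encoder features. A minimal sketch of that coarse signal, with plain NumPy vectors standing in for real DINO features (an assumption here, not the paper's code):

```python
import numpy as np

def embedding_reward(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Coarse visual reward: cosine similarity between a reference-image
    embedding and a rendered-output embedding.

    In a real pipeline these vectors would come from a frozen vision
    encoder such as DINO; here they are plain NumPy arrays.
    """
    ref = ref_emb / np.linalg.norm(ref_emb)
    gen = gen_emb / np.linalg.norm(gen_emb)
    return float(ref @ gen)

# Identical embeddings score 1.0; the weakness is that a "hacked"
# render can still score high if the encoder ignores fine details.
ref = np.array([1.0, 0.5, 0.2])
print(embedding_reward(ref, ref))
```

Because such a score collapses the whole image into one vector, small but meaningful discrepancies (a wrong tick label, a missing legend entry) can vanish, which is the failure mode Visual-ERM targets.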
Significance
The introduction of Visual-ERM addresses the misalignment of reward signals in existing vision-to-code tasks. By providing fine-grained evaluation directly in the visual space, the model avoids vulnerabilities associated with textual rules or coarse visual embedding similarity, offering more reliable reward signals. This advancement is significant not only for academia but also provides stronger technical support for industrial applications of vision-to-code.
Technical Contribution
Technically, Visual-ERM offers a novel reward modeling framework fundamentally different from existing text-based or visual encoder similarity methods. It achieves fine-grained perception of visual details through a multimodal generative model, providing higher fidelity supervision signals in vision-to-code tasks. This approach not only enhances the model's parsing capabilities but also opens new engineering possibilities for future vision-to-code tasks.
Novelty
Visual-ERM is the first model to provide fine-grained reward signals in the visual space. Compared to existing methods, its innovation lies in its ability to perceive both visual details and embedded text in a multimodal space, surpassing traditional semantic similarity assessments.
Limitations
- Visual-ERM may encounter performance bottlenecks when dealing with complex visual structures, particularly in parsing high-resolution images.
- The model relies heavily on training data, potentially requiring substantial annotated data to realize its potential.
- In certain specific tasks, Visual-ERM's generalization ability may be limited.
Future Work
Future research could explore applying Visual-ERM to more vision-to-code tasks, especially those involving complex visual structures. Improving the model's computational efficiency and generalization also remains an important research direction.
AI Executive Summary
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose the Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM significantly improves Qwen3-VL-8B-Instruct's performance on chart-to-code tasks and yields consistent gains on table and SVG parsing. The model further enhances test-time scaling via reflection and revision.
We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
The introduction of Visual-ERM addresses the misalignment of reward signals in existing vision-to-code tasks. By providing fine-grained evaluation directly in the visual space, the model avoids vulnerabilities associated with textual rules or coarse visual embedding similarity, offering more reliable reward signals. This advancement is significant not only for academia but also provides stronger technical support for industrial applications of vision-to-code.
Technically, Visual-ERM offers a novel reward modeling framework fundamentally different from existing text-based or visual encoder similarity methods. It achieves fine-grained perception of visual details through a multimodal generative model, providing higher fidelity supervision signals in vision-to-code tasks. This approach not only enhances the model's parsing capabilities but also opens new engineering possibilities for future vision-to-code tasks.
However, Visual-ERM may encounter performance bottlenecks on complex visual structures, particularly when parsing high-resolution images. The model also relies heavily on training data, potentially requiring substantial annotation to realize its potential, and its generalization may be limited on certain tasks. Future research could explore applying Visual-ERM to more vision-to-code tasks, especially those involving complex visual structures, and improving its computational efficiency and generalization.
Deep Analysis
Background
In recent years, with advances in computer vision and natural language processing, vision-to-code tasks have become an important research area. The goal of these tasks is to convert structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations like code or markup. Traditional approaches primarily rely on supervised fine-tuning, which requires substantial annotated data and often lacks cross-domain generalization. Recently, reinforcement learning has emerged as a promising alternative, but it introduces challenges due to misaligned reward signals. Existing reward methods either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking.
Core Problem
The core problem in vision-to-code tasks is how to effectively evaluate the similarity between the generated code and the original visual input. Existing methods primarily rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. This leads to models potentially optimizing the wrong objectives during training, affecting final performance. Therefore, developing a reward model that can provide fine-grained evaluation directly in the visual space is key to addressing this issue.
Innovation
The core innovation of Visual-ERM lies in its design as a multimodal generative reward model. Firstly, the model can evaluate vision-to-code quality directly in the rendered visual space, avoiding vulnerabilities associated with textual rules or coarse visual embedding similarity. Secondly, Visual-ERM integrates modeling of global structure and local visual details, providing fine-grained, interpretable, and task-agnostic feedback. This approach not only enhances the model's parsing capabilities but also opens new engineering possibilities for future vision-to-code tasks. Finally, Visual-ERM further enhances test-time scaling via reflection and revision.
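To make "generative reward model" concrete: rather than emitting a similarity score from embeddings, the judge produces a textual critique of the rendered output versus the reference, from which a scalar reward is extracted. The sketch below stubs the judge and assumes a `Score: X/10` convention; both the prompt template and the parsing format are illustrative assumptions, not the paper's actual protocol:

```python
import re

def build_judge_prompt(task: str) -> str:
    # Hypothetical prompt; the paper's actual template is not shown here.
    return (
        f"Compare the reference {task} image with the rendered output.\n"
        "List structural and fine-grained visual discrepancies, then end\n"
        "with a line 'Score: X/10' for overall visual equivalence."
    )

def parse_reward(critique: str) -> float:
    """Extract a scalar reward in [0, 1] from the judge's critique."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)\s*/\s*10", critique)
    if match is None:
        return 0.0  # an unparseable critique yields no reward
    return min(float(match.group(1)) / 10.0, 1.0)

# Stubbed judge output standing in for the multimodal model's response.
critique = (
    "The bar colors match, but the legend is missing one series "
    "and the y-axis ticks differ.\nScore: 7/10"
)
print(parse_reward(critique))  # 0.7
```

The critique text itself is what makes the feedback interpretable: it names the discrepancy rather than only penalizing it.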
Methodology
The implementation of Visual-ERM includes the following key steps:
- Reward Data Generation: Generate reward data through controlled corruption and annotation.
- Supervised Fine-Tuning of the Reward Model: Perform supervised fine-tuning on the generated reward data.
- Integration into RL: Integrate Visual-ERM into the RL pipeline to enhance the model's parsing capabilities.
- Reflection and Revision: Further enhance test-time scaling through reflection and revision.
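The controlled-corruption step above can be pictured as systematically perturbing a ground-truth spec to manufacture negative examples with known discrepancy labels. A toy sketch over a chart-spec dict, where the field names and corruption types are illustrative, not taken from the paper:

```python
import copy
import random

def corrupt_spec(spec: dict, rng: random.Random) -> tuple[dict, str]:
    """Apply one controlled corruption to a chart spec and return the
    corrupted spec plus a label describing the injected discrepancy.

    The corruption types below are illustrative examples.
    """
    corrupted = copy.deepcopy(spec)
    kind = rng.choice(["drop_series", "shift_color", "scale_values"])
    if kind == "drop_series" and len(corrupted["series"]) > 1:
        removed = corrupted["series"].pop()
        return corrupted, f"missing series: {removed['name']}"
    if kind == "shift_color":
        corrupted["series"][0]["color"] = "#FF0000"
        return corrupted, "wrong color on first series"
    corrupted["series"][0]["values"] = [
        v * 1.5 for v in corrupted["series"][0]["values"]
    ]
    return corrupted, "values scaled by 1.5x"

spec = {"series": [{"name": "A", "color": "#1f77b4", "values": [1, 2, 3]},
                   {"name": "B", "color": "#ff7f0e", "values": [2, 4, 6]}]}
bad_spec, label = corrupt_spec(spec, random.Random(0))
print(label)
```

Because each corruption is injected deliberately, the discrepancy label comes for free, giving the reward model supervised examples of exactly the fine-grained differences it must learn to detect.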
Experiments
The experiments cover datasets for multiple vision-to-code tasks, including ChartMimic, OmniDocBench, and UniSVG. We adopt Qwen3-VL-8B-Instruct as the policy model's backbone and GRPO as the RL algorithm, and compare Visual-ERM against DINO-based reward methods, focusing on fine-grained visual discrepancy evaluation.
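GRPO's defining step is that advantages are computed by normalizing rewards within the group of responses sampled for the same prompt, with no learned value function. A minimal sketch of that group-relative normalization (the grouping and z-scoring are standard GRPO; the specific numbers are made up):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each reward against the group of
    responses sampled for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]

# Four sampled renders scored by the reward model for one chart prompt.
advs = group_relative_advantages([0.9, 0.7, 0.4, 0.4])
print([round(a, 3) for a in advs])
```

The quality of these advantages is only as good as the reward that feeds them, which is why swapping a coarse embedding score for Visual-ERM's fine-grained judgment changes the RL outcome.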
Results
Experimental results show that Visual-ERM achieves significant performance improvements across multiple vision-to-code tasks. On chart-to-code tasks, Visual-ERM improved Qwen3-VL-8B-Instruct's performance by 8.4 points. On table and SVG parsing tasks, Visual-ERM achieved improvements of 2.7 and 4.1 points respectively. Additionally, in the VisualCritic-RewardBench benchmark, Visual-ERM at 8B decisively outperformed Qwen3-VL-235B-Instruct and approached leading closed-source models.
Applications
Application scenarios for Visual-ERM include but are not limited to:
- AI-assisted front-end development: Converting UI designs to code.
- Scientific paper parsing: Automatically extracting and parsing charts and data from papers.
- Knowledge management and system integration: Enhancing information accessibility and usability through vision-to-code conversion.
Limitations & Outlook
Despite its strong performance across multiple tasks, Visual-ERM may encounter performance bottlenecks on complex visual structures, particularly when parsing high-resolution images. The model also relies heavily on training data, potentially requiring substantial annotation to realize its potential, and its generalization may be limited on certain tasks. Future research could explore applying Visual-ERM to more vision-to-code tasks, especially those involving complex visual structures, and improving its computational efficiency and generalization.
Plain Language: Accessible to non-experts
Imagine you're in a kitchen cooking. You have a recipe (visual input) that needs to be turned into a delicious dinner (code output). Traditional methods are like following the recipe step by step, but sometimes you might miss some details, like the amount of salt or the cooking time, similar to how existing reward methods fail to capture fine-grained visual discrepancies. Visual-ERM is like an experienced chef who can not only see the instructions on the recipe but also adjust the cooking process by observing the color and taste of the ingredients, ensuring every dish is perfect. This method not only makes your dinner more delicious but also helps you become more adept at cooking in the future.
ELI14: Explained like you're 14
Hey there, buddy! Have you ever wondered how computers turn pictures into code? It's like when you're playing a game and turning your character's moves into commands. Now, there's a new method called Visual-ERM, which is like a super coach in the game, helping the computer better understand the details in pictures. Previous coaches might only see the general moves, but Visual-ERM can see every little move, just like you notice every enemy's move in the game. This way, the computer can turn pictures into code more accurately, just like you beat enemies more accurately in the game! Isn't that cool?
Glossary
Visual-ERM (Visual Equivalence Reward Model)
A multimodal generative reward model that evaluates vision-to-code quality directly in the rendered visual space.
Used to provide fine-grained reward signals in vision-to-code tasks.
LVLMs (Large Vision Language Models)
Large-scale models that combine vision and language understanding and can process multimodal inputs.
Used as the base model for vision-to-code tasks.
RL (Reinforcement Learning)
A machine learning method that guides model learning through reward signals.
Used as a method for training vision-to-code models.
Qwen3-VL-8B-Instruct
An instruction-tuned vision-language model adopted as the policy backbone in the RL experiments.
Used in experiments to evaluate Visual-ERM's performance.
VisualCritic-RewardBench
A benchmark for evaluating fine-grained image-to-image discrepancies on structured visual data.
Used to validate Visual-ERM's fine-grained evaluation capabilities.
DINO (Self-supervised Vision Model)
A self-supervised vision encoder whose embedding similarity serves as a coarse visual reward baseline.
Used as a comparative benchmark for Visual-ERM.
ChartMimic
A dataset for chart-to-code tasks.
Used in experiments to evaluate Visual-ERM's performance.
OmniDocBench
A dataset for table-to-markdown tasks.
Used in experiments to evaluate Visual-ERM's performance.
UniSVG
A dataset for SVG-to-code tasks.
Used in experiments to evaluate Visual-ERM's performance.
GRPO (Group Relative Policy Optimization)
A reinforcement learning algorithm that estimates advantages by normalizing rewards within a group of responses sampled for the same prompt, without a learned value function.
Used as an algorithm for training vision-to-code models.
Open Questions: Unanswered questions from this research
1. How can Visual-ERM's parsing ability for high-resolution images be enhanced without increasing computational complexity? Existing methods may encounter performance bottlenecks when handling complex visual structures, requiring further research on optimization strategies.
2. Visual-ERM heavily relies on training data; how can its performance be maintained in data-scarce scenarios? This requires exploring more efficient data augmentation and transfer learning methods.
3. In certain specific tasks, Visual-ERM's generalization ability may be limited. How can its cross-domain adaptability be improved? This requires research on more robust model architectures.
4. How can Visual-ERM's computational efficiency be further optimized to reduce resource consumption in practical applications? This requires breakthroughs in model compression and acceleration technologies.
5. What is the potential of Visual-ERM in multimodal generative models? This requires exploring its applicability and performance in other multimodal tasks.
Applications
Immediate Applications
Front-end Development
Visual-ERM can be used to automatically convert UI designs into code, improving development efficiency.
Scientific Paper Parsing
By automatically extracting and parsing charts and data from papers, Visual-ERM can accelerate scientific research progress.
Knowledge Management
Through vision-to-code conversion, Visual-ERM can enhance information accessibility and usability, aiding knowledge management and system integration.
Long-term Vision
Intelligent Design Tools
Visual-ERM has the potential to become a core component of intelligent design tools, automatically generating code that meets design specifications.
Automated Data Analysis
By converting complex visual data into structured information, Visual-ERM can drive the development of automated data analysis.
Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.