MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
MM-CondChain is a programmatically verified benchmark for visually grounded deep compositional reasoning, built on a Verifiable Programmatic Intermediate Representation (VPIR); the strongest evaluated model reaches only 53.33 Path F1.
Key Findings
Methodology
This study introduces MM-CondChain, a benchmark for evaluating multimodal large language models' (MLLMs) capability in visually grounded deep compositional reasoning. The methodology involves an agentic synthesis pipeline comprising a Planner for layer-by-layer generation of compositional conditions and a Verifiable Programmatic Intermediate Representation (VPIR) to ensure mechanical verifiability of each layer's condition. A Composer then assembles these verified layers into complete instructions.
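To make the VPIR idea concrete, here is a minimal sketch of what a mechanically verifiable compositional condition could look like: a small predicate tree checked against structured scene annotations. The class names (`Exists`, `And`) and the scene format are illustrative assumptions, not the paper's actual representation.

```python
from dataclasses import dataclass

@dataclass
class Exists:
    """Atomic predicate: some object matches every given attribute (assumed format)."""
    attrs: dict

    def check(self, scene: dict) -> bool:
        return any(all(o.get(k) == v for k, v in self.attrs.items())
                   for o in scene["objects"])

@dataclass
class And:
    """Compositional predicate: both sub-conditions must hold."""
    left: object
    right: object

    def check(self, scene: dict) -> bool:
        return self.left.check(scene) and self.right.check(scene)

# "a permission dialog appears AND the interface is green"
cond = And(Exists({"name": "permission_dialog"}),
           Exists({"name": "interface", "color": "green"}))

scene = {"objects": [{"name": "permission_dialog"},
                     {"name": "interface", "color": "green"}]}
print(cond.check(scene))  # a ground-truth checker can decide this mechanically
```

Because the condition is a program rather than free text, a verifier can evaluate it deterministically against annotations, which is what rules out logical conflicts and ambiguous visual references.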
Key Results
- Experiments across three visual domains (natural images, data charts, and GUI trajectories) show that even the strongest model achieves only 53.33 Path F1, with performance dropping sharply as depth or predicate complexity increases.
- Models perform significantly worse on False-path than on True-path, indicating a tendency to assume conditions hold under complex scenarios.
- In the GUI domain, models perform worst, with a best Path F1 of only 40.19, lower than in natural images and data charts.
Significance
This research fills a gap in existing benchmarks: MM-CondChain systematically evaluates the deep compositional reasoning capabilities of MLLMs. By requiring the verification of multi-factor visual conditions, the benchmark provides a more comprehensive assessment framework, supporting progress in visual reasoning, especially in complex visual workflows.
Technical Contribution
The technical contributions of MM-CondChain include the innovative use of VPIR to ensure mechanical verifiability of each layer's condition, avoiding logical conflicts and unclear visual references. Additionally, the agentic synthesis pipeline allows for scalable construction of complex workflow-style data.
Novelty
MM-CondChain is the first benchmark to systematically evaluate deeply compositional visual conditions. Its core novelty is the Verifiable Programmatic Intermediate Representation (VPIR), which guarantees the logical consistency that existing benchmarks lack.
Limitations
- Current models still perform poorly in deep compositional reasoning, especially on False-path, indicating deficiencies in detecting violated conditions.
- Performance in the GUI domain is the worst, likely due to the need for reasoning over multi-frame trajectories, user actions, and interface state transitions.
Future Work
Future research directions include improving models' performance in deep compositional reasoning, particularly in False-path accuracy. Additionally, exploring more complex visual domains and conditions could further challenge and enhance models' reasoning capabilities.
AI Executive Summary
In recent years, multimodal large language models (MLLMs) have been increasingly applied in visual workflows, such as navigating graphical user interfaces (GUIs). However, existing benchmarks primarily focus on shallow compositions or independent constraints, neglecting the evaluation of deeply chained compositional conditionals. Against this backdrop, this paper introduces MM-CondChain, a benchmark for visually grounded deep compositional reasoning.
Each benchmark instance in MM-CondChain is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the execution path to the final outcome. To scalably construct such workflow-style data, the authors propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions.
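The multi-layer chain described above can be pictured as a small interpreter: each layer's condition selects a branch, and the model's answer is the outcome at the end of the path. The layer and predicate structures below are assumptions made for illustration; the benchmark's actual instance format may differ.

```python
def follow_path(layers, scene, predicates):
    """Walk the chain: each layer's condition picks the branch to take."""
    step, path = 0, []
    while True:
        layer = layers[step]
        holds = predicates[layer["cond"]](scene)
        path.append(holds)
        branch = layer["if_true"] if holds else layer["if_false"]
        if isinstance(branch, str):   # terminal outcome, e.g. an action name
            return branch, path
        step = branch                 # otherwise: index of the next layer

layers = [
    {"cond": "dialog_is_green", "if_true": 1, "if_false": "stop"},
    {"cond": "allow_enabled", "if_true": "click_allow", "if_false": "wait"},
]
predicates = {
    "dialog_is_green": lambda s: s.get("dialog_color") == "green",
    "allow_enabled":   lambda s: s.get("allow_enabled", False),
}
scene = {"dialog_color": "green", "allow_enabled": True}
outcome, path = follow_path(layers, scene, predicates)
print(outcome, path)  # click_allow [True, True]
```

Answering correctly requires every per-layer condition to be judged right: a single misperceived condition diverts the execution path, which is why errors compound with depth.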
Using this pipeline, the authors construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experimental results show that even the strongest model attains only 53.33 Path F1, with sharp drops as depth or predicate complexity grows. This finding confirms that visually grounded deep compositional reasoning remains a fundamental challenge.
The introduction of MM-CondChain fills a gap in existing benchmarks by providing a comprehensive assessment framework for MLLMs' capabilities in verifying multi-factor visual conditions. This is significant for advancing the field of visual reasoning, especially in complex visual workflows.
Nevertheless, current models still perform poorly in deep compositional reasoning, particularly on False-path, indicating deficiencies in detecting violated conditions. Future research directions include improving models' performance in deep compositional reasoning, particularly in False-path accuracy. Additionally, exploring more complex visual domains and conditions could further challenge and enhance models' reasoning capabilities.
Deep Analysis
Background
Multimodal large language models (MLLMs) have shown great potential in visual reasoning tasks in recent years. As technology advances, these models are expected to go beyond simple visual question answering and tackle complex visual workflows. However, existing benchmarks mostly focus on shallow compositions or independent constraints, neglecting the evaluation of deep compositional conditions. This lack of depth evaluation limits our comprehensive understanding of MLLMs' capabilities in complex visual tasks. To address this gap, this paper introduces MM-CondChain, a benchmark specifically designed to evaluate visually grounded deep compositional reasoning capabilities.
Core Problem
Existing benchmarks fall short in evaluating MLLMs' deep compositional reasoning capabilities. Specifically, these benchmarks typically involve single-layer compositions or independent constraints, without systematically exploring multi-layer compositional reasoning capabilities. This lack of depth evaluation limits our comprehensive understanding of MLLMs' capabilities in complex visual tasks.
Innovation
The core innovations of MM-CondChain lie in its agentic synthesis pipeline and Verifiable Programmatic Intermediate Representation (VPIR).
- Agentic synthesis pipeline: A Planner generates compositional conditions layer by layer, ensuring logical consistency of each layer.
- Verifiable Programmatic Intermediate Representation (VPIR): Ensures mechanical verifiability of each layer's condition, avoiding logical conflicts and unclear visual references.
- Composer: Assembles verified layers into complete instructions, ensuring the integrity and accuracy of instructions.
Methodology
The construction process of MM-CondChain includes the following key steps:
- Planner: Generates compositional conditions layer by layer, ensuring logical consistency of each layer.
- Verifiable Programmatic Intermediate Representation (VPIR): Mechanically verifies each layer's condition, avoiding logical conflicts and unclear visual references.
- Composer: Assembles verified layers into complete instructions, ensuring their integrity and accuracy.
- Dataset construction: Benchmarks are built across three visual domains: natural images, data charts, and GUI trajectories.
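The steps above can be sketched as a propose-verify-assemble loop. This toy version substitutes random sampling for the paper's agentic (LLM-driven) Planner; all function names and interfaces here are illustrative assumptions, not the authors' code.

```python
import random

def plan_layer(rng, scene):
    """Planner stand-in: propose a candidate condition grounded in the scene."""
    obj = rng.choice(scene["objects"])
    attr, value = rng.choice(sorted(obj.items()))
    return {"object": obj["name"], "attr": attr, "value": value}

def vpir_verify(cond, scene):
    """VPIR stand-in: the condition must be mechanically decidable on the scene."""
    return any(o["name"] == cond["object"] and o.get(cond["attr"]) == cond["value"]
               for o in scene["objects"])

def compose(layers):
    """Composer stand-in: assemble verified layers into one instruction string."""
    steps = [f"if {c['object']}.{c['attr']} == {c['value']!r}" for c in layers]
    return ", then ".join(steps)

rng = random.Random(0)
scene = {"objects": [{"name": "dialog", "color": "green"},
                     {"name": "allow_button", "state": "enabled"}]}
layers = []
while len(layers) < 2:
    candidate = plan_layer(rng, scene)
    if vpir_verify(candidate, scene):  # discard layers that fail verification
        layers.append(candidate)
instruction = compose(layers)
print(instruction)
```

The key design choice this illustrates is that verification sits between generation and assembly: only layers that pass the mechanical check ever reach the Composer, so unverifiable conditions cannot enter the benchmark.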
Experiments
The experimental design tests a range of MLLMs across three visual domains: natural images, data charts, and GUI trajectories, with source data drawn from existing datasets such as SAM, GQA, ChartQA, and AITZ. Evaluation metrics include True-path and False-path accuracy, as well as the average Path F1. Ablation studies further probe how performance varies with reasoning depth and predicate complexity.
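The metric structure can be sketched as follows, assuming Path F1 is the harmonic mean of True-path and False-path accuracy (the paper's exact definition may differ):

```python
def path_f1(true_path_acc: float, false_path_acc: float) -> float:
    """Harmonic mean of True-path and False-path accuracy (assumed definition)."""
    if true_path_acc + false_path_acc == 0:
        return 0.0
    return 2 * true_path_acc * false_path_acc / (true_path_acc + false_path_acc)

# A model that tends to assume conditions hold scores high on True-path but
# low on False-path, and a harmonic mean penalizes that imbalance:
print(round(path_f1(0.90, 0.30), 4))  # 0.45
```

Whatever the precise formula, combining both branches into one score is what makes the metric sensitive to the False-path weakness the results highlight.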
Results
Experimental results show that even the strongest model achieves only 53.33 Path F1, with performance dropping sharply as depth or predicate complexity increases. Models perform significantly worse on False-path than on True-path, indicating a tendency to assume conditions hold under complex scenarios. In the GUI domain, models perform the worst, with the best F1 being only 40.19, lower than in natural images and data charts.
Applications
Application scenarios of MM-CondChain include evaluating MLLMs' capabilities in complex visual tasks. This is significant for applications requiring precise visual reasoning, such as autonomous driving, intelligent surveillance, and human-computer interaction. By verifying multi-factor visual conditions, this benchmark provides a more comprehensive assessment framework for MLLMs' capabilities.
Limitations & Outlook
Despite providing a comprehensive assessment framework, current models still perform poorly in deep compositional reasoning, particularly on False-path. Additionally, performance in the GUI domain is the worst, likely due to the need for reasoning over multi-frame trajectories, user actions, and interface state transitions. Future research directions include improving models' performance in deep compositional reasoning, particularly in False-path accuracy.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen preparing a complex dish. First, you need to gather all the ingredients according to the recipe, similar to the Planner in MM-CondChain, which generates compositional conditions layer by layer. Next, you must ensure each step is followed precisely, like chopping vegetables or boiling water, akin to the Verifiable Programmatic Intermediate Representation (VPIR) ensuring mechanical verifiability of each condition. Finally, you combine all the steps to complete the dish, like the Composer assembling verified layers into complete instructions. The entire process requires precise operations and strict adherence to each step to ensure the final dish is delicious, just as MM-CondChain ensures logical consistency and accuracy of each condition.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super complex game. This game has many levels, each with different tasks, like finding hidden treasures or solving puzzles. To pass each level, you need to complete each task step by step, just like the Planner in MM-CondChain helps you generate tasks layer by layer. Each task has specific rules you must follow, like the Verifiable Programmatic Intermediate Representation (VPIR) ensuring each task's rules are correctly executed. Finally, you need to complete all tasks to win the game, just like the Composer assembling all verified tasks into a complete victory plan. This game requires you to carefully observe and accurately execute each task to ultimately win!
Glossary
Multimodal Large Language Models (MLLMs)
MLLMs are artificial intelligence models capable of processing and understanding multiple data modalities, such as text, images, and audio.
In this paper, MLLMs are used to evaluate their capabilities in visually grounded deep compositional reasoning.
Compositional Reasoning
Compositional reasoning involves reasoning and decision-making by combining multiple conditions or factors.
MM-CondChain evaluates models' reasoning capabilities by combining multiple visual conditions.
Verifiable Programmatic Intermediate Representation (VPIR)
VPIR is a programmatic representation that makes each layer's condition mechanically verifiable, ensuring logical consistency.
VPIR is used in MM-CondChain to verify the correctness of each layer's condition.
Planner
A Planner is a component responsible for generating compositional conditions layer by layer, ensuring logical consistency of each layer.
In MM-CondChain, the Planner generates compositional conditions for each benchmark instance.
Composer
A Composer is a component that assembles verified layers into complete instructions, ensuring the integrity and accuracy of instructions.
In MM-CondChain, the Composer assembles verified layers into complete instructions.
Path F1
Path F1 combines True-path and False-path performance into a single score of overall path-following accuracy.
In experiments, Path F1 is used to evaluate model performance under different conditions.
True-path
True-path refers to the execution path the model must follow when all conditions hold.
True-path accuracy is used to evaluate model performance when conditions hold.
False-path
False-path refers to the execution path the model must follow when a condition is minimally perturbed.
False-path accuracy is used to evaluate model performance when conditions do not hold.
Deep Compositional Reasoning
Deep compositional reasoning involves reasoning and decision-making through multi-layer compositional conditions.
MM-CondChain evaluates models' capabilities in deep compositional reasoning.
Visual Workflow
A visual workflow refers to a series of tasks or steps requiring visual input and reasoning.
In this paper, visual workflows are used to evaluate models' capabilities in complex visual tasks.
Open Questions (Unanswered questions from this research)
1. Despite providing a comprehensive assessment framework, current models still perform poorly in deep compositional reasoning, particularly on False-path. Future research needs to explore improving models' capabilities in detecting violated conditions.
2. In the GUI domain, models perform the worst, likely due to the need for reasoning over multi-frame trajectories, user actions, and interface state transitions. Future research could explore more complex visual domains and conditions to further challenge and enhance models' reasoning capabilities.
3. Current benchmarks focus mainly on three visual domains: natural images, data charts, and GUI trajectories. Future research could explore other visual domains, such as video analysis and 3D scene understanding, to evaluate models' capabilities in a broader range of visual tasks.
4. While VPIR ensures mechanical verifiability of each layer's condition, handling uncertainty and noise in real-world applications remains a challenge. Future research could explore more robust verification methods.
5. In training multimodal large language models, effectively combining visual and language information to enhance reasoning capabilities remains an open question. Future research could explore more effective multimodal fusion methods.
Applications
Immediate Applications
Autonomous Driving
MM-CondChain can be used to evaluate autonomous driving systems' decision-making capabilities in complex traffic environments, ensuring systems can correctly identify and respond to various visual conditions.
Intelligent Surveillance
In intelligent surveillance systems, MM-CondChain can be used to evaluate systems' event detection and response capabilities in complex scenarios, ensuring accuracy and reliability.
Human-Computer Interaction
MM-CondChain can be used to evaluate human-computer interaction systems' response capabilities in complex interfaces, ensuring systems can correctly understand and respond to multimodal inputs.
Long-term Vision
Comprehensive Evaluation of Visual Reasoning
MM-CondChain can serve as a standard framework for comprehensively evaluating MLLMs' reasoning capabilities in various visual tasks, advancing the field of visual reasoning.
Improvement of Multimodal Fusion Methods
Through the evaluation results of MM-CondChain, more effective multimodal fusion methods can be explored and improved to enhance models' reasoning capabilities in complex visual tasks.
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
References (20)
MM-IFEngine: Towards Multimodal Instruction Following
Shengyuan Ding, Shenxi Wu, Xiangyu Zhao et al.
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
Yuxin Jiang, Yufei Wang, Xingshan Zeng et al.
Benchmarking Complex Instruction-Following with Multiple Constraints Composition
Bosi Wen, Pei Ke, Xiaotao Gu et al.
Generalizing Verifiable Instruction Following
Valentina Pyatkin, Saumya Malik, Victoria Graf et al.
Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking
Pengxiang Li, Shilin Yan, Joey Tsai et al.
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra et al.
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan, Jiaming Han, Joey Tsai et al.
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao, Shengju Qian, Han Xiao et al.
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, J. Fauconnier et al.
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma et al.
ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations
Xuecheng Wu, Jiaxing Liu, Danlei Huang et al.
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua, Yunlong Tang, Ziyun Zeng et al.
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Qinyan Zhang, Xinping Lei, Ruijie Miao et al.
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar, Dinura Dissanayake, Ketan More et al.
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
Drew A. Hudson, Christopher D. Manning
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia et al.
An Explainable Toolbox for Evaluating Pre-trained Vision-Language Models
Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu et al.