MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
MM-CondChain is a programmatically verified benchmark for visually grounded deep compositional reasoning, built on a Verifiable Programmatic Intermediate Representation (VPIR); the strongest evaluated model reaches only 53.33 Path F1.
Key Findings
Methodology
This study introduces MM-CondChain, a benchmark for evaluating multimodal large language models' (MLLMs) capability in visually grounded deep compositional reasoning. The methodology involves an agentic synthesis pipeline comprising a Planner for layer-by-layer generation of compositional conditions and a Verifiable Programmatic Intermediate Representation (VPIR) to ensure mechanical verifiability of each layer's condition. A Composer then assembles these verified layers into complete instructions.
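To make the VPIR idea concrete, here is a minimal sketch of what a mechanically verifiable compositional condition could look like: a small predicate tree checked against structured scene annotations. The class names (`Exists`, `And`) and the scene format are illustrative assumptions, not the paper's actual representation.

```python
from dataclasses import dataclass

@dataclass
class Exists:
    """Atomic predicate: some object matches every given attribute (assumed format)."""
    attrs: dict

    def check(self, scene: dict) -> bool:
        return any(all(o.get(k) == v for k, v in self.attrs.items())
                   for o in scene["objects"])

@dataclass
class And:
    """Compositional predicate: both sub-conditions must hold."""
    left: object
    right: object

    def check(self, scene: dict) -> bool:
        return self.left.check(scene) and self.right.check(scene)

# "a permission dialog appears AND the interface is green"
cond = And(Exists({"name": "permission_dialog"}),
           Exists({"name": "interface", "color": "green"}))

scene = {"objects": [{"name": "permission_dialog"},
                     {"name": "interface", "color": "green"}]}
print(cond.check(scene))  # a ground-truth checker can decide this mechanically
```

Because the condition is a program rather than free text, a verifier can evaluate it deterministically against annotations, which is what rules out logical conflicts and ambiguous visual references.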
Key Results
- Experiments across three visual domains (natural images, data charts, and GUI trajectories) show that even the strongest model achieves only 53.33 Path F1, with performance dropping sharply as depth or predicate complexity increases.
- Models perform significantly worse on False-path than on True-path, indicating a tendency to assume conditions hold under complex scenarios.
- In the GUI domain, models perform worst, with a best Path F1 of only 40.19, lower than in natural images and data charts.
Significance
This research fills a gap in existing benchmarks: MM-CondChain systematically evaluates the deep compositional reasoning capabilities of MLLMs. By requiring the verification of multi-factor visual conditions, the benchmark provides a more comprehensive assessment framework, supporting progress in visual reasoning, especially in complex visual workflows.
Technical Contribution
The technical contributions of MM-CondChain include the innovative use of VPIR to ensure mechanical verifiability of each layer's condition, avoiding logical conflicts and unclear visual references. Additionally, the agentic synthesis pipeline allows for scalable construction of complex workflow-style data.
Novelty
MM-CondChain is the first benchmark to systematically evaluate deeply compositional visual conditions. Its core novelty is the Verifiable Programmatic Intermediate Representation (VPIR), which guarantees the logical consistency that existing benchmarks lack.
Limitations
- Current models still perform poorly in deep compositional reasoning, especially on False-path, indicating deficiencies in detecting violated conditions.
- Performance in the GUI domain is the worst, likely due to the need for reasoning over multi-frame trajectories, user actions, and interface state transitions.
Future Work
Future research directions include improving models' performance in deep compositional reasoning, particularly in False-path accuracy. Additionally, exploring more complex visual domains and conditions could further challenge and enhance models' reasoning capabilities.
AI Executive Summary
In recent years, multimodal large language models (MLLMs) have been increasingly applied in visual workflows, such as navigating graphical user interfaces (GUIs). However, existing benchmarks primarily focus on shallow compositions or independent constraints, neglecting the evaluation of deeply chained compositional conditionals. Against this backdrop, this paper introduces MM-CondChain, a benchmark for visually grounded deep compositional reasoning.
Each benchmark instance in MM-CondChain is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the execution path to the final outcome. To scalably construct such workflow-style data, the authors propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions.
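The multi-layer chain described above can be pictured as a small interpreter: each layer's condition selects a branch, and the model's answer is the outcome at the end of the path. The layer and predicate structures below are assumptions made for illustration; the benchmark's actual instance format may differ.

```python
def follow_path(layers, scene, predicates):
    """Walk the chain: each layer's condition picks the branch to take."""
    step, path = 0, []
    while True:
        layer = layers[step]
        holds = predicates[layer["cond"]](scene)
        path.append(holds)
        branch = layer["if_true"] if holds else layer["if_false"]
        if isinstance(branch, str):   # terminal outcome, e.g. an action name
            return branch, path
        step = branch                 # otherwise: index of the next layer

layers = [
    {"cond": "dialog_is_green", "if_true": 1, "if_false": "stop"},
    {"cond": "allow_enabled", "if_true": "click_allow", "if_false": "wait"},
]
predicates = {
    "dialog_is_green": lambda s: s.get("dialog_color") == "green",
    "allow_enabled":   lambda s: s.get("allow_enabled", False),
}
scene = {"dialog_color": "green", "allow_enabled": True}
outcome, path = follow_path(layers, scene, predicates)
print(outcome, path)  # click_allow [True, True]
```

Answering correctly requires every per-layer condition to be judged right: a single misperceived condition diverts the execution path, which is why errors compound with depth.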
Using this pipeline, the authors construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experimental results show that even the strongest model attains only 53.33 Path F1, with sharp drops as depth or predicate complexity grows. This finding confirms that visually grounded deep compositional reasoning remains a fundamental challenge.
The introduction of MM-CondChain fills a gap in existing benchmarks by providing a comprehensive assessment framework for MLLMs' capabilities in verifying multi-factor visual conditions. This is significant for advancing the field of visual reasoning, especially in complex visual workflows.
Nevertheless, current models still perform poorly in deep compositional reasoning, particularly on False-path, indicating deficiencies in detecting violated conditions. Future research directions include improving models' performance in deep compositional reasoning, particularly in False-path accuracy. Additionally, exploring more complex visual domains and conditions could further challenge and enhance models' reasoning capabilities.
Deep Analysis
Background
Multimodal large language models (MLLMs) have shown great potential in visual reasoning tasks in recent years. As technology advances, these models are expected to go beyond simple visual question answering and tackle complex visual workflows. However, existing benchmarks mostly focus on shallow compositions or independent constraints, neglecting the evaluation of deep compositional conditions. This lack of depth evaluation limits our comprehensive understanding of MLLMs' capabilities in complex visual tasks. To address this gap, this paper introduces MM-CondChain, a benchmark specifically designed to evaluate visually grounded deep compositional reasoning capabilities.
Core Problem
Existing benchmarks fall short in evaluating MLLMs' deep compositional reasoning capabilities. Specifically, these benchmarks typically involve single-layer compositions or independent constraints, without systematically exploring multi-layer compositional reasoning capabilities. This lack of depth evaluation limits our comprehensive understanding of MLLMs' capabilities in complex visual tasks.
Innovation
The core innovations of MM-CondChain lie in its agentic synthesis pipeline and Verifiable Programmatic Intermediate Representation (VPIR).
- Agentic synthesis pipeline: A Planner generates compositional conditions layer by layer, ensuring logical consistency of each layer.
- Verifiable Programmatic Intermediate Representation (VPIR): Ensures mechanical verifiability of each layer's condition, avoiding logical conflicts and unclear visual references.
- Composer: Assembles verified layers into complete instructions, ensuring the integrity and accuracy of instructions.
Methodology
The construction process of MM-CondChain includes the following key steps:
- Planner: Generates compositional conditions layer by layer, ensuring logical consistency of each layer.
- Verifiable Programmatic Intermediate Representation (VPIR): Mechanically verifies each layer's condition, avoiding logical conflicts and unclear visual references.
- Composer: Assembles verified layers into complete instructions, ensuring their integrity and accuracy.
- Dataset construction: Benchmarks are built across three visual domains: natural images, data charts, and GUI trajectories.
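The steps above can be sketched as a propose-verify-assemble loop. This toy version substitutes random sampling for the paper's agentic (LLM-driven) Planner; all function names and interfaces here are illustrative assumptions, not the authors' code.

```python
import random

def plan_layer(rng, scene):
    """Planner stand-in: propose a candidate condition grounded in the scene."""
    obj = rng.choice(scene["objects"])
    attr, value = rng.choice(sorted(obj.items()))
    return {"object": obj["name"], "attr": attr, "value": value}

def vpir_verify(cond, scene):
    """VPIR stand-in: the condition must be mechanically decidable on the scene."""
    return any(o["name"] == cond["object"] and o.get(cond["attr"]) == cond["value"]
               for o in scene["objects"])

def compose(layers):
    """Composer stand-in: assemble verified layers into one instruction string."""
    steps = [f"if {c['object']}.{c['attr']} == {c['value']!r}" for c in layers]
    return ", then ".join(steps)

rng = random.Random(0)
scene = {"objects": [{"name": "dialog", "color": "green"},
                     {"name": "allow_button", "state": "enabled"}]}
layers = []
while len(layers) < 2:
    candidate = plan_layer(rng, scene)
    if vpir_verify(candidate, scene):  # discard layers that fail verification
        layers.append(candidate)
instruction = compose(layers)
print(instruction)
```

The key design choice this illustrates is that verification sits between generation and assembly: only layers that pass the mechanical check ever reach the Composer, so unverifiable conditions cannot enter the benchmark.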
Experiments
The experimental design tests a range of MLLMs across three visual domains: natural images, data charts, and GUI trajectories, with source data drawn from existing datasets such as SAM, GQA, ChartQA, and AITZ. Evaluation metrics include True-path and False-path accuracy, as well as the average Path F1. Ablation studies further probe how performance varies with reasoning depth and predicate complexity.
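The metric structure can be sketched as follows, assuming Path F1 is the harmonic mean of True-path and False-path accuracy (the paper's exact definition may differ):

```python
def path_f1(true_path_acc: float, false_path_acc: float) -> float:
    """Harmonic mean of True-path and False-path accuracy (assumed definition)."""
    if true_path_acc + false_path_acc == 0:
        return 0.0
    return 2 * true_path_acc * false_path_acc / (true_path_acc + false_path_acc)

# A model that tends to assume conditions hold scores high on True-path but
# low on False-path, and a harmonic mean penalizes that imbalance:
print(round(path_f1(0.90, 0.30), 4))  # 0.45
```

Whatever the precise formula, combining both branches into one score is what makes the metric sensitive to the False-path weakness the results highlight.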
Results
Experimental results show that even the strongest model achieves only 53.33 Path F1, with performance dropping sharply as depth or predicate complexity increases. Models perform significantly worse on False-path than on True-path, indicating a tendency to assume conditions hold under complex scenarios. In the GUI domain, models perform the worst, with the best F1 being only 40.19, lower than in natural images and data charts.
Applications
Application scenarios of MM-CondChain include evaluating MLLMs' capabilities in complex visual tasks. This is significant for applications requiring precise visual reasoning, such as autonomous driving, intelligent surveillance, and human-computer interaction. By verifying multi-factor visual conditions, this benchmark provides a more comprehensive assessment framework for MLLMs' capabilities.
Limitations & Outlook
Despite providing a comprehensive assessment framework, current models still perform poorly in deep compositional reasoning, particularly on False-path. Additionally, performance in the GUI domain is the worst, likely due to the need for reasoning over multi-frame trajectories, user actions, and interface state transitions. Future research directions include improving models' performance in deep compositional reasoning, particularly in False-path accuracy.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen preparing a complex dish. First, you need to gather all the ingredients according to the recipe, similar to the Planner in MM-CondChain, which generates compositional conditions layer by layer. Next, you must ensure each step is followed precisely, like chopping vegetables or boiling water, akin to the Verifiable Programmatic Intermediate Representation (VPIR) ensuring mechanical verifiability of each condition. Finally, you combine all the steps to complete the dish, like the Composer assembling verified layers into complete instructions. The entire process requires precise operations and strict adherence to each step to ensure the final dish is delicious, just as MM-CondChain ensures logical consistency and accuracy of each condition.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super complex game. This game has many levels, each with different tasks, like finding hidden treasures or solving puzzles. To pass each level, you need to complete each task step by step, just like the Planner in MM-CondChain helps you generate tasks layer by layer. Each task has specific rules you must follow, like the Verifiable Programmatic Intermediate Representation (VPIR) ensuring each task's rules are correctly executed. Finally, you need to complete all tasks to win the game, just like the Composer assembling all verified tasks into a complete victory plan. This game requires you to carefully observe and accurately execute each task to ultimately win!
Glossary
Multimodal Large Language Models (MLLMs)
MLLMs are artificial intelligence models capable of processing and understanding multiple data modalities, such as text, images, and audio.
In this paper, MLLMs are used to evaluate their capabilities in visually grounded deep compositional reasoning.
Compositional Reasoning
Compositional reasoning involves reasoning and decision-making by combining multiple conditions or factors.
MM-CondChain evaluates models' reasoning capabilities by combining multiple visual conditions.
Verifiable Programmatic Intermediate Representation (VPIR)
VPIR is a programmatic representation that makes each layer's condition mechanically verifiable, ensuring logical consistency.
VPIR is used in MM-CondChain to verify the correctness of each layer's condition.
Planner
A Planner is a component responsible for generating compositional conditions layer by layer, ensuring logical consistency of each layer.
In MM-CondChain, the Planner generates compositional conditions for each benchmark instance.
Composer
A Composer is a component that assembles verified layers into complete instructions, ensuring the integrity and accuracy of instructions.
In MM-CondChain, the Composer assembles verified layers into complete instructions.
Path F1
Path F1 combines True-path and False-path performance into a single score of overall path-following accuracy.
In experiments, Path F1 is used to evaluate model performance under different conditions.
True-path
True-path refers to the execution path the model must follow when all conditions hold.
True-path accuracy is used to evaluate model performance when conditions hold.
False-path
False-path refers to the execution path the model must follow when a condition is minimally perturbed.
False-path accuracy is used to evaluate model performance when conditions do not hold.
Deep Compositional Reasoning
Deep compositional reasoning involves reasoning and decision-making through multi-layer compositional conditions.
MM-CondChain evaluates models' capabilities in deep compositional reasoning.
Visual Workflow
A visual workflow refers to a series of tasks or steps requiring visual input and reasoning.
In this paper, visual workflows are used to evaluate models' capabilities in complex visual tasks.
Open Questions (Unanswered questions from this research)
1. Despite providing a comprehensive assessment framework, current models still perform poorly in deep compositional reasoning, particularly on False-path. Future research needs to explore improving models' capabilities in detecting violated conditions.
2. In the GUI domain, models perform the worst, likely due to the need for reasoning over multi-frame trajectories, user actions, and interface state transitions. Future research could explore more complex visual domains and conditions to further challenge and enhance models' reasoning capabilities.
3. Current benchmarks focus mainly on three visual domains: natural images, data charts, and GUI trajectories. Future research could explore other visual domains, such as video analysis and 3D scene understanding, to evaluate models' capabilities in a broader range of visual tasks.
4. While VPIR ensures mechanical verifiability of each layer's condition, handling uncertainty and noise in real-world applications remains a challenge. Future research could explore more robust verification methods.
5. In training multimodal large language models, effectively combining visual and language information to enhance reasoning capabilities remains an open question. Future research could explore more effective multimodal fusion methods.
Applications
Immediate Applications
Autonomous Driving
MM-CondChain can be used to evaluate autonomous driving systems' decision-making capabilities in complex traffic environments, ensuring systems can correctly identify and respond to various visual conditions.
Intelligent Surveillance
In intelligent surveillance systems, MM-CondChain can be used to evaluate systems' event detection and response capabilities in complex scenarios, ensuring accuracy and reliability.
Human-Computer Interaction
MM-CondChain can be used to evaluate human-computer interaction systems' response capabilities in complex interfaces, ensuring systems can correctly understand and respond to multimodal inputs.
Long-term Vision
Comprehensive Evaluation of Visual Reasoning
MM-CondChain can serve as a standard framework for comprehensively evaluating MLLMs' reasoning capabilities in various visual tasks, advancing the field of visual reasoning.
Improvement of Multimodal Fusion Methods
Through the evaluation results of MM-CondChain, more effective multimodal fusion methods can be explored and improved to enhance models' reasoning capabilities in complex visual tasks.
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
References (20)
MM-IFEngine: Towards Multimodal Instruction Following
Shengyuan Ding, Shenxi Wu, Xiangyu Zhao et al.
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
Yuxin Jiang, Yufei Wang, Xingshan Zeng et al.
Benchmarking Complex Instruction-Following with Multiple Constraints Composition
Bosi Wen, Pei Ke, Xiaotao Gu et al.
Generalizing Verifiable Instruction Following
Valentina Pyatkin, Saumya Malik, Victoria Graf et al.
Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking
Pengxiang Li, Shilin Yan, Joey Tsai et al.
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra et al.
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan, Jiaming Han, Joey Tsai et al.
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao, Shengju Qian, Han Xiao et al.
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, J. Fauconnier et al.
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma et al.
ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations
Xuecheng Wu, Jiaxing Liu, Danlei Huang et al.
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua, Yunlong Tang, Ziyun Zeng et al.
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Qinyan Zhang, Xinping Lei, Ruijie Miao et al.
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar, Dinura Dissanayake, Ketan More et al.
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
Drew A. Hudson, Christopher D. Manning
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia et al.
An Explainable Toolbox for Evaluating Pre-trained Vision-Language Models
Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu et al.