EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
EndoCoT activates MLLMs' reasoning potential, achieving 92.1% average accuracy, 8.3 percentage points above the strongest baseline.
Key Findings
Methodology
The EndoCoT framework activates the reasoning potential of Multimodal Large Language Models (MLLMs) by refining latent thought states through an iterative thought guidance module. This module dynamically adjusts guidance during reasoning, keeping the trajectory aligned with textual supervision. The refined states are then bridged to the Diffusion Transformer's (DiT) denoising process. Finally, a terminal thought grounding module anchors the trajectory in textual supervision by aligning the final state with ground-truth answers. Together, these components let the MLLM text encoder deliver meticulously reasoned guidance that the DiT executes progressively, solving complex tasks step by step.
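As a rough, non-authoritative sketch of the refinement idea (the update rule, dimensions, and function names below are assumptions for illustration, not the paper's implementation), the iterative thought guidance loop can be pictured as repeatedly nudging a latent thought state toward the textual supervision signal:

```python
import numpy as np

def refine_thought(state, text_emb, steps=4, lr=0.5):
    """Toy iterative thought guidance (hypothetical update rule):
    repeatedly nudge the latent thought state toward the textual
    supervision embedding, one refinement per iteration."""
    trajectory = [state]
    for _ in range(steps):
        # gradient step on 0.5 * ||state - text_emb||^2
        state = state - lr * (state - text_emb)
        trajectory.append(state)
    return state, trajectory

rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)   # stand-in for the textual supervision signal
init = rng.normal(size=8)       # initial latent thought state
final, traj = refine_thought(init, text_emb)
# distance to the supervision signal shrinks monotonically
dists = [float(np.linalg.norm(s - text_emb)) for s in traj]
```

With a step size of 0.5 this toy update halves the distance to the supervision embedding at every iteration; the actual module would use a learned update rather than this fixed linear rule.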
Key Results
- Across diverse benchmarks such as Maze, TSP, VSP, and Sudoku, the EndoCoT framework achieved an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. This demonstrates the framework's enhanced reasoning capability in complex tasks.
- In the Maze task, the EndoCoT framework exhibited exceptional performance in spatial reasoning, effectively decomposing complex instructions into actionable denoising steps.
- In the Sudoku task, the EndoCoT framework significantly improved task completion accuracy through meticulous reasoning guidance, showcasing its potential in logical reasoning tasks.
Significance
The EndoCoT framework holds significant implications for both academia and industry. By addressing the insufficient reasoning depth of MLLMs, it provides more accurate guidance for complex tasks. This approach not only enhances the performance of existing diffusion models in complex tasks but also offers new perspectives for future multimodal reasoning research. Particularly in tasks requiring deep reasoning, such as spatial and logical reasoning, the EndoCoT framework demonstrates its unique advantages.
Technical Contribution
The technical contributions of EndoCoT lie in its fundamental differences from state-of-the-art methods. By introducing the iterative thought guidance module and terminal thought grounding module, the framework offers new theoretical guarantees and engineering possibilities. Unlike traditional single-step encoding methods, EndoCoT dynamically adjusts guidance during the reasoning process, ensuring the reasoning trajectory remains aligned with textual supervision. This approach not only increases reasoning depth but also enhances the model's adaptability in complex tasks.
Novelty
The novelty of the EndoCoT framework lies in being the first to introduce endogenous chain-of-thought reasoning into the reasoning process of multimodal large language models. Compared with previous work, it significantly improves reasoning depth and accuracy by iteratively refining latent thought states, offering new insights for complex tasks, especially those requiring step-by-step reasoning.
Limitations
- The EndoCoT framework may encounter performance bottlenecks in certain types of complex tasks, especially those requiring substantial computational resources.
- The iterative process of the framework may lead to increased computational overhead, affecting its application in resource-constrained environments.
- In some tasks, the terminal thought grounding module may not entirely eliminate reasoning errors, impacting the accuracy of the final results.
Future Work
Future research directions include optimizing the computational efficiency of the EndoCoT framework for application in resource-constrained environments. Additionally, exploring how to apply this framework to a broader range of complex tasks, such as natural language understanding and generation tasks, is an important research direction. Further work could also focus on improving the terminal thought grounding module to enhance the accuracy and consistency of the reasoning process.
AI Executive Summary
In recent years, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily serving as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: first, the MLLM text encoder exhibits insufficient reasoning depth, since single-step encoding fails to activate the Chain-of-Thought process that MLLMs need in order to provide accurate guidance for complex tasks; second, the guidance remains invariant during decoding, which prevents the Diffusion Transformer (DiT) from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To address these issues, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to keep the reasoning trajectory grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks step by step.
Extensive evaluations across diverse benchmarks such as Maze, TSP, VSP, and Sudoku demonstrate that the EndoCoT framework achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. This indicates a significant enhancement in the reasoning capability for complex tasks. The EndoCoT framework holds significant implications for both academia and industry. By addressing the insufficient reasoning depth of MLLMs, it provides more accurate guidance for complex tasks. This approach not only enhances the performance of existing diffusion models in complex tasks but also offers new perspectives for future multimodal reasoning research. Particularly in tasks requiring deep reasoning, such as spatial and logical reasoning, the EndoCoT framework demonstrates its unique advantages.
The technical contributions of EndoCoT lie in its fundamental differences from state-of-the-art methods. By introducing the iterative thought guidance module and terminal thought grounding module, the framework offers new theoretical guarantees and engineering possibilities. Unlike traditional single-step encoding methods, EndoCoT dynamically adjusts guidance during the reasoning process, ensuring the reasoning trajectory remains aligned with textual supervision. This approach not only increases reasoning depth but also enhances the model's adaptability in complex tasks.
The novelty of the EndoCoT framework lies in being the first to introduce endogenous chain-of-thought reasoning into the reasoning process of multimodal large language models. Compared with previous work, it significantly improves reasoning depth and accuracy by iteratively refining latent thought states, offering new insights for complex tasks, especially those requiring step-by-step reasoning.
However, the EndoCoT framework may encounter performance bottlenecks in certain types of complex tasks, especially those requiring substantial computational resources. The iterative process of the framework may lead to increased computational overhead, affecting its application in resource-constrained environments. In some tasks, the terminal thought grounding module may not entirely eliminate reasoning errors, impacting the accuracy of the final results. Future research directions include optimizing the computational efficiency of the EndoCoT framework for application in resource-constrained environments. Additionally, exploring how to apply this framework to a broader range of complex tasks, such as natural language understanding and generation tasks, is an important research direction. Further work could also focus on improving the terminal thought grounding module to enhance the accuracy and consistency of the reasoning process.
Deep Analysis
Background
In recent years, Multimodal Large Language Models (MLLMs) have made significant progress on complex tasks, particularly those requiring the fusion of multimodal information. Traditionally, MLLMs have been used as text encoders in conjunction with Diffusion Transformers (DiT) to tackle complex spatial reasoning tasks. However, these methods exhibit significant limitations in reasoning depth and dynamic guidance: previous research has focused mainly on enhancing encoding capabilities, while the guidance information stays unchanged throughout decoding, limiting performance on complex tasks. The EndoCoT framework aims to address these longstanding issues by introducing endogenous chain-of-thought reasoning to enhance the reasoning capabilities of MLLMs.
Core Problem
Existing multimodal large language models face two major issues when handling complex tasks: first, insufficient reasoning depth, where single-step encoding fails to activate the chain-of-thought process, leading to inaccurate guidance; second, invariant guidance during the decoding process, which hinders the diffusion model from progressively decomposing complex instructions into actionable denoising steps. These issues limit the model's performance in complex tasks, particularly those requiring step-by-step reasoning.
Innovation
The core innovations of the EndoCoT framework include:
1. Introducing an iterative thought guidance module that iteratively refines latent thought states, activating the reasoning potential of MLLMs. This method dynamically adjusts guidance information during the reasoning process, ensuring the reasoning trajectory remains aligned with textual supervision.
2. Applying a terminal thought grounding module that aligns the final state with ground-truth answers, ensuring the reasoning trajectory remains grounded in textual supervision. This method significantly improves reasoning depth and accuracy.
3. For the first time, introducing endogenous chain-of-thought reasoning into the reasoning process of multimodal large language models, providing new insights for solving complex tasks, especially those requiring step-by-step reasoning.
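The terminal grounding idea in point 2 can be sketched as a simple alignment objective. Everything below is a hypothetical stand-in (the paper does not specify this exact loss): the final refined state is penalised by its distance to an embedding of the ground-truth answer.

```python
import numpy as np

def grounding_loss(final_state, answer_emb):
    """Hypothetical terminal-grounding objective: penalise the
    distance between the final refined thought state and the
    embedding of the ground-truth answer."""
    return float(np.mean((final_state - answer_emb) ** 2))

answer = np.ones(4)                        # stand-in answer embedding
drifted = np.array([1.2, 0.8, 1.0, 1.1])   # a final state that drifted
aligned = np.ones(4)                       # a perfectly grounded state
loss_drifted = grounding_loss(drifted, answer)  # nonzero penalty
loss_aligned = grounding_loss(aligned, answer)  # zero penalty
```

Minimising such a loss pulls the end of the reasoning trajectory toward the supervised answer; the real module would operate on learned embeddings rather than raw vectors.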
Methodology
The detailed methodology of the EndoCoT framework includes the following steps:
- Iterative Thought Guidance Module: iteratively refines latent thought states to activate the reasoning potential of MLLMs. It takes the initial thought state as input, applies multiple refinement iterations, and outputs the updated thought state.
- Terminal Thought Grounding Module: aligns the final state with ground-truth answers so that the reasoning trajectory stays grounded in textual supervision. It takes the final thought state as input, applies the alignment operation, and outputs the aligned thought state.
- Bridging: feeds the updated thought state into the diffusion model's denoising process, so the model can progressively execute the guidance to solve complex tasks.
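The bridging step above can be illustrated with a toy denoising loop. This is a minimal sketch under stated assumptions (the names and the linear update are invented for illustration): the key contrast with invariant guidance is that each denoising step consumes its own refined thought state.

```python
import numpy as np

def denoise_with_dynamic_guidance(x, thought_states, strength=0.3):
    """Sketch (all names hypothetical): instead of one fixed text
    encoding, each denoising step consumes its own refined thought
    state, so the guidance evolves along the trajectory."""
    for guide in thought_states:        # one refined state per step
        x = x + strength * (guide - x)  # nudge sample toward current guidance
    return x

target = np.full(6, 2.0)  # stand-in for the desired output
# toy trajectory of progressively refined thought states
states = [target * w for w in (0.25, 0.5, 0.75, 1.0)]
x0 = np.zeros(6)          # fully noised starting sample
x_final = denoise_with_dynamic_guidance(x0, states)
```

Because the guidance sharpens as the states are refined, the sample is steered closer to the target at every step, mirroring how EndoCoT lets the DiT decompose an instruction across the denoising trajectory instead of receiving one frozen encoding.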
Experiments
The experimental design includes multiple benchmarks such as Maze, TSP, VSP, and Sudoku. The datasets used include publicly available standard datasets, with baseline methods being existing state-of-the-art multimodal large language models. Evaluation metrics include accuracy and reasoning depth. Key hyperparameters include the number of iterations and alignment precision. The experiments also include ablation studies to verify the contribution of each module.
Results
Experimental results show that the EndoCoT framework performs excellently across multiple benchmarks, achieving an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. In the Maze task, the EndoCoT framework exhibited exceptional performance in spatial reasoning, effectively decomposing complex instructions into actionable denoising steps. In the Sudoku task, the EndoCoT framework significantly improved task completion accuracy through meticulous reasoning guidance, showcasing its potential in logical reasoning tasks.
Applications
The application scenarios of the EndoCoT framework include complex tasks requiring deep reasoning, such as spatial and logical reasoning. In these tasks, the EndoCoT framework can provide more accurate guidance, improving task completion accuracy and efficiency. The industry can leverage this framework to develop more intelligent multimodal systems, enhancing automation levels.
Limitations & Outlook
The EndoCoT framework may encounter performance bottlenecks in certain types of complex tasks, especially those requiring substantial computational resources. The iterative process of the framework may lead to increased computational overhead, affecting its application in resource-constrained environments. In some tasks, the terminal thought grounding module may not entirely eliminate reasoning errors, impacting the accuracy of the final results. Future research directions include optimizing the computational efficiency of the EndoCoT framework for application in resource-constrained environments. Additionally, exploring how to apply this framework to a broader range of complex tasks, such as natural language understanding and generation tasks, is an important research direction.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. Traditional multimodal large language models are like a chef who follows a recipe step-by-step. They can follow instructions well, but if the recipe isn't detailed enough, they might struggle. The EndoCoT framework, on the other hand, is like an experienced chef who not only follows the recipe but also adjusts the cooking based on the ingredients and the guests' preferences. This framework continuously checks and adjusts the reasoning process to ensure each step accurately leads to a delicious final result. Just like this chef tastes and adjusts the seasoning throughout the cooking process, the EndoCoT framework continuously adjusts the reasoning process when solving complex tasks to ensure the final result is accurate. This way, even when faced with complex tasks, the EndoCoT framework can flexibly adapt and produce satisfactory results, just like the chef in the kitchen.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super complex puzzle game. Traditional game guides are like a walkthrough book that tells you what to do step-by-step, but if the guide isn't detailed enough, you might get stuck. The EndoCoT framework is like a super smart game assistant that not only tells you what to do but also gives new suggestions based on your progress and game changes. This framework is like an assistant that keeps learning and adjusting, checking your progress at every step to make sure you can smoothly finish the game. Just like in the game, you might face new challenges, but with this smart assistant, you can always find a way to solve them and win the game! Isn't that cool?
Glossary
Multimodal Large Language Models
A class of language models capable of processing multiple modalities of information (e.g., text, images, audio), typically used to solve complex tasks.
In this paper, MLLMs primarily serve as text encoders, combined with diffusion models to handle complex tasks.
Diffusion Models
A probabilistic model used for data generation that produces target data through a step-by-step denoising process.
In this paper, diffusion models are used to progressively decompose complex instructions into actionable denoising steps.
Chain-of-Thought
A reasoning process that solves complex tasks by generating and refining intermediate reasoning steps.
In this paper, the chain-of-thought is used to activate the reasoning potential of MLLMs.
Iterative Thought Guidance Module
A module that iteratively refines latent thought states to activate the reasoning potential of MLLMs.
In this paper, this module is used to dynamically adjust guidance information during the reasoning process.
Terminal Thought Grounding Module
A module that ensures the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers.
In this paper, this module is used to enhance the accuracy and consistency of the reasoning process.
Maze
A spatial reasoning task that requires the model to find the correct path in a complex maze.
In this paper, the Maze task is used to evaluate the spatial reasoning capabilities of the EndoCoT framework.
TSP (Traveling Salesman Problem)
A combinatorial optimization problem that requires finding the shortest path to visit a series of cities.
In this paper, the TSP is used to evaluate the reasoning capabilities of the EndoCoT framework.
VSP (Visual Search Problem)
A task that requires finding a specific target in a complex visual scene.
In this paper, the VSP is used to evaluate the visual reasoning capabilities of the EndoCoT framework.
Sudoku
A logical reasoning puzzle that requires filling in numbers so that each row, column, and 3×3 subgrid contains the numbers 1 to 9.
In this paper, the Sudoku task is used to evaluate the logical reasoning capabilities of the EndoCoT framework.
Accuracy
A metric for evaluating model performance, representing the proportion of correct predictions made by the model.
In this paper, accuracy is used to evaluate the performance of the EndoCoT framework across various benchmarks.
Open Questions (Unanswered questions from this research)
1. Current multimodal large language models still face issues with insufficient reasoning depth when handling complex tasks requiring deep reasoning. Although the EndoCoT framework improves reasoning depth by introducing endogenous chain-of-thought reasoning, the reasoning process may still be limited in certain specific tasks. Future research needs to further explore how to enhance the reasoning capabilities of models for broader task applications.
2. In resource-constrained environments, the computational overhead of the EndoCoT framework may become a bottleneck. Although the framework performs excellently in complex tasks, its iterative process may lead to increased computational resource consumption. Future research needs to explore how to optimize computational efficiency for application in resource-constrained environments.
3. The terminal thought grounding module may not completely eliminate reasoning errors in some tasks, impacting the accuracy of the final results. Although this module improves the accuracy and consistency of the reasoning process, errors may still exist in some complex tasks. Future research needs to explore how to improve this module to enhance the accuracy of the reasoning process.
4. The EndoCoT framework may encounter performance bottlenecks in certain types of complex tasks, especially those requiring substantial computational resources. Future research needs to explore how to optimize the framework's performance for broader task applications.
5. Although the EndoCoT framework performs excellently across multiple benchmarks, its performance in practical applications still needs further validation. Future research needs to explore how to apply this framework to a broader range of complex tasks, such as natural language understanding and generation tasks.
Applications
Immediate Applications
Complex Task Solving
The EndoCoT framework can be used to solve complex tasks requiring deep reasoning, such as spatial and logical reasoning. By providing more accurate guidance, the framework can improve task completion accuracy and efficiency.
Multimodal System Development
The industry can leverage the EndoCoT framework to develop more intelligent multimodal systems, enhancing automation levels. These systems can perform excellently in tasks requiring multimodal information fusion.
Enhanced Reasoning Capabilities
The EndoCoT framework can be used to enhance the reasoning capabilities of existing multimodal large language models, allowing them to perform more effectively in complex tasks.
Long-term Vision
Natural Language Understanding and Generation
In the future, the EndoCoT framework can be applied to natural language understanding and generation tasks, improving model performance in these tasks.
Widespread Application in Intelligent Systems
As the EndoCoT framework continues to be optimized, it can be applied to a wider range of intelligent systems in the future, enhancing their performance in complex tasks.
Abstract
Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) the MLLM text encoder exhibits insufficient reasoning depth, since single-step encoding fails to activate the Chain-of-Thought process that is essential for MLLMs to provide accurate guidance for complex tasks; (ii) the guidance remains invariant during decoding, which prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to keep the reasoning trajectory grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) show an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.