MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

TL;DR

MM-WebAgent uses hierarchical planning and self-reflection to generate consistent multimodal webpages, improving layout and style coherence.

cs.CV · Advanced · 2026-04-17
Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo
multimodal generation · web design · AI-generated content · hierarchical framework · self-reflection mechanism

Key Findings

Methodology

MM-WebAgent employs a hierarchical agentic framework that coordinates multimodal webpage generation through hierarchical planning and iterative self-reflection. The approach integrates global layout planning with local element generation to ensure visual consistency and semantic coherence. The framework comprises three components: global layout planning, local element planning, and a multi-level self-reflection mechanism, responsible respectively for the overall design of the page structure, the semantic and stylistic generation of individual elements, and final layout and style optimization.
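The two plan levels described above can be sketched as simple data structures. This is an illustrative simplification, not the paper's implementation; the field names (`sections`, `style`, `context`, `meta`) are assumptions chosen to mirror the prose.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the two planning levels; field names are
# illustrative, not taken from the MM-WebAgent codebase.

@dataclass
class LocalElementPlan:
    element_type: str          # e.g. "image", "video", "chart"
    context: str               # surrounding-section context passed to the generator
    meta: dict = field(default_factory=dict)  # functional role and style guidance

@dataclass
class GlobalLayoutPlan:
    sections: list             # section hierarchy and spatial organization
    style: dict                # page-level style attributes (palette, typography, ...)
    placeholders: list = field(default_factory=list)  # slots filled by local plans

# A global plan is drafted first; local plans are then attached to its
# placeholders so each generator sees the shared page-level style.
plan = GlobalLayoutPlan(
    sections=["hero", "gallery", "footer"],
    style={"palette": "dark", "font": "sans-serif"},
)
plan.placeholders.append(
    LocalElementPlan("image", context="hero banner", meta={"role": "visual anchor"})
)
```

The key design point the prose makes is that local plans are derived from, and constrained by, the global plan rather than generated in isolation.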

Key Results

  • Experimental results show that MM-WebAgent outperforms baseline methods in multimodal element generation and integration. In the MM-WebAgent-Bench benchmark, it achieved an average score of 0.75, significantly higher than code-generation and agent-based baselines, particularly excelling in the generation and integration of multimodal elements such as images, videos, and charts.
  • In the WebGen-Bench test, although that benchmark primarily evaluates functional backend code, MM-WebAgent still demonstrated competitive performance, showing that its strengths extend beyond purely multimodal generation tasks.
  • Ablation studies indicate that the introduction of hierarchical planning and self-reflection mechanisms significantly improves the coordination of multimodal content and overall performance, particularly in local metrics such as images and videos.

Significance

The introduction of MM-WebAgent addresses the issues of style inconsistency and poor global coherence in existing multimodal webpage generation methods. By incorporating hierarchical planning and self-reflection mechanisms, this approach holds significant implications for both academia and industry. It not only enhances the automation of web design but also provides new insights for the generation and integration of multimodal content, especially in applications requiring high visual consistency and semantic coherence.

Technical Contribution

MM-WebAgent offers several key technical contributions. Firstly, it views multimodal generation as a hierarchical planning and optimization process, distinct from traditional code-generation methods. Secondly, it introduces a multi-level self-reflection mechanism capable of iteratively optimizing content and layout at both local and global levels. Additionally, the framework proposes a new benchmark and evaluation protocol for multimodal webpage generation, providing tools for systematic performance assessment.

Novelty

The novelty of MM-WebAgent lies in its combination of a hierarchical agentic framework and self-reflection mechanism. This approach is the first to treat multimodal webpage generation as a structured plan-and-refine process, offering higher visual and semantic consistency compared to existing code-generation and agent-based methods.

Limitations

  • Currently, MM-WebAgent may encounter performance bottlenecks when handling extremely complex webpage layouts, especially in scenarios requiring a large number of multimodal elements.
  • Although the self-reflection mechanism improves content quality, it may increase computational overhead in some cases, affecting generation efficiency.
  • The method may require further optimization when handling specific types of multimodal content, such as dynamic videos, to improve generation smoothness and consistency.

Future Work

Future research directions include optimizing the self-reflection mechanism to reduce computational overhead, extending the framework to support more types of multimodal content (e.g., 3D models), and validating it on larger-scale real-world datasets. Additionally, exploring integration with other generative models to enhance generation quality and efficiency is an important direction.

AI Executive Summary

In modern web design, the rapid development of AI-generated content (AIGC) tools enables the on-demand creation of images, videos, and visualizations. However, directly integrating these tools into automated webpage generation often results in style inconsistency and poor global coherence, as elements are generated in isolation.

MM-WebAgent proposes a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. The framework jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages.

Methodologically, MM-WebAgent employs global layout planning to define the page structure and style attributes, upon which local element plans are constructed to guide downstream generators in producing semantically appropriate and stylistically compatible assets. To emulate iterative design, the framework introduces a self-reflection mechanism that optimizes the generated webpage through local, context, and global levels of self-reflection.

Experimental results demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines in multimodal element generation and integration, particularly excelling in the generation and integration of images, videos, and charts. In the MM-WebAgent-Bench benchmark, it achieved an average score of 0.75, significantly higher than other methods.

This research holds significant implications for both academia and industry, providing new insights for the generation and integration of multimodal content, especially in applications requiring high visual consistency and semantic coherence. However, the current approach may encounter performance bottlenecks when handling extremely complex webpage layouts. Future research directions include optimizing the self-reflection mechanism to reduce computational overhead and extending the framework to support more types of multimodal content.

Deep Analysis

Background

With the advancement of AI technologies, the field of web design is continuously evolving. Traditional webpage generation methods primarily rely on code-generation techniques, synthesizing HTML/CSS code by parsing user natural language requests. However, such methods have limitations in handling multimodal content, as webpages are not purely text and code but include heterogeneous elements like images, videos, and charts. These elements' content, style, and geometry must cohere with the global layout and semantic intent. Existing pipelines typically populate these elements via retrieval or placeholders, generating or inserting assets independently, which often leads to style inconsistency and poor global coherence. To address these challenges, MM-WebAgent proposes a novel hierarchical multimodal webpage generation framework that optimizes multimodal content generation and integration through hierarchical planning and self-reflection mechanisms.

Core Problem

The core problem in multimodal webpage generation is how to coordinate the generation of heterogeneous elements to ensure they are visually and semantically consistent with the global layout. Traditional code-generation methods often treat multimodal assets as static or externally provided, lacking the ability to generate novel, semantically aligned, and stylistically coherent multimodal content. Additionally, existing webpage generation methods struggle with complex webpage layouts, leading to style inconsistency and poor global coherence. These issues not only affect the visual appeal of webpages but also impact user experience. Thus, achieving coordinated generation and integration of elements in multimodal webpage generation is a pressing challenge.

Innovation

The core innovations of MM-WebAgent lie in its combination of a hierarchical agentic framework and self-reflection mechanism. Firstly, the method views multimodal generation as a hierarchical planning and optimization process, distinct from traditional code-generation methods. Through global layout planning and local element planning, it ensures that multimodal components are natively integrated into the page structure. Secondly, it introduces a multi-level self-reflection mechanism capable of iteratively optimizing content and layout at both local and global levels. This mechanism performs targeted edits by analyzing the generated webpage screenshots and HTML code, ensuring consistent layout, spacing, and visual style. Additionally, the framework proposes a new benchmark and evaluation protocol for multimodal webpage generation, providing tools for systematic performance assessment.
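One reflection step of the kind described above (analyze a screenshot and the HTML, then apply targeted edits) can be sketched as a critique-then-edit loop. This is a minimal illustration under stated assumptions: `critique` stands in for a vision-language critic, and its `(selector, fix)` issue format is hypothetical, not the paper's protocol.

```python
# Hedged sketch of one targeted-edit reflection step. A real critic would be
# a vision-language model inspecting the rendered screenshot; here `critique`
# is a stub that flags a single hypothetical style issue.

def critique(screenshot, html):
    """Return a list of (css_selector, style_fix) issues found on the page."""
    issues = []
    if "margin" not in html:  # stand-in check; a VLM would judge the screenshot
        issues.append(("body", "margin:0"))
    return issues

def apply_edits(html, issues):
    """Apply each targeted fix as an inline <style> rule in the page head."""
    rules = "".join(f"{sel}{{{fix}}}" for sel, fix in issues)
    if not rules:
        return html
    return html.replace("</head>", f"<style>{rules}</style></head>")

def reflect_once(screenshot, html):
    """One self-reflection pass: critique the page, then edit it in place."""
    return apply_edits(html, critique(screenshot, html))

html = "<html><head></head><body>...</body></html>"
html = reflect_once(screenshot=None, html=html)
```

The point of the pattern is that edits are targeted (per selector, per issue) rather than a full regeneration, which is what lets the mechanism preserve already-correct layout, spacing, and style.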

Methodology

The hierarchical agentic framework of MM-WebAgent includes the following key steps:


  • Global Layout Planning: Defines the section hierarchy, spatial organization, and page-level style attributes, introducing explicit placeholders for multimodal components.

  • Local Element Planning: Constructs local plans based on the global context to guide content generation. Each local plan includes context information and meta attributes, describing the functional role and style guidance of elements.

  • Plan Execution: Converts the global layout plan into the HTML/CSS structure of the webpage and generates corresponding assets using designated generation tools.

  • Self-Reflection Mechanism: Optimizes the generated webpage through local, context, and global levels of self-reflection, ensuring each element is semantically correct and visually sound.
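The four steps above can be tied together in an end-to-end sketch. All helper functions here (`plan_layout`, `plan_elements`, `generate_asset`, `render_html`, `reflect`) are hypothetical stand-ins for the LLM and AIGC tool calls the framework would actually make; only the control flow mirrors the pipeline described in the text.

```python
# Hedged end-to-end sketch of the four-step pipeline; every helper is a
# placeholder for a model or tool call, not MM-WebAgent's actual code.

def plan_layout(request):
    """Step 1: global layout planning (sections, style, element placeholders)."""
    return {"sections": ["hero", "gallery"],
            "style": {"palette": "dark"},
            "placeholders": [{"type": "image", "slot": "hero"}]}

def plan_elements(layout):
    """Step 2: local element planning, attaching context and meta attributes."""
    return [dict(p, context=layout["style"], meta={"role": p["slot"]})
            for p in layout["placeholders"]]

def generate_asset(local_plan):
    """Step 3 (part): invoke the designated generation tool for one element."""
    return f"<{local_plan['type']} slot='{local_plan['slot']}'/>"

def render_html(layout, assets):
    """Step 3 (part): convert the plan plus generated assets into HTML."""
    return f"<html><body>{''.join(assets)}</body></html>"

def reflect(html, level):
    """Step 4: one self-reflection pass at a given level (local/context/global)."""
    return html  # a real agent would inspect a screenshot + HTML and emit edits

def generate_webpage(request, rounds=1):
    layout = plan_layout(request)
    local_plans = plan_elements(layout)
    html = render_html(layout, [generate_asset(lp) for lp in local_plans])
    for _ in range(rounds):
        for level in ("local", "context", "global"):
            html = reflect(html, level)
    return html

page = generate_webpage("portfolio site")
```

Note the ordering: reflection runs after execution, and cycles through the local, context, and global levels in turn, matching the multi-level mechanism described above.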

Experiments

The experimental design includes performance evaluation on the MM-WebAgent-Bench and WebGen-Bench benchmarks. MM-WebAgent-Bench tests the diversity and quality of multimodal webpage generation, including informational, analytical, creative, and commercial use cases. WebGen-Bench primarily tests functional backend code, logic, and component completeness. Baseline methods include code-generation and agent-based methods such as OpenAI-GPT-5.1 and Qwen2.5-Coder. Evaluation metrics include global layout, style coherence, aesthetics, and the quality and integration of images, videos, and charts. Ablation studies are conducted to verify the effectiveness of hierarchical planning and self-reflection mechanisms.

Results

Experimental results demonstrate that MM-WebAgent outperforms baseline methods in multimodal element generation and integration. In the MM-WebAgent-Bench benchmark, it achieved an average score of 0.75, significantly higher than code-generation and agent-based baselines, particularly excelling in the generation and integration of multimodal elements such as images, videos, and charts. In the WebGen-Bench test, although that benchmark primarily evaluates functional backend code, MM-WebAgent still demonstrated competitive performance, showing that its strengths extend beyond purely multimodal generation tasks. Ablation studies indicate that the introduction of hierarchical planning and self-reflection mechanisms significantly improves the coordination of multimodal content and overall performance, particularly on local metrics such as images and videos.

Applications

MM-WebAgent has potential value in multiple application scenarios. Firstly, in web design, the method can automatically generate visually consistent and semantically coherent webpages, reducing the workload of manual design. Secondly, in e-commerce platforms, it can generate personalized product display pages, enhancing the user shopping experience. Additionally, in education and training, the method can be used to generate interactive learning materials, improving learning outcomes. To achieve these applications, it is essential to ensure that the generated content is visually and semantically aligned with user needs and provides a good user experience.

Limitations & Outlook

Despite its outstanding performance in multimodal webpage generation, MM-WebAgent has some limitations. Firstly, the method may encounter performance bottlenecks when handling extremely complex webpage layouts, especially in scenarios requiring a large number of multimodal elements. Secondly, although the self-reflection mechanism improves content quality, it may increase computational overhead in some cases, affecting generation efficiency. Additionally, the method may require further optimization when handling specific types of multimodal content, such as dynamic videos, to improve generation smoothness and consistency. Future research directions include optimizing the self-reflection mechanism to reduce computational overhead, extending the framework to support more types of multimodal content, and validating it on larger-scale real-world datasets.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen preparing a meal. You need to gather various ingredients like vegetables, meat, and spices, then follow a certain order to cook them, finally presenting a delicious dish. MM-WebAgent is like a smart chef that not only helps you prepare the ingredients but also plans the entire cooking process to ensure each dish's taste and presentation are perfect.

First, MM-WebAgent acts like a chef by planning the entire layout of the dish, which is the global layout of the webpage. It decides the placement and size of each ingredient (web element), ensuring the overall aesthetic and harmony.

Next, it creates detailed cooking plans for each ingredient, known as local element planning. Based on these plans, MM-WebAgent generates the specific content of each element, such as images, videos, and charts, ensuring they are visually and semantically consistent with the overall layout.

Finally, MM-WebAgent acts like an experienced chef, constantly tasting and adjusting the dish's flavor. Through the self-reflection mechanism, it iteratively optimizes the generated webpage, ensuring each element is in its best state, ultimately presenting a perfect webpage.

ELI14 (Explained like you're 14)

Hey there! Today, I'm going to tell you about something super cool called MM-WebAgent. Imagine you're playing a game where you need to design an awesome webpage. This webpage needs to look great and have all sorts of cool images and videos. Sounds a bit tricky, right?

Don't worry, MM-WebAgent is like your super helper. It helps you plan the entire layout of the webpage, just like building a Lego castle. Each Lego block (web element) has its own place and size, making the whole castle look harmonious and beautiful.

Then, MM-WebAgent designs detailed plans for each Lego block. It generates the specific content of each block, like images, videos, and charts, ensuring they match the theme of the castle.

Finally, MM-WebAgent acts like a careful architect, constantly checking and adjusting the structure of the castle. Through the self-reflection mechanism, it optimizes the generated webpage, ensuring each element is perfect. This way, you can easily design an awesome webpage!

Glossary

Hierarchical Agentic Framework

A framework that views multimodal generation as a hierarchical planning and optimization process. It ensures that multimodal components are natively integrated into the page structure through global layout planning and local element planning.

Used in MM-WebAgent to coordinate multimodal webpage generation.

Iterative Self-Reflection

A mechanism that performs targeted edits by analyzing the generated webpage screenshots and HTML code, ensuring consistent layout, spacing, and visual style.

Used to optimize the generated webpage, ensuring each element is semantically correct and visually sound.

Multimodal Generation

The process of generating content that includes multiple forms such as images, videos, and text.

Used in MM-WebAgent to generate multimodal content for webpages.

Global Layout Planning

The process of defining the section hierarchy, spatial organization, and page-level style attributes of a webpage.

Used in MM-WebAgent to ensure that multimodal components are natively integrated into the page structure.

Local Element Planning

The process of constructing local plans based on the global context to guide content generation.

Used in MM-WebAgent to generate semantically appropriate and stylistically compatible assets.

MM-WebAgent-Bench

A benchmark for evaluating the performance of multimodal webpage generation.

Used to verify MM-WebAgent's performance in multimodal element generation and integration.

WebGen-Bench

A benchmark primarily testing functional backend code, logic, and component completeness.

Used to verify MM-WebAgent's competitiveness in complex multimodal generation tasks.

OpenAI-GPT-5.1

A language model used for generating webpage layouts and multimodal elements.

Used in MM-WebAgent to implement hierarchical planning and self-reflection mechanisms.

Qwen2.5-Coder

A language model used for code generation.

Used as one of the baseline methods in the experiments.

Gemini-2.5-Pro

A language model used for code generation and multimodal content generation.

Used as one of the baseline methods in the experiments.

Open Questions (Unanswered questions from this research)

  1. Current multimodal webpage generation methods still face performance bottlenecks when handling complex layouts. Existing technologies often lead to style inconsistency and poor global coherence when generating a large number of multimodal elements. To overcome these challenges, more efficient planning and optimization algorithms need to be developed to improve visual consistency and semantic coherence.
  2. Although the self-reflection mechanism improves content quality, it may increase computational overhead in some cases, affecting generation efficiency. Future research needs to explore more efficient self-reflection algorithms to reduce computational overhead and improve generation efficiency.
  3. Existing multimodal generation methods may require further optimization when handling dynamic content (e.g., videos) to improve generation smoothness and consistency. Future research can explore more advanced video generation techniques to enhance generation quality.
  4. In multimodal webpage generation, effectively integrating different types of multimodal content (e.g., 3D models) remains an open question. Novel integration methods need to be developed to improve the diversity and quality of generation.
  5. Existing multimodal webpage generation benchmarks may lack coverage of specific application scenarios when evaluating generation quality. Future research can develop more comprehensive benchmarks to better assess the performance of generation methods.

Applications

Immediate Applications

Automated Web Design

MM-WebAgent can be used to automatically generate visually consistent and semantically coherent webpages, reducing the workload of manual design and increasing design efficiency.

Personalized Product Display

In e-commerce platforms, the method can generate personalized product display pages, enhancing the user shopping experience.

Interactive Learning Materials

In education and training, MM-WebAgent can be used to generate interactive learning materials, improving learning outcomes.

Long-term Vision

Intelligent Content Generation Platform

MM-WebAgent can serve as a core component of an intelligent content generation platform, supporting various types of content generation and integration, enhancing the platform's intelligence level.

Multimodal Human-Computer Interaction System

The method can be used to develop multimodal human-computer interaction systems, supporting more natural and efficient human-computer interaction, improving user experience.

Abstract

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.

cs.CV cs.AI cs.CL