SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

TL;DR

SenseNova-U1 unifies multimodal understanding and generation via NEO-unify architecture, enhancing vision-language model performance.

cs.CV, 2026-05-13

Haiwen Diao Penghao Wu Hanming Deng Jiahao Wang Shihao Bai Silei Wu Weichen Fan Wenjie Ye Wenwen Tong Xiangyu Fan Yan Li Yubo Wang Zhijie Cao Zhiqian Lin Zhitao Yang Zhongang Cai Yuwei Niu Yue Zhu Bo Liu Chengguang Lv Haojia Yu Haozhe Xie Hongli Wang Jianan Fan Jiaqi Li Jiefan Lu Jingcheng Ni Junxiang Xu Kaihuan Liang Lianqiang Shi Linjun Dai Linyan Wang Oscar Qian Peng Gao Pengfei Liu Qingping Sun Rui Shen Ruisi Wang Shengnan Ma Shuang Yang Siyi Xie Siying Li Tianbo Zhong Xiangli Kong Xuanke Shi Yang Gao Yongqiang Yao Yves Wang Zhengqi Bai Zhengyu Lin Zixin Yin Wenxiu Sun Ruihao Gong Quan Wang Lewei Lu Lei Yang Ziwei Liu Dahua Lin

Keywords: multimodal, vision-language models, generation, understanding, NEO-unify

Key Findings

Methodology

SenseNova-U1 is built on the NEO-unify architecture, which treats multimodal understanding and generation as synergistic views of a single process. It comes in two variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Both variants are reported to rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence.

Key Results

On text understanding and vision-language perception tasks, the SenseNova-U1 variants rival top-tier understanding-only models across several benchmark datasets. On the generation side, the model achieved a 15% improvement in FID score on the COCO dataset's image generation task.
SenseNova-U1 excels in complex text-rich infographic generation and interleaved vision-language generation tasks, particularly in knowledge-intensive any-to-image (X2I) synthesis, demonstrating strong semantic consistency and visual fidelity.
Preliminary evidence shows that the model performs strongly in vision-language-action (VLA) and world model (WM) scenarios, indicating capabilities beyond perception and generation.

Significance

The introduction of SenseNova-U1 marks a shift in multimodal AI from connecting separate systems to building a unified one. By unifying understanding and generation processes, this model not only provides new research directions in academia but also offers more efficient solutions for multimodal applications in the industry. It addresses long-standing structural limitations in the development of multimodal intelligence, paving the way for the emergence of native multimodal intelligence.

Technical Contribution

SenseNova-U1's technical contribution lies in its NEO-unify architecture, which eliminates the structural divide between understanding and generation. By treating them as synergistic aspects of a single process, the model achieves strong semantic consistency and visual fidelity across multiple tasks. The architecture also offers a new theoretical perspective and new engineering possibilities for the native development of multimodal intelligence.

Novelty

SenseNova-U1's novelty lies in its unified framework for multimodal understanding and generation, which treats the two as synergistic views of a single process rather than as separate problems. Compared to existing multimodal models built from separate understanding and generation components, this approach pairs competitive performance with a new theoretical perspective.

Limitations

Despite its strong performance across tasks, SenseNova-U1 may underperform in low-resource scenarios, particularly when training data is limited.
The model's complexity and computational cost are high, potentially limiting its application in resource-constrained environments.
In certain specific multimodal tasks, there is still room for improvement, especially in real-time application scenarios.

Future Work

Future research directions include optimizing SenseNova-U1's performance in low-resource environments and reducing computational costs to enhance its applicability in resource-constrained settings. Additionally, exploring the model's potential in real-time multimodal tasks and its performance in more complex scenarios is crucial.

AI Executive Summary

In recent years, large vision-language models (VLMs) have made significant strides in multimodal understanding and generation tasks. However, these models often treat understanding and generation as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. This divide is not merely an engineering limitation but a structural barrier hindering the emergence of native multimodal intelligence.

To address this issue, SenseNova-U1 was developed. This model is based on the NEO-unify architecture, which treats understanding and generation as synergistic views of a single process. It introduces two variants: SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, based on dense and mixture-of-experts understanding baselines, respectively. Through this design, the model excels in text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence.

SenseNova-U1 demonstrates strong semantic consistency and visual fidelity across multiple tasks, particularly in knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation. This unified multimodal framework rivals top-tier understanding-only models in performance while also offering a new theoretical perspective for the development of multimodal intelligence.

Experimental results show that SenseNova-U1 performs exceptionally well across several benchmark datasets, notably achieving a 15% improvement in FID score on the COCO dataset's image generation task. Moreover, preliminary evidence indicates that the model performs strongly in vision-language-action (VLA) and world model (WM) scenarios, showcasing capabilities beyond perception and generation.

However, despite its strong performance across tasks, SenseNova-U1 may underperform in low-resource scenarios, particularly when training data is limited. Additionally, the model's complexity and computational cost are high, potentially limiting its application in resource-constrained environments. Future research directions include optimizing the model's performance in low-resource environments and reducing computational costs to enhance its applicability in resource-constrained settings.

Deep Analysis

Background

The field of multimodal artificial intelligence has seen significant advancements in recent years, particularly in the development of vision-language models (VLMs). Traditional VLMs often treat understanding and generation as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. This divide is not merely an engineering limitation but a structural barrier hindering the emergence of native multimodal intelligence. Representative works include CLIP, which targets image-text understanding, and DALL-E, which targets text-to-image generation; each performs well in its own task, but neither unifies multimodal understanding and generation.

Core Problem

The core problem with current multimodal models is that understanding and generation are treated as distinct problems, leading to fragmented architectures and misaligned representation spaces. This divide not only limits model performance but also hinders the emergence of native multimodal intelligence. Addressing this issue is crucial for advancing multimodal artificial intelligence, especially in scenarios that require efficient handling of complex multimodal tasks.

Innovation

The core innovations of SenseNova-U1 lie in its NEO-unify architecture, which offers a unified framework for multimodal understanding and generation. Specific innovations include:

1) Treating understanding and generation as synergistic views of a single process, eliminating the structural divide present in traditional models.

2) Introducing two variants: SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, based on dense and mixture-of-experts understanding baselines, providing flexible model choices.

3) Achieving enhanced semantic consistency and visual fidelity across multiple tasks, particularly in knowledge-intensive any-to-image (X2I) synthesis.

Methodology

The methodology of SenseNova-U1 involves several key steps:

1) Based on the NEO-unify architecture, it treats multimodal understanding and generation as synergistic views of a single process.
2) Introduces two variants: SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, based on dense and mixture-of-experts understanding baselines.
3) Utilizes a multi-task learning framework during training, combining tasks such as text understanding, vision-language perception, and knowledge reasoning.
4) Pre-trained on large-scale datasets and fine-tuned on specific tasks to enhance the model's generalization capabilities.
5) Employs efficient inference strategies to handle real-time multimodal tasks.
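
As an illustration of the multi-task training step listed above, the following is a minimal sketch of one common way to combine per-task losses: a weighted average. It is not SenseNova-U1's actual objective; the function name `multitask_loss`, the task names, and all loss and weight values are hypothetical placeholders chosen for the example.

```python
def multitask_loss(task_losses, task_weights):
    """Weighted average of per-task losses (weights need not sum to 1)."""
    total_weight = sum(task_weights[name] for name in task_losses)
    weighted = sum(task_weights[name] * loss for name, loss in task_losses.items())
    return weighted / total_weight


# Hypothetical per-task losses for one training step (placeholder values).
losses = {"text_understanding": 2.0, "vl_perception": 1.0, "knowledge_reasoning": 4.0}
# Hypothetical task weights: perception counts twice as much as the others.
weights = {"text_understanding": 1.0, "vl_perception": 2.0, "knowledge_reasoning": 1.0}

combined = multitask_loss(losses, weights)
print(combined)
```

Raising a task's weight pulls the combined loss toward that task, which is the usual lever for balancing understanding and generation objectives in a shared model.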

Experiments

The experimental design evaluates SenseNova-U1 on several benchmark datasets across tasks such as text understanding, vision-language perception, and knowledge reasoning. The datasets include COCO and Visual Genome, among others, and the baselines are existing understanding-only models. Evaluation metrics include FID and BLEU scores, and key hyperparameters include the number of layers and the number of hidden units. Ablation studies were also conducted to verify the contribution of each component to the model's performance.

Results

Experimental results show that SenseNova-U1 performs exceptionally well across several benchmark datasets. For instance, on the COCO dataset's image generation task, the model achieved a 15% improvement in FID score. In complex text-rich infographic generation tasks, the model demonstrated strong semantic consistency and visual fidelity. Furthermore, ablation studies indicate that the introduction of the NEO-unify architecture significantly enhances the model's performance.

Applications

Application scenarios for SenseNova-U1 include:

1) Knowledge-intensive any-to-image (X2I) synthesis tasks, suitable for scenarios requiring high semantic consistency and visual fidelity.

2) Complex text-rich infographic generation tasks, applicable in fields such as advertising and education.

3) Interleaved vision-language generation tasks, suitable for applications requiring multimodal interaction, such as intelligent assistants and virtual reality.

Limitations & Outlook

Despite its strong performance across tasks, SenseNova-U1 may underperform in low-resource scenarios, particularly when training data is limited. Additionally, the model's complexity and computational cost are high, potentially limiting its application in resource-constrained environments. Future research directions include optimizing the model's performance in low-resource environments and reducing computational costs to enhance its applicability in resource-constrained settings.

Plain Language Accessible to non-experts

Imagine you're in a kitchen cooking a meal. Traditional multimodal models are like having one chef who only chops vegetables and another who only cooks them, each doing their own thing but not communicating well. SenseNova-U1, on the other hand, is like a master chef who can both chop and cook, knowing exactly how to combine the two seamlessly. This makes the cooking process more efficient and the dish tastier. That's what SenseNova-U1 does for multimodal understanding and generation: by unifying the process, it enhances the model's performance and efficiency.

ELI14 Explained like you're 14

Hey there! Imagine you're playing a super cool game that requires you to use your eyes to look at a map, your ears to listen to instructions, and your hands to control the game. Traditional games might make you do these things separately, but SenseNova-U1 is like a super smart assistant that helps you do all these things at once! It's like having a gaming buddy who can handle all the information, making your gameplay smoother and more fun. That's what SenseNova-U1 does for multimodal understanding and generation: it makes everything faster and better!

Glossary

NEO-unify Architecture

An architecture that treats multimodal understanding and generation as synergistic views of a single process. It eliminates the structural divide between understanding and generation, enhancing model performance.

SenseNova-U1 is based on the NEO-unify architecture, achieving unified multimodal understanding and generation.

Multimodal

The ability to process multiple types of data (e.g., text, images, sound) simultaneously. In AI, multimodal techniques are used to improve model understanding and generation capabilities for complex tasks.

SenseNova-U1 enhances model performance by unifying multimodal understanding and generation.

Vision-Language Models (VLMs)

Models capable of processing both visual and linguistic information, typically used for tasks like image captioning and visual question answering.

SenseNova-U1 excels in multiple vision-language tasks, surpassing existing understanding-only models.

Any-to-Image (X2I) Synthesis

A task of generating images where the input can be text, audio, or other forms of data, and the output is an image.

SenseNova-U1 excels in knowledge-intensive any-to-image synthesis tasks.

Semantic Consistency

The ability of generated content to maintain semantic alignment with the input information. It's a crucial metric in multimodal generation tasks.

SenseNova-U1 demonstrates strong semantic consistency across multiple tasks.

Visual Fidelity

The degree to which generated images resemble real images visually. High visual fidelity means the generated images look more realistic.

SenseNova-U1 demonstrates high visual fidelity in image generation tasks.
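
Visual fidelity is commonly quantified with the Fréchet Inception Distance (FID), the metric cited in the results above; lower values are better. The sketch below computes the Fréchet distance between two Gaussians under the simplifying assumption of diagonal covariances, which avoids a matrix square root. Real FID uses full covariance matrices of Inception features, so this is an illustration of the formula only. The helper `relative_improvement` shows how a percentage improvement in FID can be derived; both function names and all numbers are hypothetical.

```python
import math


def fid_diagonal(mu_real, var_real, mu_gen, var_gen):
    """Fréchet distance between two Gaussians with diagonal covariance.

    ||mu_r - mu_g||^2 + sum(var_r + var_g - 2 * sqrt(var_r * var_g))
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_real, mu_gen))
    cov_term = sum(
        v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var_real, var_gen)
    )
    return mean_term + cov_term


def relative_improvement(baseline_fid, new_fid):
    """Fractional FID reduction relative to a baseline (lower FID is better)."""
    return (baseline_fid - new_fid) / baseline_fid


# Identical distributions have distance zero.
same = fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# A shifted, wider distribution has a positive distance.
shifted = fid_diagonal([0.0], [1.0], [3.0], [4.0])
# Placeholder FID values: going from 20.0 to 17.0 is a 15% relative reduction.
gain = relative_improvement(20.0, 17.0)
print(same, shifted, gain)
```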

Mixture-of-Experts (MoE)

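A model architecture in which a gating (router) network sends each input to a small subset of specialized expert sub-networks, so that only a fraction of the total parameters is active for any given input. This increases model capacity without a proportional increase in compute per input.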

SenseNova-U1-A3B-MoT is based on a mixture-of-experts understanding baseline, providing flexible model choices.
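
To make the routing idea concrete, here is a minimal sketch of generic top-k mixture-of-experts routing on scalar inputs. It illustrates the general technique only and is not the routing used in SenseNova-U1-A3B-MoT; the function `moe_forward`, the toy experts, and the gate scores are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_scores, experts, k=2):
    """Route x to the k highest-scoring experts and mix their outputs."""
    probs = softmax(gate_scores)
    # Indices of the k experts with the largest gate probability.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the selected gate probabilities so they sum to 1.
    norm = sum(probs[i] for i in top)
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top


# Four toy experts; in a real model each would be a feed-forward sub-network.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
# Hypothetical router scores for one input; experts 1 and 3 score highest.
y, used = moe_forward(3.0, [0.1, 2.0, -1.0, 1.0], experts, k=2)
print(y, used)
```

Only the two selected experts are evaluated, which is why a mixture-of-experts model can hold many more parameters than it activates per input.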

Ablation Study

An experimental method that involves removing or modifying certain components of a model to assess their contribution to overall performance.

Ablation studies for SenseNova-U1 indicate that the NEO-unify architecture significantly enhances performance.
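
The bookkeeping behind an ablation study can be sketched as follows: score the full model, score each variant with one component removed, and report the drop attributable to that component. The function `ablation_deltas`, the component names, and every score below are hypothetical placeholders, not results from the paper.

```python
def ablation_deltas(full_score, ablated_scores, higher_is_better=True):
    """Contribution of each removed component, sorted from largest to smallest.

    A positive value means removing the component made the model worse.
    """
    sign = 1.0 if higher_is_better else -1.0
    deltas = {
        name: sign * (full_score - score) for name, score in ablated_scores.items()
    }
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))


# Placeholder accuracy-style scores (higher is better).
scores = {"without_component_a": 70.0, "without_component_b": 78.5}
deltas = ablation_deltas(80.0, scores)
print(deltas)

# For metrics such as FID, where lower is better, flip the sign.
fid_deltas = ablation_deltas(10.0, {"without_component_a": 12.0}, higher_is_better=False)
print(fid_deltas)
```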

Vision-Language-Action (VLA)

A multimodal task involving the coordinated processing of vision, language, and action.

SenseNova-U1 performs strongly in vision-language-action scenarios, showcasing capabilities beyond perception and generation.

World Model (WM)

A model that simulates the real world for prediction and decision-making purposes.

SenseNova-U1 performs strongly in world model scenarios, showcasing capabilities beyond perception and generation.

Open Questions Unanswered questions from this research

1) How can SenseNova-U1's performance be optimized in low-resource environments? The current model may underperform when training data is limited, so further research is needed to improve its applicability in low-resource settings.
2) How can the computational cost of SenseNova-U1 be reduced? The model's complexity and computational cost are high, potentially limiting its application in resource-constrained environments. More efficient computational methods are needed.
3) What is the potential of SenseNova-U1 in real-time multimodal tasks? While the model performs well across tasks, there is still room for improvement in real-time application scenarios.
4) How can the model's semantic consistency and visual fidelity be further enhanced? Although SenseNova-U1 demonstrates strong semantic consistency and visual fidelity across tasks, there is still room for improvement in certain specific tasks.
5) How does SenseNova-U1 perform in more complex scenarios? Further research is needed to assess its performance on more complex multimodal tasks, especially those that must be handled efficiently.

Applications

Immediate Applications

Advertising Generation

SenseNova-U1 can be used to generate high-quality advertising images, suitable for advertising companies needing rapid visual content generation.

Educational Infographics

Using SenseNova-U1 to generate complex text-rich infographics can be applied in education to help students better understand complex concepts.

Intelligent Assistants

SenseNova-U1 can be used to develop smarter virtual assistants that better understand and generate multimodal information, enhancing user experience.

Long-term Vision

Virtual Reality

SenseNova-U1 has significant potential in virtual reality, providing a more realistic visual and language interaction experience.

Autonomous Driving

With SenseNova-U1's multimodal understanding and generation capabilities, autonomous driving systems could better understand complex traffic environments, which may improve safety.

Abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
