From Pixels to Words -- Towards Native One-Vision Models at Scale

TL;DR

NEO-ov, a fully native end-to-end vision-language model, supports multi-image and video understanding with superior fine-grained perception and spatial reasoning.

cs.CV 🔴 Advanced 2026-05-28 95 views

Haiwen Diao Jiahao Wang Penghao Wu Yuhao Dong Yuwei Niu Yue Zhu Zhongang Cai Weichen Fan Linjun Dai Silei Wu Xuanyu Zheng Mingxuan Li Yuanhan Zhang Bo Li Hanming Deng Huchuan Lu Quan Wang Lei Yang Lewei Lu Dahua Lin Ziwei Liu

AI Reader Arxiv Page Download PDF

multimodal learning end-to-end model video understanding spatial reasoning native architecture

Key Findings

Methodology

This paper introduces NEO-ov, a pure autoregressive decoder-only architecture that eliminates external visual encoders, directly learning cross-frame and pixel-word correspondences from raw inputs. The model employs a unified serialization scheme, concatenating multiple images, video frames, and text into a continuous sequence, enabling seamless modeling of multi-image and temporal dependencies. Spatial-temporal relationships are captured via a novel hybrid attention mechanism, integrating decoupled spatial and temporal rotary position encodings (RoPE). The training process involves three stages: large-scale image-text pretraining on 20 million pairs, cross-modal spatial-temporal reasoning enhancement on 60 million samples, and high-quality instruction tuning with 4 million single-image, 1 million multi-image, and 1 million video samples. This comprehensive training strategy ensures robust generalization across diverse multimodal tasks, including image understanding, video comprehension, and spatial reasoning.

Key Results

On the VLMEvalKit benchmark, NEO-ov achieves top performance with 54.7% accuracy on general VQA tasks, surpassing encoder-based models like Qwen-VL and InternVL series. In video understanding, it reaches 53.9% on VideoMME and 60.4% on MLVU, demonstrating strong temporal reasoning. For spatial intelligence, the model scores over 78% on VSI-Bench and GeoThinker, indicating excellent geometric and spatial perception. Ablation studies confirm that native attention mechanisms outperform traditional encoder-based attention, especially in OCR-intensive and spatial reasoning tasks, with improvements exceeding 10%. Progressive training stages significantly boost performance on long sequences and high-resolution inputs, validating the effectiveness of end-to-end learning. These results collectively demonstrate that a unified native architecture can achieve competitive performance across a broad spectrum of multimodal tasks.
The results highlight NEO-ov’s ability to perform detailed pixel-level perception and complex spatial reasoning without relying on external visual encoders. Its capacity to handle multi-image and long video sequences with high accuracy underscores its potential for real-world applications requiring fine-grained understanding, such as autonomous navigation, robotic perception, and advanced scene analysis. The model’s scalability and training efficiency open new avenues for developing general-purpose multimodal foundation models that are simpler, more integrated, and more capable than traditional modular systems.

Significance

This work fundamentally shifts the paradigm of multimodal AI from modular pipelines to unified, end-to-end architectures. By removing the dependency on pretrained visual encoders, NEO-ov simplifies the model design while significantly enhancing fine-grained perception and spatial reasoning. Its strong performance across diverse tasks demonstrates that native models can rival or surpass encoder-based counterparts, especially in complex reasoning and detailed spatial understanding. This breakthrough paves the way for more scalable, efficient, and versatile multimodal systems, enabling applications in autonomous vehicles, intelligent robotics, augmented reality, and beyond. The ability to learn directly from raw pixels and text in a single unified framework addresses longstanding bottlenecks in multimodal modeling, fostering a new generation of AI systems capable of more natural and comprehensive understanding of the visual world.

Technical Contribution

The primary technical innovation lies in designing a decoder-only, end-to-end native architecture that directly models pixel-to-word and pixel-to-pixel relationships without external encoders. The model introduces a novel spatial-temporal hybrid attention mechanism, leveraging decoupled rotary position encodings (RoPE) for spatial and temporal relations, enabling efficient cross-frame and cross-image reasoning. The training pipeline combines large-scale image-text pretraining, multimodal reasoning enhancement, and instruction tuning, ensuring robust generalization. This architecture reduces complexity, improves fine-grained perception, and enhances spatial reasoning capabilities, setting a new standard for native multimodal models. The approach also demonstrates that such models can scale effectively, matching or exceeding the performance of traditional encoder-based systems on multiple benchmarks.

Novelty

This paper is the first to propose a fully native, encoder-free, decoder-only architecture capable of unified multi-image, video, and spatial reasoning. Unlike prior works that rely on pretrained visual encoders, NEO-ov learns directly from raw pixels, employing a unified serialization scheme and a hybrid attention mechanism that captures spatial and temporal dependencies simultaneously. Its end-to-end training strategy and innovative positional encoding enable a seamless integration of multi-frame and multi-image inputs, pushing the boundaries of what native models can achieve. This represents a significant departure from existing modular or hybrid approaches, establishing a new paradigm for scalable, unified multimodal foundation models.

Limitations

Despite impressive results, NEO-ov still faces challenges in OCR-dense and document understanding tasks due to limited specialized OCR training data. Its performance in complex textual layouts and dense information extraction needs further improvement.
Handling ultra-high-resolution images and very long video sequences incurs high computational costs, limiting real-time deployment in resource-constrained environments. Future work should focus on model compression and efficiency.
The current training data, although large, lacks sufficient diversity in certain complex spatial and temporal scenarios, which may hinder the model’s generalization in highly dynamic or cluttered environments. Expanding dataset diversity and annotations is crucial.

Future Work

未来的研究将集中在扩大多模态高质量数据集，特别是复杂文本和空间关系的标注，以提升模型在细粒度和复杂推理任务中的表现。同时，优化模型结构以提升推理速度和效率，支持更大规模的参数和更长序列的处理。探索多模态融合的深层机制，增强模型的推理能力和泛化能力，也是未来的重要方向。此外，将模型应用于自动驾驶、机器人导航、智能医疗等实际场景，推动多模态AI的落地和普及，将是持续努力的目标。

AI Executive Summary

在人工智能的多模态研究中，视觉与语言的深度融合一直是核心难题之一。传统方法多依赖预训练的视觉编码器，将图像或视频先转化为高层次的语义特征，再进行跨模态对齐。这种模块化架构虽然在某些任务中取得了成功，但在细粒度感知、空间推理和长序列理解方面存在明显局限。模型的碎片化设计导致像素级信息在多阶段处理过程中被稀释，早期像素-词交互也被削弱，限制了模型的潜力。与此同时，现有的原生模型虽然在单图像任务中表现出色，但在多图像和视频理解中仍未充分探索，难以应对复杂的空间-时间推理需求。

为解决这一系列问题，本文提出了NEO-ov，一种纯粹端到端的原生一体化视觉-语言模型架构。该模型摒弃了传统的外部视觉编码器，采用单一的decoder-only结构，通过引入统一的序列化方案，将多图像、视频帧和文本融合为连续的输入序列。核心创新在于空间-时间混合注意力机制和旋转位置编码（RoPE），实现跨模态的空间和时间关系建模。训练方面，模型经过三阶段优化：大规模图文预训练，跨模态空间-时间推理增强，以及高质量指令调优，确保模型在多任务、多场景中的泛化能力。

实验结果显示，NEO-ov在多个公开数据集上均优于传统模块化模型，尤其在细粒度感知、空间推理和长序列理解任务中表现突出。例如，在VLMEvalKit的图像理解任务中，最高达54.7%的准确率，超越同期主流模型；在视频理解任务中，性能提升明显，达到了60.4%的准确率。模型还在空间智能任务中展现出优异的几何推理能力，验证了其在复杂空间关系建模中的潜力。这些成果不仅彰显了纯粹端到端架构的优势，也为未来多模态AI的发展提供了新思路。

整体而言，NEO-ov的提出标志着多模态基础模型从模块化向一体化的重大转变。它简化了模型结构，提升了细粒度感知和空间推理能力，为自动驾驶、智能机器人、场景理解等应用提供了强有力的技术支撑。未来，通过扩大训练数据规模、优化模型效率，预计该架构将在多模态AI领域引领新一轮创新浪潮。

Deep Analysis

Background

多模态学习近年来经历了快速发展，尤其是在视觉和语言融合方面。早期的研究多依赖预训练的视觉编码器（如CLIP、DINO）将图像转化为高层次语义向量，再通过大规模语言模型（如GPT、BERT）进行推理。这种模块化架构在图像识别、视觉问答和多模态检索中取得了显著成功。然而，随着任务复杂度的提升，单纯的语义特征已无法满足细粒度感知和空间推理的需求。为此，研究者开始探索原生模型（Native Models），即直接从像素到文本的端到端学习方式，减少中间表示的损失，提升模型的细节感知能力。代表性工作包括Fuyu、EVE和NEO等，它们在单图像任务中表现优异，但在多图像和视频理解方面仍存在局限。传统模型的主要瓶颈在于多阶段处理带来的信息碎片化，以及跨帧、跨图像的空间-时间关系建模不足。近年来，随着大规模多模态数据的积累和计算能力的提升，端到端原生模型逐渐成为研究热点，旨在实现更为简洁高效的多模态理解架构。

Core Problem

当前多模态模型普遍依赖预训练的视觉编码器，导致像素级信息在多阶段处理过程中被压缩和稀释，限制了模型对细节的感知能力。此外，模块化架构在跨图像和跨帧的空间-时间推理中表现不足，难以应对复杂场景中的连续性和细粒度关系。尤其是在多图像和视频理解任务中，现有模型难以实现端到端的像素级关联，导致推理精度和空间理解能力受限。更严重的是，传统模型在处理高分辨率图像和长视频序列时，计算成本高昂，难以满足实际应用中的实时性需求。解决这些问题的关键在于设计一种无需外部视觉编码器、能够端到端学习空间-时间关系的统一架构，从而提升模型的细粒度感知和推理能力，推动多模态AI向更高层次发展。

Innovation

本研究的核心创新在于提出NEO-ov架构，完全摒弃预训练视觉编码器，采用纯粹的decoder-only端到端模型，实现像素到文本的直接学习。具体创新点包括：• 统一序列化方案：将多图像、视频帧和文本融合为连续序列，支持跨模态的空间和时间建模。• 空间-时间混合注意力机制：引入空间和时间解耦的注意力设计，通过旋转位置编码（RoPE）实现跨模态的空间-时间关系捕获。• 多阶段训练策略：结合大规模图文预训练、跨模态空间-时间推理增强和高质量指令调优，提升模型在多任务中的表现。• 端到端学习：模型在训练过程中同时优化像素级感知、空间关系和跨模态对齐，避免多阶段信息传递带来的误差积累。这些创新共同推动了多模态模型的简洁性和性能提升，为未来的多模态基础模型提供了新范式。

Methodology

�� 输入处理：将图像或视频帧通过轻量级卷积层提取局部特征，生成视觉tokens，文本通过标准的LLM tokenizer转化为文本tokens。• 序列融合：将多图像或视频帧的视觉tokens按照时间顺序插入到文本序列中，形成连续的多模态输入序列。• 位置编码：采用空间-时间旋转位置编码（RoPE），在不同模态中引入空间和时间的相对位置关系。• 注意力机制：设计空间-时间混合注意力，允许同一视觉单元内部的像素-像素和像素-词交互，同时跨视觉单元保持因果关系，实现全局空间-时间推理。• 训练策略：分三阶段进行，包括大规模图文预训练（20M对），跨模态空间-时间推理增强（60M样本），以及高质量指令调优（4M单图、1M多图、1M视频样本），确保模型在多任务、多场景中的泛化能力。• 模型优化：采用自回归目标，最大化像素到文本的条件概率，确保模型在像素级别的细粒度感知和跨帧推理中表现优异。

Experiments

实验采用公开的多模态数据集，包括VLMEvalKit、VideoMME、MVBench、MMSI和空间智能基准如VSI-Bench等。模型在不同参数规模（2B和8B）下进行训练和评估，比较对象涵盖主流的编码器基础模型（如Qwen-VL、InternVL系列）和原生模型（如Fuyu、EVE、NEO）。评估指标包括准确率、F1值和任务特定的性能指标。实验设计中，模型在多任务场景下进行多轮训练，逐步验证不同训练阶段对性能的提升效果。还通过消融实验比较Native注意力与传统编码器的性能差异，验证端到端架构的优势。模型在图像理解、视频理解、空间推理等多个任务中均取得了优异成绩，特别是在细粒度感知和长序列推理方面表现出明显优势。

Results

在多项任务中，NEO-ov在图像理解任务如MMMU中达到54.7%的最高准确率，超越大部分预训练编码器模型。在视频理解方面，模型在VideoMME和MLVU等数据集上分别达到了53.9%和60.4%的性能，显示出强大的跨帧推理能力。在空间智能任务中，模型在VSI-Bench和GeoThinker上表现优异，几何推理准确率超过78%。消融实验显示Native注意力机制在OCR和空间推理任务中优于传统编码器，提升幅度达10%以上。逐步训练策略显著改善模型在长序列和高分辨率任务中的表现，验证了端到端学习的有效性。这些结果共同证明了NEO-ov在多模态、多场景中的强大适应性和竞争力。

Applications

该模型可广泛应用于自动驾驶中的场景理解、机器人导航、多模态虚拟助手、智能监控和增强现实等领域。其端到端架构使得模型在复杂环境中实现实时感知与推理成为可能，减少了对外部视觉编码器的依赖，降低了系统复杂度。未来，随着模型规模和训练数据的不断扩大，NEO-ov有望在更高精度和更复杂场景中实现自主决策和多模态交互，推动智能系统的普及与落地。

Plain Language Accessible to non-experts

想象你在一家大型厨房里做饭，所有的食材、工具都在不同的地方散落。传统的做法就像是每次只用一个专门的厨师（视觉编码器）来准备食材，然后再由另一个厨师（语言模型）来做菜。这种方式虽然可以做出不错的菜，但每个厨师都只专注于自己的一部分，信息在传递过程中可能会丢失或变形。而本文提出的办法，就像是让一个超级厨师（NEO-ov）自己一站式完成所有准备工作，从原料到调料，从切割到烹饪，全部在一个厨房里一气呵成。这位厨师可以同时看着所有的食材，理解它们的关系，知道什么时候需要用哪个调料，甚至能同时处理多份菜肴。这样，不仅节省了时间，还能做出更细腻、更复杂的菜肴。这个超级厨师用的秘密武器，是一种特殊的“空间-时间注意力”，让他能同时关注到不同食材的细节和它们之间的关系。通过不断练习（训练），他变得越来越擅长应对各种复杂的菜谱（任务），无论是单一食材的识别，还是多道菜的搭配，都能应付自如。未来，这样的厨师可以帮助我们在智能厨房、自动化餐厅甚至太空厨房里，做出更美味、更智能的菜肴。

ELI14 Explained like you're 14

想象你在学校的科学实验室里，准备做一份复杂的实验。以前，你需要用不同的工具和设备，比如显微镜、传感器、计算机，每个都要单独操作，最后还要把所有的结果拼在一起，才能知道实验的完整情况。这就像是用不同的专门机器来处理图片、视频和文字，然后再把它们组合起来理解。现在，假设有一种超级智能的机器人，它只用一个机器，就能自己完成所有的准备工作，从观察到分析，全部在一个设备里完成。它可以直接从原始的图片和视频中学习，不需要用别的机器提前处理过的特征。这个机器人用的秘密武器，是一种特别的“空间-时间注意力”，让它能同时关注到不同的细节和它们之间的关系。比如，它可以同时看到一张图片中的每个角落，还能理解视频中动作的变化。经过大量的训练，这个机器人变得越来越聪明，能在各种任务中表现出色，比如理解复杂的场景、识别细节，甚至推理出隐藏的关系。未来，这样的机器人可以帮助我们更好地理解世界，比如自动驾驶汽车、智能助手，甚至帮助医生分析医学影像。它的出现，让人工智能变得更像一个全能的“超级助手”，能在各种复杂环境中帮我们做出正确的判断和决策。

Abstract

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

cs.CV

References (20)

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin R. Stone et al.

2023 17013 citations ⭐ Influential View Analysis →

From Pixels to Words - Towards Native Vision-Language Primitives at Scale

Haiwen Diao, Mingxuan Li, Silei Wu et al.

2025 9 citations ⭐ Influential View Analysis →

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo et al.

2024 1243 citations ⭐ Influential View Analysis →

Decoupled Weight Decay Regularization

I. Loshchilov, F. Hutter

2017 34325 citations ⭐ Influential

Breaking the Encoder Barrier for Seamless Video-Language Understanding

Handong Li, Yiyuan Zhang, Longteng Guo et al.

2025 7 citations ⭐ Influential View Analysis →

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li, Qi Cao, Tao Tang et al.

2026 6 citations View Analysis →

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

Rui Yan, Lin Song, Yicheng Xiao et al.

2025 9 citations View Analysis →

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li et al.

2023 3467 citations View Analysis →

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Lingchen Meng, Jianwei Yang, Rui Tian et al.

2024 59 citations View Analysis →

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng et al.

2025 445 citations View Analysis →

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan et al.

2024 4150 citations View Analysis →

Towards VQA Models That Can Read

Amanpreet Singh, Vivek Natarajan, Meet Shah et al.

2019 2047 citations View Analysis →

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Gen Luo, Xue Yang, Wenhan Dou et al.

2024 85 citations View Analysis →

MMBench: Is Your Multi-modal Model an All-around Player?

Yuanzhan Liu, Haodong Duan, Yuanhan Zhang et al.

2023 2118 citations View Analysis →

MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang et al.

2023 2162 citations View Analysis →

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry, Do Xuan Long, J. Tan et al.

2022 1497 citations View Analysis →

Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

Jinhui Yi, Syed Talal Wasim, Yanan Luo et al.

2024 4 citations View Analysis →

Cambrian-S: Towards Spatial Supersensing in Video

Shusheng Yang, Jihan Yang, Pinzhi Huang et al.

2025 79 citations View Analysis →

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen et al.

2025 1372 citations View Analysis →

A Diagram is Worth a Dozen Images

Aniruddha Kembhavi, M. Salvato, Eric Kolve et al.

2016 946 citations View Analysis →

From Pixels to Words -- Towards Native One-Vision Models at Scale

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Analysis

Background

Core Problem

Innovation

Methodology

Experiments

Results

Applications

Plain Language Accessible to non-experts

ELI14 Explained like you're 14

Abstract

References (20)

Related Papers

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

EventDrive: Event Cameras for Vision-Language Driving Intelligence