MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
MM-WebAgent uses hierarchical planning and self-reflection to generate consistent multimodal webpages, improving layout and style coherence.
Yan Li, Zezi Zeng, Yifan Yang et al.
RAD-2 scales reinforcement learning in autonomous driving, reducing collision rate by 56% using a generator-discriminator framework.
Hao Gao, Shaoyu Chen, Yifan Zhu et al.
Proposes a multi-stage context enrichment strategy to improve vision-language models' performance in human emotion recognition.
Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara et al.
SegWithU models uncertainty as perturbation energy for single-forward-pass risk-aware medical image segmentation.
Tianhao Fu, Austin Wang, Charles Chen et al.
Latent-WAM achieves efficient end-to-end autonomous driving with spatially-aware and dynamics-informed latent world representations, scoring 89.3 on NAVSIM v2.
Linbo Wang, Yupeng Zheng, Qiang Chen et al.
EndoVGGT enhances surgical 3D reconstruction with DeGAT, improving PSNR by 24.6% and SSIM by 9.1%.
Falong Fan, Yi Xie, Arnis Lektauers et al.
VFIG uses vision-language models for complex figure-to-SVG conversion, achieving a VLM-Judge score of 0.829.
Qijia He, Xunmei Liu, Hammaad Memon et al.
MedObvious exposes the Medical Moravec's Paradox in VLMs via a 1,880-task benchmark for clinical triage.
Ufaq Khan, Umair Nawaz, L D M S S Teja et al.
UniGRPO optimizes text and image generation policies using GRPO, enhancing reasoning-driven visual generation quality.
Jie Liu, Zilyu Ye, Linxiao Yuan et al.
DA-Flow combines diffusion and convolutional features to enhance optical flow estimation in degraded videos.
Jaewon Min, Jaeeun Lee, Yeji Choi et al.
WildWorld dataset offers over 450 actions and explicit state annotations for generative ARPG dynamic world modeling.
Zhen Li, Zian Meng, Shuwei Shi et al.
VISOR enhances LVLM efficiency by sparsely selecting vision-language interactions, reducing inference cost.
Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas et al.
AgentRVOS combines SAM3 and MLLM for zero-shot video object segmentation, achieving leading performance.
Woojeong Jin, Jaeho Lee, Heeseong Shin et al.
3DCity-LLM enhances 3D city-scale perception with a coarse-to-fine feature encoding strategy, leveraging a 1.2M-sample dataset.
Yiping Chen, Jinpeng Li, Wenyu Ke et al.
VideoDetective enhances long video understanding by integrating extrinsic query and intrinsic relevance, boosting VideoMME-long accuracy by 7.5%.
Ruoliu Yang, Chu Wu, Caifeng Shan et al.
UNITE achieves unified tokenization and latent diffusion with an autoencoder, reaching FID 2.12 on ImageNet.
Shivam Duggal, Xingjian Bai, Zongze Wu et al.
DualCoT-VLA enhances vision-language-action models with parallel reasoning for complex tasks, achieving state-of-the-art performance.
Zhide Zhong, Junfeng Li, Junjie He et al.
3D-Layout-R1 achieves language-guided spatial layout editing via scene graph reasoning, with a 15% IoU increase and 25% reduction in center-distance error.
Haoyu Zhen, Xiaolong Li, Yilin Zhao et al.
LumosX uses relational self-attention and cross-attention for personalized video generation, enhancing face-attribute alignment.
Jiazheng Xing, Fei Du, Hangjie Yuan et al.
VideoSeek actively seeks critical evidence using video logic flow, reducing frame usage by 93% and improving LVBench accuracy by 10.2 points.
Jingyang Lin, Jialian Wu, Jiang Liu et al.