Paper Insights - AI Arxiv Paper Analysis

cs.CV 2606.11187

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Next Forcing introduces multi-chunk prediction to accelerate training and improve accuracy in high-frame-rate video generation, achieving 94.1% success on RoboTwin.

Gangwei Xu, Qihang Zhang, Jiaming Zhou et al.

2026-06-10 64

cs.CV 2606.11186

AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference

AMNet introduces modality-agnostic inference for low-light video enhancement, maintaining high performance even when auxiliary modalities are missing and outperforming state-of-the-art methods.

Hangfeng Liang, Yutao Hu, Yanhan Hu et al.

2026-06-10 87

cs.CV 2606.11176

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Introduces Data2Story, a multi-agent framework transforming data into verifiable multimodal stories with evidence traceability and interactive content.

Kevin Qinghong Lin, Batu El, Yuhong Shi et al.

2026-06-10 87

cs.CV 2606.11148

MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

MOFA-VTON employs diffusion models with dual-region masks and cross-attention-based layout adjustment, enabling user-controlled, fine-grained virtual try-on with diverse styles.

Xiaoyu Han, Chenyang Wang, Jing Wang et al.

2026-06-10 53

cs.CV 2606.09788

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

POTATR is a lightweight 29M-parameter image-to-graph model that significantly improves page-level table extraction accuracy and efficiency.

Brandon Smock, Libin Liang, Max Sokolov et al.

2026-06-09 59

cs.CV 2606.07451

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI leverages sparse autoencoders with text conditioning to refine image embeddings, significantly improving vision-language alignment and retrieval accuracy.

Sweta Mahajan, Sukrut Rao, Jiahao Xie et al.

2026-06-06 71

cs.CV 2606.07433

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

This paper introduces a unified framework based on watching, remembering, and reasoning, significantly advancing long video understanding with multimodal LLMs.

Jiahao Meng, Yue Tan, Qi Xu et al.

2026-06-06 66

cs.CV 2606.06485

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

PAR3D introduces part-aware 3D multimodal large language models, significantly enhancing fine-grained scene understanding via the ScenePart dataset.

Shaohui Dai, Yansong Qu, You Shen et al.

2026-06-05 78

cs.CV 2606.06477

Complexity-Balanced Diffusion Splitting

Proposes Complexity-Balanced Diffusion Splitting (CBS), using Dirichlet energy and trajectory acceleration to estimate local complexity, improving synthesis quality by ~35%.

Noam Issachar, Dani Lischinski, Raanan Fattal

2026-06-05 70

cs.CV 2606.06476

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Proposes the Astra framework, which combines an RL-trained VLM policy with a Bagel-based world simulator for imagination-driven spatial reasoning, improving MMSI-Bench accuracy from 45.1% to 49.5%.

Chenming Zhu, Jingli Lin, Yilin Long et al.

2026-06-05 192

cs.CV 2606.06407

A Vision-language Framework for Comparative Reasoning in Radiology

Proposes MedReCo, an entity-aware vision-language framework with over 690,000 images for clinical case retrieval and change description.

Tengfei Zhang, Ziheng Zhao, Lisong Dai et al.

2026-06-05 80

cs.CV 2606.06390

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

HomeWorld introduces a hierarchical, multimodal framework trained on 300K real floorplans, using LLMs and diffusion models to generate controllable, diverse, and realistic whole-home scenes.

Wenbo Li, Xiaoliang Ju, Zipeng Qin et al.

2026-06-05 94

cs.CV 2606.06369

Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation

Proposes a knowledge refinement framework using automatic rule mining and ASP-based abductive reasoning, improving scene graph generation by 4-8% F1@50 across benchmarks.

Maëlic Neau, Salim Baloch, Jakob Suchan et al.

2026-06-05 95

cs.CV 2606.06363

GMBFormer: An NDVI-Guided Global Memory Bank Transformer for Urban Green-Space Extraction from Ultra-High-Resolution Imagery

GMBFormer integrates an NDVI-guided global memory bank with a Transformer for urban green-space extraction, achieving a mean IoU of 89.25%.

Hao Lei, Xi Cheng, Chenlu Shu et al.

2026-06-05 63

cs.CV 2606.06361

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

Proposes PhaseLock, a training-free framework that extracts motion priors from 2-step inference, improving physical consistency by 6.2 points on average.

Woojung Han, Seil Kang, Youngjun Jun et al.

2026-06-05 63

cs.CV 2606.02580

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Proposes SEIG, a staged framework leveraging pretrained vision-language models (VLMs) to reconstruct editable 3D scenes from a single image, achieving high fidelity in geometry, materials, and lighting.

Guangzhao He, Rundong Luo, Wei-Chiu Ma et al.

2026-06-02 111

cs.CV 2606.02569

AdaCodec: A Predictive Visual Code for Video MLLMs

AdaCodec employs predictive visual coding, transmitting full reference frames only when prediction is costly, reducing visual tokens by 84.7% and boosting long-video understanding efficiency.

Haowen Hou, Zhen Huang, Zheming Liang et al.

2026-06-02 139

cs.CV 2606.02564

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

This paper uses a VLM as a teacher for video reasoning via test-time online optimization, achieving a 16.7-point performance boost and surpassing traditional methods.

Junhao Cheng, Liang Hou, Tianxiong Zhong et al.

2026-06-02 82

cs.CV 2605.31603

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Lumos-Nexus employs two-stage training and UPFB to bridge frequencies, boosting video fidelity and reasoning-driven generation.

Jiazheng Xing, Hangjie Yuan, Lingling Cai et al.

2026-05-30 68

cs.CV 2605.31597

SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Introduces the SOCO benchmark with over 1 million keypoint pairs across 100 categories, revealing that vision foundation models encode strong semantics but transfer correspondences poorly across categories.

Olaf Dünkel, Basavaraj Sunagad, Haoran Wang et al.

2026-05-30 64