Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
Vision pathways dominate action generation in VLA models; language sensitivity is task-dependent.
Bryce Grant, Xijia Zhao, Peng Wang
MonoArt uses progressive structural reasoning for monocular 3D reconstruction, achieving improved accuracy and speed on the PartNet-Mobility dataset.
Haitian Li, Haozhe Xie, Junxiang Xu et al.
NavTrust benchmarks embodied navigation robustness by systematically introducing RGB, depth, and instruction corruptions, revealing significant robustness gaps in current models.
Huaide Jiang, Yash Chaudhary, Yuping Wang et al.
The MoTok method reduces trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029 on HumanML3D.
Chenyang Gu, Mingyuan Zhang, Haozhe Xie et al.
SAMA achieves instruction-guided video editing through semantic anchoring and motion alignment, significantly enhancing editing precision and motion consistency.
Xinyao Zhang, Wenkai Dong, Yuxin Song et al.
EffectErase uses reciprocal learning for high-quality video object removal and insertion, leveraging the VOR dataset.
Yang Fu, Yike Zheng, Ziyun Dai et al.
F2LLM-v2 offers efficient multilingual embeddings using a two-stage training and matryoshka learning, supporting over 200 languages.
Ziyin Zhang, Zihan Liao, Hang Yu et al.
Spectrally-guided per-instance diffusion noise schedules enhance low-step generative quality.
Carlos Esteves, Ameesh Makadia
A new algorithm for online learning with ranking feedback addresses settings where traditional numeric feedback is unavailable.
Mingyang Liu, Yongshan Chen, Zhiyuan Fan et al.
Nemotron-Cascade 2 achieves top-tier reasoning with Cascade RL and multi-domain distillation in a 30B MoE model.
Zhuolin Yang, Zihan Liu, Yang Chen et al.
DriveTok leverages 3D deformable cross-attention for efficient multi-view reconstruction and understanding, excelling on the nuScenes dataset.
Dong Zhuo, Wenzhao Zheng, Sicheng Zuo et al.
DreamPartGen achieves semantically grounded part-level 3D generation via collaborative latent denoising, improving geometric fidelity by 53%.
Tianjiao Yu, Xinzhuo Li, Muntasir Wahed et al.
State Space Models (SSMs) outperform Vision Transformers (ViTs) as vision encoders in VLMs, especially in VQA and localization tasks.
Shang-Jui Ray Kuo, Paola Cascante-Bonilla
A cost-aware evasion framework reveals robustness gaps in phishing detection; the median evasion cost is 2, with over 80% of attacks exploiting just three low-cost features.
Julian Allagan, Mohamed Elbakary, Zohreh Safari et al.
OmniVTA integrates predictive contact modeling with high-frequency tactile feedback for breakthroughs in contact-rich manipulation.
Yuhang Zheng, Songen Gu, Weize Li et al.
FASTER introduces a Horizon-Aware Schedule that significantly reduces reaction latency in VLA models.
Yuxiang Lu, Zhe Liu, Xianzhe Fan et al.
The study explores how LLMs encode auditory knowledge and its impact on audio language models.
Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.
OS-Themis framework improves GUI agent performance by 10.3% on AndroidWorld using a multi-agent critic mechanism.
Zehao Li, Zhenyu Wu, Yibo Zhao et al.
Sparse Autoencoders reveal interpretable and steerable features in VLA models, enhancing generalization on the LIBERO benchmark.
Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer et al.
Box Maze framework reduces LLM reasoning error rate to below 1% through memory grounding, structured inference, and boundary enforcement.
Zou Qiang