Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
Vision pathways dominate action generation in VLA models; language sensitivity is task-dependent.
Bryce Grant, Xijia Zhao, Peng Wang
MonoArt uses progressive structural reasoning for monocular 3D reconstruction, achieving improved accuracy and speed on the PartNet-Mobility dataset.
Haitian Li, Haozhe Xie, Junxiang Xu et al.
NavTrust benchmarks embodied navigation robustness by systematically introducing RGB, depth, and instruction corruptions, revealing significant robustness gaps in current models.
Huaide Jiang, Yash Chaudhary, Yuping Wang et al.
The MoTok method reduces trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029 on HumanML3D.
Chenyang Gu, Mingyuan Zhang, Haozhe Xie et al.
SAMA achieves instruction-guided video editing through semantic anchoring and motion alignment, significantly enhancing editing precision and motion consistency.
Xinyao Zhang, Wenkai Dong, Yuxin Song et al.
EffectErase uses reciprocal learning for high-quality video object removal and insertion, leveraging the VOR dataset.
Yang Fu, Yike Zheng, Ziyun Dai et al.
F2LLM-v2 offers efficient multilingual embeddings using a two-stage training and matryoshka learning, supporting over 200 languages.
Ziyin Zhang, Zihan Liao, Hang Yu et al.
Spectrally-guided per-instance diffusion noise schedules enhance low-step generative quality.
Carlos Esteves, Ameesh Makadia
A new algorithm for online learning with ranking feedback addresses settings where traditional numeric feedback is unavailable.
Mingyang Liu, Yongshan Chen, Zhiyuan Fan et al.
Nemotron-Cascade 2 achieves top-tier reasoning with Cascade RL and multi-domain distillation in a 30B MoE model.
Zhuolin Yang, Zihan Liu, Yang Chen et al.
DriveTok leverages 3D deformable cross-attention for efficient multi-view reconstruction and understanding, excelling on the nuScenes dataset.
Dong Zhuo, Wenzhao Zheng, Sicheng Zuo et al.
DreamPartGen achieves semantically grounded part-level 3D generation via collaborative latent denoising, improving geometric fidelity by 53%.
Tianjiao Yu, Xinzhuo Li, Muntasir Wahed et al.
State Space Models (SSMs) outperform Vision Transformers (ViTs) as vision encoders in VLMs, especially in VQA and localization tasks.
Shang-Jui Ray Kuo, Paola Cascante-Bonilla
A cost-aware evasion framework reveals robustness gaps in phishing detection; the median evasion cost is 2, with over 80% of attacks exploiting just three low-cost features.
Julian Allagan, Mohamed Elbakary, Zohreh Safari et al.
OmniVTA integrates predictive contact modeling with high-frequency tactile feedback for breakthroughs in contact-rich manipulation.
Yuhang Zheng, Songen Gu, Weize Li et al.
FASTER introduces a Horizon-Aware Schedule that significantly reduces reaction latency in VLA models.
Yuxiang Lu, Zhe Liu, Xianzhe Fan et al.
The study explores how LLMs encode auditory knowledge and its impact on audio language models.
Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al.
OS-Themis framework improves GUI agent performance by 10.3% on AndroidWorld using a multi-agent critic mechanism.
Zehao Li, Zhenyu Wu, Yibo Zhao et al.
Sparse Autoencoders reveal interpretable and steerable features in VLA models, enhancing generalization on the LIBERO benchmark.
Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer et al.
Box Maze framework reduces LLM reasoning error rate to below 1% through memory grounding, structured inference, and boundary enforcement.
Zou Qiang