Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
VEGA-3D leverages implicit 3D priors in video generation models to enhance scene understanding.
Xianjin Wu, Dingkang Liang, Tianrui Feng et al.
Matryoshka Gaussian Splatting (MGS) enables continuous level-of-detail (LoD) control without sacrificing full-capacity rendering quality.
Zhilin Guo, Boqiao Zhang, Hakan Aktas et al.
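A minimal sketch of one way continuous LoD could work, assuming the Gaussians carry an importance score so that any sorted prefix forms a valid coarser model; MGS's actual training and ranking scheme is not reproduced here, and the `importance` array is a placeholder:

```python
import numpy as np

def lod_subset(gaussians: np.ndarray, importance: np.ndarray, fraction: float) -> np.ndarray:
    """Keep the top-`fraction` Gaussians by importance.

    Sorting once by importance yields nested ("matryoshka") subsets:
    every lower-detail model is a prefix of every higher-detail one,
    so detail varies continuously without retraining or duplication.
    """
    order = np.argsort(-importance)                   # most important first
    k = max(1, int(round(fraction * len(gaussians))))
    return gaussians[order[:k]]

# Toy usage: 10k Gaussians with 8 attributes each (position, scale, ...).
rng = np.random.default_rng(0)
g, imp = rng.normal(size=(10_000, 8)), rng.random(10_000)
print(lod_subset(g, imp, 0.1).shape)   # coarse model for distant views
print(lod_subset(g, imp, 1.0).shape)   # full-capacity rendering
```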
MonoArt uses progressive structural reasoning for monocular 3D reconstruction, achieving improved accuracy and speed on the PartNet-Mobility dataset.
Haitian Li, Haozhe Xie, Junxiang Xu et al.
SAMA achieves instruction-guided video editing through semantic anchoring and motion alignment, significantly enhancing editing precision and motion consistency.
Xinyao Zhang, Wenkai Dong, Yuxin Song et al.
MoTok reduces trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029 on HumanML3D.
Chenyang Gu, Mingyuan Zhang, Haozhe Xie et al.
EffectErase uses reciprocal learning for high-quality video object removal and insertion, leveraging the VOR dataset.
Yang Fu, Yike Zheng, Ziyun Dai et al.
Spectrally-guided per-instance diffusion noise schedules enhance low-step generative quality.
Carlos Esteves, Ameesh Makadia
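A hedged sketch of the general idea, assuming a per-image high-frequency energy ratio warps a shared cosine schedule; the warping rule below is illustrative, not the authors' formulation:

```python
import numpy as np

def high_freq_ratio(img: np.ndarray) -> float:
    """Fraction of spectral energy above half the Nyquist radius."""
    energy = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(img.shape[0]))
    fx = np.fft.fftshift(np.fft.fftfreq(img.shape[1]))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2) / 0.5  # 1 = Nyquist
    return float(energy[radius > 0.5].sum() / energy.sum())

def per_instance_alpha_bar(img: np.ndarray, steps: int = 8) -> np.ndarray:
    """Cosine schedule warped per instance (hypothetical rule):
    images richer in high frequencies retain more signal at early
    steps, so a low-step sampler spends its budget where detail is."""
    t = np.linspace(0.0, 1.0, steps)
    warp = 0.5 + high_freq_ratio(img)     # larger -> later noise onset
    return np.cos(0.5 * np.pi * t ** warp) ** 2

img = np.random.default_rng(0).normal(size=(64, 64))
print(per_instance_alpha_bar(img))
```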
DriveTok leverages 3D deformable cross-attention for efficient multi-view reconstruction and understanding, achieving strong results on the nuScenes dataset.
Dong Zhuo, Wenzhao Zheng, Sicheng Zuo et al.
DreamPartGen achieves semantically grounded part-level 3D generation via collaborative latent denoising, improving geometric fidelity by 53%.
Tianjiao Yu, Xinzhuo Li, Muntasir Wahed et al.
State Space Models (SSMs) outperform Vision Transformers (ViTs) as vision encoders in VLMs, especially on VQA and localization tasks.
Shang-Jui Ray Kuo, Paola Cascante-Bonilla
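For context, a minimal diagonal linear SSM scanning a patch-token sequence; real vision SSMs (e.g., Mamba-style blocks) add input-dependent parameters and multi-directional scans, which this sketch omits:

```python
import numpy as np

class DiagonalSSM:
    """h_t = A * h_{t-1} + B x_t,  y_t = C h_t  (A diagonal)."""

    def __init__(self, dim: int, state: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = np.exp(-rng.uniform(0.1, 1.0, size=state))  # stable decay
        self.B = rng.normal(scale=0.1, size=(state, dim))
        self.C = rng.normal(scale=0.1, size=(dim, state))

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h, ys = np.zeros(len(self.A)), []
        for t in range(x.shape[0]):          # recurrent scan over tokens
            h = self.A * h + self.B @ x[t]
            ys.append(self.C @ h)
        return np.stack(ys)

tokens = np.random.default_rng(1).normal(size=(196, 64))  # 14x14 patches
print(DiagonalSSM(dim=64, state=16)(tokens).shape)        # (196, 64)
```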
ARIADNE uses DPO and RL for coronary angiography, achieving a centerline Dice of 0.838.
Zhan Jin, Yu Luo, Yizhou Zhang et al.
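For reference, Dice measures overlap as 2|A∩B| / (|A| + |B|); the sketch below computes plain Dice over binary masks and omits the skeletonization that centerline (clDice-style) variants add:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Plain Dice overlap between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return float(2.0 * np.logical_and(pred, gt).sum() / denom) if denom else 1.0
```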
Spatio-Temporal Token Scoring (STTS) improves video VLM efficiency by 62% with minimal performance drop.
Jianrui Zhang, Yue Yang, Rohun Tripathi et al.
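A hedged sketch of score-then-prune over spatio-temporal tokens, using temporal change magnitude as a stand-in for the learned STTS score; keeping ~38% of tokens mirrors the reported 62% reduction:

```python
import numpy as np

def prune_video_tokens(tokens: np.ndarray, keep: float = 0.38):
    """tokens: (T, N, D) frame-patch embeddings.

    Scores each token by how much it changed since the previous frame
    (a stand-in heuristic), then keeps the top `keep` fraction so the
    language model attends to far fewer visual tokens.
    """
    T, N, D = tokens.shape
    scores = np.abs(np.diff(tokens, axis=0, prepend=tokens[:1])).sum(-1)  # (T, N)
    k = max(1, int(keep * T * N))
    kept = np.argsort(-scores.reshape(-1))[:k]
    return tokens.reshape(T * N, D)[kept], kept

toks = np.random.default_rng(0).normal(size=(16, 196, 256))
pruned, idx = prune_video_tokens(toks)
print(pruned.shape)   # ~38% of the 16*196 tokens survive
```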
Loc3R-VLM enables language-based localization and 3D reasoning from monocular video input, outperforming existing methods.
Kevin Qu, Haozhe Qi, Mihai Dusmanu et al.
LoST efficiently tokenizes 3D shapes by semantic salience for autoregressive generation, using only 0.1%-10% of tokens.
Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero et al.
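A minimal sketch of salience-ordered selection under a tiny token budget; how LoST actually scores, orders, and decodes shape tokens is not reproduced here, and `salience` is a placeholder for its learned score:

```python
import numpy as np

def salient_prefix(tokens: np.ndarray, salience: np.ndarray, budget: float) -> np.ndarray:
    """Sort tokens most-salient-first and keep a `budget` prefix
    (e.g., 0.001-0.1, matching the 0.1%-10% range), so an autoregressive
    decoder spends its short sequence on the most meaningful structure."""
    order = np.argsort(-salience)
    k = max(1, int(budget * len(tokens)))
    return tokens[order[:k]]
```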
Video models exhibit reasoning via a Chain-of-Steps mechanism that unfolds across diffusion denoising steps.
Ruisi Wang, Zhongang Cai, Fanyi Pu et al.
SegviGen repurposes 3D generative models for part segmentation, achieving a 40% improvement in interactive segmentation using only 0.32% of labeled data.
Lin Li, Haoran Feng, Zehuan Huang et al.
MessyKitchens achieves high-precision monocular 3D scene reconstruction using the MOD algorithm, significantly enhancing the physical plausibility of inter-object contacts.
Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati et al.
M^3 integrates multi-view foundation models with monocular Gaussian splatting SLAM, reducing ATE RMSE by 64.3%.
Kerui Ren, Guanghao Li, Changjian Jiang et al.
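For reference, ATE RMSE is the root-mean-square of per-pose position error between the estimated and ground-truth trajectories; a minimal sketch assuming the usual similarity alignment (e.g., Umeyama) has already been applied:

```python
import numpy as np

def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """est, gt: (N, 3) aligned camera positions."""
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))
```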
PUMA improves success rate by 6.3% in dynamic environments using historical optical flow and world queries.
Heng Fang, Shangru Li, Shuhan Wang et al.
GlyphPrinter enhances glyph accuracy using Region-Grouped Direct Preference Optimization, surpassing existing methods.
Xincheng Shuai, Ziye Li, Henghui Ding et al.
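For context, the standard DPO objective that region-grouped variants build on; GlyphPrinter's grouping of preferences by glyph region is not shown here:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).

    logp_w / logp_l: policy log-probs of preferred / rejected samples;
    ref_*: the same log-probs under a frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.logaddexp(0.0, -margin))   # numerically stable -log(sigmoid)
```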