MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
MM-WebAgent uses hierarchical planning and self-reflection to generate consistent multimodal webpages, improving layout and style coherence.
Yan Li, Zezi Zeng, Yifan Yang et al.
RAD-2 scales reinforcement learning in autonomous driving, reducing collision rate by 56% using a generator-discriminator framework.
Hao Gao, Shaoyu Chen, Yifan Zhu et al.
Proposes a multi-stage context enrichment strategy to improve vision-language models' performance in human emotion recognition.
Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara et al.
SegWithU models uncertainty as perturbation energy for single-forward-pass risk-aware medical image segmentation.
Tianhao Fu, Austin Wang, Charles Chen et al.
Latent-WAM achieves efficient end-to-end autonomous driving with spatially-aware and dynamics-informed latent world representations, scoring 89.3 on NAVSIM v2.
Linbo Wang, Yupeng Zheng, Qiang Chen et al.
EndoVGGT enhances surgical 3D reconstruction with DeGAT, improving PSNR by 24.6% and SSIM by 9.1%.
Falong Fan, Yi Xie, Arnis Lektauers et al.
VFIG uses vision-language models for complex figure-to-SVG conversion, achieving a VLM-Judge score of 0.829.
Qijia He, Xunmei Liu, Hammaad Memon et al.
MedObvious exposes the Medical Moravec's Paradox in VLMs via a 1,880-task benchmark for clinical triage.
Ufaq Khan, Umair Nawaz, L D M S S Teja et al.
UniGRPO optimizes text and image generation policies using GRPO, enhancing reasoning-driven visual generation quality.
Jie Liu, Zilyu Ye, Linxiao Yuan et al.
DA-Flow combines diffusion and convolutional features to enhance optical flow estimation in degraded videos.
Jaewon Min, Jaeeun Lee, Yeji Choi et al.
WildWorld dataset offers over 450 actions and explicit state annotations for generative ARPG dynamic world modeling.
Zhen Li, Zian Meng, Shuwei Shi et al.
VISOR enhances LVLM efficiency by sparsely selecting vision-language interactions, reducing inference cost.
Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas et al.
AgentRVOS combines SAM3 and MLLM for zero-shot video object segmentation, achieving leading performance.
Woojeong Jin, Jaeho Lee, Heeseong Shin et al.
3DCity-LLM enhances 3D city-scale perception with a coarse-to-fine feature encoding strategy, leveraging a 1.2M-sample dataset.
Yiping Chen, Jinpeng Li, Wenyu Ke et al.
VideoDetective enhances long video understanding by integrating extrinsic query and intrinsic relevance, boosting VideoMME-long accuracy by 7.5%.
Ruoliu Yang, Chu Wu, Caifeng Shan et al.
UNITE achieves unified tokenization and latent diffusion with an autoencoder, reaching FID 2.12 on ImageNet.
Shivam Duggal, Xingjian Bai, Zongze Wu et al.
DualCoT-VLA enhances vision-language-action models with parallel reasoning for complex tasks, achieving state-of-the-art performance.
Zhide Zhong, Junfeng Li, Junjie He et al.
3D-Layout-R1 achieves language-guided spatial layout editing via scene graph reasoning, with a 15% IoU increase and 25% reduction in center-distance error.
Haoyu Zhen, Xiaolong Li, Yilin Zhao et al.
LumosX uses relational self-attention and cross-attention for personalized video generation, enhancing face-attribute alignment.
Jiazheng Xing, Fei Du, Hangjie Yuan et al.
VideoSeek actively seeks critical evidence using video logic flow, reducing frame usage by 93% and improving LVBench accuracy by 10.2 points.
Jingyang Lin, Jialian Wu, Jiang Liu et al.