Visual-ERM: Reward Modeling for Visual Equivalence
Visual-ERM enhances vision-to-code tasks with fine-grained visual rewards, significantly outperforming existing models.
Ziyu Liu, Shengyuan Ding, Xinyu Fang et al.
STEVO-Bench evaluates video world models' ability to evolve state during observation interruptions, revealing limitations.
Ziqi Ma, Mengzhan Liufu, Georgia Gkioxari
InterEdit uses Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment for multi-human 3D motion editing.
Yebin Yang, Di Wen, Lei Qi et al.
Introduces Alternating Gradient Flow (AGF) to prevent structural collapse under 75% compression on ImageNet-1K.
Tianhao Qian, Zhuoxuan Li, Jinde Cao et al.
EVATok achieves efficient visual autoregressive generation with adaptive video tokenization, saving 24.4% tokens on average.
Tianwei Xiong, Jun Hao Liew, Zilong Huang et al.
MM-CondChain uses VPIR for visually grounded deep compositional reasoning, with the top model achieving only 53.33 Path F1.
Haozhan Shen, Shilin Yan, Hongwei Xue et al.
OmniStream achieves perception, reconstruction, and action in visual streams using causal spatiotemporal attention and 3D-RoPE, excelling across 29 datasets.
Yibin Yan, Jilan Xu, Shangzhe Di et al.
DreamVideo-Omni achieves multi-subject video customization with latent identity reinforcement learning, enhancing identity fidelity and motion control precision.
Yujie Wei, Xinyu Liu, Shiwei Zhang et al.
AutoGaze autoregressively selects multi-scale video patches, reducing redundancy and enhancing efficiency, enabling 1K-frame 4K video processing.
Baifeng Shi, Stephanie Fu, Long Lian et al.
EndoCoT activates MLLMs' reasoning potential, achieving 92.1% accuracy, 8.3% higher than the baseline.
Xuanlang Dai, Yujie Zhou, Long Xing et al.
BiGain enhances diffusion models by frequency separation, improving classification accuracy by 7.15% and FID by 0.34.
Jiacheng Liu, Shengkun Tang, Jiacheng Cui et al.
RDNet enhances salient object detection in optical remote sensing images using dynamic adaptive modules.
Bin Wan, Runmin Cong, Xiaofei Zhou et al.
The O3N framework achieves state-of-the-art performance on the QuadOcc and Human360Occ benchmarks using a polar-spiral topology for 360° spatial representation.
Mengfei Duan, Hao Shi, Fei Teng et al.
Introduces the UniCAC benchmark to evaluate 24 algorithms under various optical aberrations.
Xiaolong Qian, Qi Jiang, Yao Gao et al.
The COMIC system uses LLM critics to generate sketch-comedy videos approaching professional quality.
Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman et al.
V2M-Zero generates time-aligned music from video using event curves, achieving significant improvements in audio quality and beat alignment across datasets.
Yan-Bo Lin, Jonah Casebeer, Long Mai et al.
DynVLA uses Dynamics CoT to predict compact world dynamics, excelling on datasets like NAVSIM.
Shuyao Shang, Bing Zhan, Yunfei Yan et al.