cs.CV 2606.14703

Gaze Heads: How VLMs Look at What They Describe

Identifies a small set of attention heads in VLMs, termed gaze heads, that causally track the region currently being described, enabling effective inference-time control via attention masks.

Rohit Gandikota, David Bau

2026-06-13 48
cs.CL 2606.14626

Characterizing Cultural Localization in AI-Generated Stories

Proposes a method combining lexical token analysis and multi-word similarity to quantify cultural localization in AI-generated stories, revealing that only 9-17% of vocabulary accounts for cultural differences.

Shaily Bhatt, Supriti Vijay, Jeremiah Milbauer et al.

2026-06-13 55
cs.CV 2606.13679

InterleaveThinker: Reinforcing Agentic Interleaved Generation

InterleaveThinker employs a multi-agent framework with a planner and a critic, trained with step-wise reinforcement learning, to achieve high-quality interleaved text-image generation and improve benchmark performance by over 50%.

Dian Zheng, Harry Lee, Manyuan Zhang et al.

2026-06-12 70
cs.CV 2606.13676

Modality Forcing for Scalable Spatial Generation

Proposes Modality Forcing, a post-training method that enables a single DiT model to jointly generate image and sparse depth data, achieving a 57% reduction in AbsRel and scaling with model size.

Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski et al.

2026-06-12 98
cs.CL 2606.13634

Operads for compositional reasoning in LLMs

Introduces operads as a formal framework for question decomposition, with operadic consistency correlating strongly with model accuracy across multiple datasets.

Nathaniel Bottman, Kyle Richardson

2026-06-12 1 citation 62