Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
SciCrafter evaluates AI's discovery-to-application ability in Minecraft; current models achieve only 26% success.
Zhou Ziheng, Huacong Tang, Jinyuan Zhang et al.
Proposes an LLM-based evaluation framework that improves the accuracy of math-reasoning assessment beyond the limits of symbolic math checkers.
Erez Yosef, Oron Anschel, Shunit Haviv Hakimi et al.
AgentSearchBench improves agent search ranking quality using execution signals, bridging the gap between semantics and performance.
Bin Wu, Arastun Mammadli, Xiaoyu Zhang et al.
A-MAR framework enhances multimodal art retrieval explanation quality through structured reasoning plans.
Shuai Wang, Hongyi Zhu, Jia-Hong Huang et al.
SafetyALFRED evaluates the safety planning of multimodal LLMs in kitchen settings, finding strong hazard recognition but low risk-mitigation success.
Josue Torres-Fonseca, Naihao Deng, Yinpei Dai et al.
Large language models exhibit normative conformity; the study reveals its underlying mechanisms.
Mikako Bito, Keita Nishimoto, Kimitaka Asatani et al.
MathNet provides a global multimodal benchmark for mathematical reasoning and retrieval, covering 30,676 Olympiad-level problems from 47 countries.
Shaden Alshammari, Kevin Wen, Abrar Zainal et al.
BLF system achieves state-of-the-art binary forecasting performance on ForecastBench using sequential Bayesian updating of linguistic beliefs.
Kevin Murphy
ClawEnvKit automates environment generation for claw-like agents, reducing costs by 13,800x.
Xirui Li, Ming Li, Derry Xu et al.
Proposed DeepInsightTheorem framework enhances informal theorem proving by identifying core techniques, significantly outperforming baselines.
Yunhe Li, Hao Shi, Bowen Deng et al.
Using the CompCQ framework, this study analyzes LLM-generated competency questions across domains, revealing characteristic patterns in how they are generated.
Reham Alharbi, Valentina Tamma, Terry R. Payne et al.
The study shows language models exhibit strong spatial transfer in shortest path problems but fail in length scaling due to recursive instability.
Yao Tong, Jiayuan Ye, Anastasia Borovykh et al.
Diagnoses LLM-judge reliability using transitivity analysis and conformal prediction sets, revealing that 33%-67% of documents contain at least one 3-cycle.
Manan Gupta, Dhruv Kumar
Study reveals that LLMs and VLMs struggle to understand viewpoint rotation without vision, proposes the VRUBench dataset, and improves performance via selective fine-tuning.
Zhen Yang, Ping Jian, Zhongbin Guo et al.
Introduces the IRS framework, enhancing multimodal humor understanding with incongruity-resolution supervision; a 72B model approaches expert-level performance on NYCC.
Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu et al.
Policy-Guided Hybrid Simulation (PGHS) achieves 8.80% group simulation error on Meituan, improving over baselines by 45.8% and 40.9%.
Ziyang Chen, Renbing Chen, Daowei Li et al.
HippoCamp benchmarks multimodal file-management agents, revealing limitations in realistic user environments, with a top accuracy of only 48.3%.
Zhe Yang, Shulin Tian, Kairui Hu et al.
Proposed a Markovian framework for auditing agentic AI reliability and oversight cost, improving state-action blind mass by 12.53%.
Biplab Pal, Santanu Bhattacharya
OS-Themis framework improves GUI agent performance by 10.3% on AndroidWorld using a multi-agent critic mechanism.
Zehao Li, Zhenyu Wu, Yibo Zhao et al.
The Box Maze framework reduces the LLM reasoning error rate to below 1% through memory grounding, structured inference, and boundary enforcement.
Zou Qiang