Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
SciCrafter evaluates AI's discovery-to-application ability in Minecraft; current models achieve only 26% success.
Zhou Ziheng, Huacong Tang, Jinyuan Zhang et al.
Proposes an LLM-based evaluation framework that improves the accuracy of math-reasoning assessment beyond the limits of symbolic math checkers.
Erez Yosef, Oron Anschel, Shunit Haviv Hakimi et al.
AgentSearchBench improves agent search ranking quality using execution signals, bridging the gap between semantics and performance.
Bin Wu, Arastun Mammadli, Xiaoyu Zhang et al.
A-MAR framework enhances multimodal art retrieval explanation quality through structured reasoning plans.
Shuai Wang, Hongyi Zhu, Jia-Hong Huang et al.
SafetyALFRED evaluates the safety planning of multimodal LLMs in kitchen settings, finding strong hazard recognition but low risk-mitigation success.
Josue Torres-Fonseca, Naihao Deng, Yinpei Dai et al.
Large language models exhibit normative conformity; the study reveals its underlying mechanisms.
Mikako Bito, Keita Nishimoto, Kimitaka Asatani et al.
MathNet provides a global multimodal benchmark for mathematical reasoning and retrieval, covering 30,676 Olympiad-level problems from 47 countries.
Shaden Alshammari, Kevin Wen, Abrar Zainal et al.
BLF system achieves state-of-the-art binary forecasting performance on ForecastBench using sequential Bayesian updating of linguistic beliefs.
Kevin Murphy
ClawEnvKit automates environment generation for claw-like agents, reducing costs by 13,800x.
Xirui Li, Ming Li, Derry Xu et al.
Proposed DeepInsightTheorem framework enhances informal theorem proving by identifying core techniques, significantly outperforming baselines.
Yunhe Li, Hao Shi, Bowen Deng et al.
Using the CompCQ framework, this study analyzes LLM-generated competency questions across domains, revealing characteristic patterns in how they are generated.
Reham Alharbi, Valentina Tamma, Terry R. Payne et al.
The study shows language models exhibit strong spatial transfer in shortest path problems but fail in length scaling due to recursive instability.
Yao Tong, Jiayuan Ye, Anastasia Borovykh et al.
Diagnoses LLM-judge reliability using transitivity analysis and conformal prediction sets, revealing that 33%-67% of documents contain at least one 3-cycle.
Manan Gupta, Dhruv Kumar
Study reveals that LLMs and VLMs struggle to understand viewpoint rotation without vision, proposes the VRUBench dataset, and improves performance via selective fine-tuning.
Zhen Yang, Ping Jian, Zhongbin Guo et al.
Introduces the IRS framework, enhancing multimodal humor understanding with incongruity-resolution supervision; a 72B model approaches expert-level performance on NYCC.
Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu et al.
Policy-Guided Hybrid Simulation (PGHS) achieves 8.80% group simulation error on Meituan, improving over baselines by 45.8% and 40.9%.
Ziyang Chen, Renbing Chen, Daowei Li et al.
HippoCamp benchmarks multimodal file-management agents, revealing limitations in realistic user environments, with a top accuracy of only 48.3%.
Zhe Yang, Shulin Tian, Kairui Hu et al.
Proposed a Markovian framework for auditing agentic AI reliability and oversight cost, improving state-action blind mass by 12.53%.
Biplab Pal, Santanu Bhattacharya
OS-Themis framework improves GUI agent performance by 10.3% on AndroidWorld using a multi-agent critic mechanism.
Zehao Li, Zhenyu Wu, Yibo Zhao et al.
The Box Maze framework reduces the LLM reasoning error rate to below 1% through memory grounding, structured inference, and boundary enforcement.
Zou Qiang