OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
OS-Themis framework improves GUI agent performance by 10.3% on AndroidWorld using a multi-agent critic mechanism.
Zehao Li, Zhenyu Wu, Yibo Zhao et al.
Box Maze framework reduces LLM reasoning error rate to below 1% through memory grounding, structured inference, and boundary enforcement.
Zou Qiang
Proposes a reference-free simulation framework by training independent user and recommender simulators for more realistic dialogues.
Jerome Ramos, Feng Xia, Xi Wang et al.
Adaptive Domain Models leverage Bayesian distillation and warm rotation for efficient training in geometric and neuromorphic AI.
Houston Haynes
LEAFE framework internalizes recovery agency from reflective experience, enhancing Pass@k performance in long-horizon tasks.
Rui Ge, Yichao Fu, Yuyang Qian et al.
The study finds that counterfactual explanation metrics do not align with user perception, necessitating more human-centered evaluation methods.
Felix Liedeker, Basil Ell, Philipp Cimiano et al.
OpenSeeker democratizes frontier search agents by fully open-sourcing training data, utilizing controllable QA synthesis and denoised trajectory synthesis.
Yuwen Du, Rui Ye, Shuo Tang et al.
Proposes a cognitive architecture viewing the psyche as an operating system for constructing AGI.
Anton Kolonin, Vladimir Krykov
Develops a chatbot for maternal health in India using stage-aware triage and hybrid retrieval, achieving 86.7% emergency recall.
Smriti Jha, Vidhi Jain, Jianyu Xu et al.
CRYSTAL benchmark evaluates multimodal reasoning transparency using Match F1 and Ordered Match F1, revealing systematic flaws in existing models.
Wayner Barrios, SouYoung Jin
Structured distillation reduces personalized agent memory tokens by 11x while preserving retrieval capabilities.
Sydney Lewis
The study enhances performance in non-verifiable LLM post-training using reasoning LLM judges, with gpt-oss-120b as the gold standard.
Yixin Liu, Yue Yu, DiJia Su et al.
Portfolio-CEGAR-SEQ algorithm optimizes object packing and scheduling in 3D printing, reducing the number of printing plates used.
Pavel Surynek