Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
The LearnWeak framework uses a stronger reference agent to identify a small model's weaknesses, synthesizes targeted tasks, and improves small CUAs by 11.6 (EvoCUA-8B) and 11.1 (OpenCUA-7B) percentage points on average across 8 domains.
Key Findings
Methodology
This paper introduces the LearnWeak framework, which combines weakness detection, targeted task synthesis, and behavior correction. A stronger teacher agent and the student model attempt the same tasks, and weaknesses are detected automatically by comparing their trajectories and verifying success or failure. From the resulting weakness reports, the system synthesizes domain-specific tasks using both weakness-focused and exploration strategies, guided by screenshots and environment metadata. During training, an error-aware preference optimization objective (DPO) distinguishes planning errors from execution errors, enabling fine-grained behavioral updates. Data generation proceeds by multi-round iterative expansion that concentrates on unresolved weaknesses, and the model is fine-tuned with LoRA modules for efficient adaptation. Experiments on the OSWorld benchmark show average success-rate gains of 11.6 percentage points for EvoCUA-8B and 11.1 for OpenCUA-7B over the corresponding base models, with the specialized students surpassing their teachers in some domains. Because supervision is constructed automatically, the method avoids manual annotation.
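The teacher-student comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Rollout` record, the `detect_weaknesses` function, and the task names are hypothetical, and the verifier's verdict is assumed to be already available as a boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Outcome of one agent attempting one task (hypothetical record type)."""
    task_id: str
    success: bool  # verdict from a success/failure verifier, assumed given


def detect_weaknesses(teacher: list[Rollout], student: list[Rollout]) -> list[str]:
    """Return the ids of tasks the teacher solved but the student failed."""
    teacher_ok = {r.task_id for r in teacher if r.success}
    return sorted(
        r.task_id for r in student if not r.success and r.task_id in teacher_ok
    )


# Illustrative task names only; they do not come from the paper.
teacher_runs = [
    Rollout("rename_sheet", True),
    Rollout("crop_image", True),
    Rollout("merge_pdf", False),
]
student_runs = [
    Rollout("rename_sheet", True),
    Rollout("crop_image", False),
    Rollout("merge_pdf", False),
]

weak_tasks = detect_weaknesses(teacher_runs, student_runs)
print(weak_tasks)
```

Tasks that both agents fail are excluded in this sketch, since the teacher provides no successful trajectory to learn from there; whether the paper treats such tasks the same way is not stated in this summary.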
Key Results
- On the OSWorld dataset, the specialized EvoCUA-8B model trained with LearnWeak achieved an average success rate of 62.24%, up from 50.69%, representing an 11.6 percentage point improvement. Similarly, OpenCUA-7B improved from 37.65% to 48.72%. The gains are consistent across diverse domains such as office applications, system utilities, visual editing, and coding tasks. Multi-round iterative data synthesis outperformed single-pass and weakly supervised baselines, especially in complex tasks. The error-aware DPO training further enhanced behavioral correction, outperforming standard supervised fine-tuning and other offline strategies. Ablation studies confirmed that generating data from the model’s own failure cases yields the best results, with the number of generation rounds optimized around 3-4. Overall, the approach demonstrates robust domain adaptation and significant performance improvements.
- The experimental results highlight that targeted, weak point-driven data synthesis combined with fine-grained behavioral correction substantially outperforms traditional data augmentation and naive fine-tuning. The success rates across multiple domains show consistent improvements, with some tasks like VSCode and Gimp even surpassing the teacher models. The ablation studies reveal that the source of weakness reports (from the model itself) and iterative generation are critical for success. The approach's ability to adapt small models efficiently to specific domains with minimal human annotation marks a significant step forward in scalable AI deployment. These findings suggest that focusing training on unresolved weaknesses is a promising direction for future research.
- The results also demonstrate that error-aware preference optimization (DPO) effectively targets specific failure modes, leading to more precise behavior correction than traditional methods. The multi-round iterative process enables the model to progressively close the performance gap in various software environments. The combination of automated data synthesis and behavioral fine-tuning reduces reliance on manual annotation, making domain specialization more scalable and cost-effective. The experimental validation across multiple software domains confirms the method’s robustness and generalizability, paving the way for deploying small, efficient, domain-adapted agents in real-world applications.
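The reported gains are absolute differences in success rate (percentage points), not relative improvements. The short check below recomputes both from the success rates quoted above; it uses `Decimal` so that rounding is exact. The differences come out to 11.55 and 11.07 points, which round to the 11.6 and 11.1 quoted in the abstract, while the relative improvements over each baseline are considerably larger.

```python
from decimal import Decimal, ROUND_HALF_UP

# Success rates (%) before and after specialization, as quoted in this summary.
reported = {
    "EvoCUA-8B": (Decimal("50.69"), Decimal("62.24")),
    "OpenCUA-7B": (Decimal("37.65"), Decimal("48.72")),
}

gains = {}
for model, (before, after) in reported.items():
    absolute = after - before            # gain in percentage points
    relative = absolute / before * 100   # gain as a percent of the baseline
    gains[model] = (absolute, relative)
    rounded = absolute.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    print(f"{model}: +{rounded} pp absolute, +{relative:.1f}% relative")
```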
Significance
This research addresses a critical bottleneck in deploying small, cost-efficient computer-use agents across diverse software environments. By automating the detection of weaknesses and targeted task synthesis, it significantly reduces reliance on manual annotation, enabling scalable domain adaptation. The integration of error-aware behavioral correction further refines model performance, making small models competitive with larger, proprietary systems. This approach has profound implications for practical AI deployment in edge devices, enterprise automation, and personalized assistants, where resource constraints and privacy considerations demand efficient, adaptable solutions. The methodology also advances the theoretical understanding of self-supervised, weakness-driven learning, contributing to the broader field of autonomous model improvement. Overall, it paves the way for more intelligent, autonomous, and scalable AI systems capable of continuous self-improvement in real-world settings.
Technical Contribution
The paper’s key technical innovations include: 1) an automated weakness detection mechanism based on teacher-student trajectory comparison, eliminating the need for manual annotations; 2) a targeted task synthesis pipeline guided by weakness reports and screenshots, employing both exploitation and exploration strategies; 3) the introduction of an error-aware preference optimization (DPO), which distinguishes between planning and execution errors, enabling more precise behavioral updates; 4) a multi-round iterative data expansion process that progressively focuses on unresolved weaknesses; 5) the use of LoRA modules for parameter-efficient fine-tuning, allowing rapid domain adaptation without catastrophic forgetting. These contributions collectively enable small models to efficiently learn domain-specific skills with minimal supervision, outperforming existing baselines in multi-domain settings.
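To make the preference-optimization contribution concrete, the sketch below implements the standard DPO loss for a single preference pair and then applies a per-error-type weight. Only the DPO formula itself is standard; the weighting scheme, the `ERROR_WEIGHTS` values, and the `error_aware_loss` helper are assumptions made for illustration, since this summary does not give the paper's actual objective.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical weights per error type; NOT taken from the paper.
ERROR_WEIGHTS = {"planning": 1.0, "execution": 0.5}


def error_aware_loss(pairs: list[dict]) -> float:
    """Weighted mean of DPO losses, with each weight chosen by error type."""
    weighted = sum(ERROR_WEIGHTS[p["error_type"]] * dpo_loss(*p["logps"]) for p in pairs)
    return weighted / sum(ERROR_WEIGHTS[p["error_type"]] for p in pairs)


# Each tuple holds (policy chosen, policy rejected, reference chosen, reference rejected)
# log-probabilities; the numbers are made up for illustration.
pairs = [
    {"error_type": "planning", "logps": (-1.0, -3.0, -2.0, -2.0)},
    {"error_type": "execution", "logps": (-2.0, -2.0, -2.0, -2.0)},
]
loss = error_aware_loss(pairs)
print(round(loss, 4))
```

When the policy and the reference agree on both responses, the margin is zero and the per-pair loss equals log 2; the loss falls as the policy prefers the chosen response more strongly than the reference does.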
Novelty
This work is novel in integrating automated, weakness-driven data synthesis with fine-grained behavioral correction for domain adaptation of small CUAs. Unlike prior methods that rely on manual annotation or broad data augmentation, it uses the model's own failure signals to generate targeted training data. The error-aware preference optimization (DPO) objective, which corrects behavior at the step level, goes beyond traditional imitation or reinforcement learning strategies. In addition, the multi-round iterative process, guided by weakness reports derived solely from teacher-student comparisons, offers a scalable route to annotation-free domain specialization.
Limitations
- The effectiveness of weak point detection heavily depends on the strength of the reference teacher agent; if the teacher’s performance is limited, the weak point identification and subsequent data synthesis may be suboptimal, especially in highly complex or novel domains.
- Multi-round iterative generation incurs substantial computational costs, which could hinder scalability in large-scale multi-domain applications or real-time scenarios.
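- The parameter-efficient LoRA fine-tuning, while effective, may face limits on parameter-update efficiency when the model is extremely large or the number of domains is very high, which restricts its use with very large models or many-domain settings.
- The method has so far been validated mainly in static environments; the model's capacity for continual learning and adaptation under changing task requirements and user behavior needs further study.
- Future work will need to combine techniques such as meta-learning and self-supervision to improve generalization and adaptability and to address continual learning in dynamic environments.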
Future Work
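Future research directions include: introducing stronger self-supervision mechanisms to improve the accuracy and generalization of weakness detection; exploring more efficient multi-round generation strategies to reduce computational cost; incorporating meta-learning so that models adapt quickly to new domains; extending the approach to multimodal and multi-task settings; and validating robustness and continual learning in real, dynamic environments. Together these efforts would support broader use of automated, personalized, and more autonomous agents.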
AI Executive Summary
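As AI evolves rapidly, intelligent agents play a growing role in task automation. A central question, especially in multi-domain, multi-task settings, is how to make small models adapt efficiently to different software and operating scenarios. Traditional methods depend on large amounts of manual annotation, which is costly and hard to scale, and this limits real-world adoption. To address this, the paper proposes LearnWeak, a framework for automated domain adaptation without human intervention. Its core idea is to use a stronger reference agent and compare teacher and student trajectories to automatically detect the student's weaknesses in a given domain. Guided by screenshots and environment information, it then synthesizes targeted tasks over multiple iterative rounds, gradually expanding the training set. Training uses error-aware preference optimization (DPO), which distinguishes planning errors from execution errors to correct behavior at a fine granularity. The pipeline raises success rates across 8 software domains by 11.6 percentage points on average, and on some tasks the student even surpasses the teacher model. The experiments indicate that weakness-driven training outperforms traditional data augmentation and untargeted generation while sharply reducing manual cost. Looking ahead, stronger reference models and more efficient generation could extend the approach to more complex environments and support personalized, self-improving agents.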
Deep Analysis
Background
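With the rapid progress of deep learning and reinforcement learning, intelligent agents are being applied to an expanding range of automation tasks. Early work focused on large pretrained models (such as the GPT series and Claude), which perform well on general tasks but still hit performance bottlenecks in specific software domains. More recently, small task-tuned models (such as EvoCUA and OpenCUA) have attracted attention because they offer fast inference and low deployment cost and suit edge devices. However, existing methods generally depend on large amounts of manually annotated data and adapt poorly across domains, which makes efficient transfer difficult. Some studies apply reinforcement learning or transfer learning, but they still face data scarcity and weak generalization. Automated data synthesis and annotation-free learning have emerged to address high annotation cost and scarce data. Even so, generating data and correcting behavior for a model's specific weaknesses remains an open difficulty. Against this background, the paper proposes a weakness-driven automated domain adaptation framework that combines multi-round iteration with error-aware optimization, filling a gap in annotation-free, targeted training.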
Core Problem
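Small computer-use agents (CUAs) show marked performance differences across domains and tasks, with pronounced weaknesses in particular software applications. Traditional fine-tuning struggles to identify and correct these weaknesses efficiently, and it relies on large amounts of manual annotation that is costly and hard to scale. Existing automated data generation strategies mostly explore blindly rather than targeting the model's specific deficiencies, so the training data is poorly targeted and the performance gains are limited. In addition, a model can fail at either the planning level or the execution level, and telling the two apart so that each can be corrected is key to improving performance. Solving this requires both an efficient weakness detection mechanism and a fine-grained behavior correction strategy, so that the model adapts quickly and keeps improving. These challenges limit the real-world adoption of small CUAs and call for a solution that is both automated and efficient.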
Innovation
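The paper's main innovations are: 1) an annotation-free weakness detection mechanism that automatically finds model deficiencies by comparing teacher and student trajectories; 2) a targeted task synthesis strategy based on weakness reports, combining screenshot guidance with multi-round iteration to expand the training data and improve its focus and diversity; 3) error-aware preference optimization (DPO), which distinguishes planning errors from execution errors during training for fine-grained behavior correction and outperforms traditional imitation learning and reinforcement learning approaches; 4) a multi-round, weakness-driven data expansion process that progressively approaches the performance ceiling of the target domain; 5) LoRA modules for parameter-efficient fine-tuning, preserving pretrained capabilities while adapting quickly to new domains. The overall framework draws on recent techniques from reinforcement learning, transfer learning, and automated data generation, and substantially improves the domain adaptation of small models. Together these innovations advance research on automated multi-task, multi-domain agents.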
Methodology
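- Goal: Use teacher-student trajectory comparison to automatically detect the model's weaknesses in a specific domain, generate targeted tasks, and improve performance.
- Weakness detection: Compare the trajectories of the (stronger) teacher agent and the student model in the same environment, use a verifier (V) to judge success or failure, and extract the failed tasks together with reports on the causes of failure.
- Task synthesis: Based on the weakness reports, together with screenshots and environment metadata, use two strategies: weakness-driven synthesis (tasks that target the weaknesses) and exploration-driven synthesis (tasks that cover unexplored regions), expanding the training set over multiple rounds.
- Data filtering: Filter over multiple rounds to focus on unresolved weaknesses while keeping the data targeted and diverse.
- Training: Use preference optimization (DPO) to distinguish planning errors from execution errors at the behavior level, dynamically adjusting the training objective to strengthen correction on the weaknesses.
- Parameter fine-tuning: Use LoRA modules to update only a subset of parameters, preserving pretrained capabilities and improving training efficiency.
- Evaluation: On OSWorld, compare the effect of different numbers of generation rounds, teacher strategies, and training objectives to validate the method.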
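The steps listed above form a loop: detect weaknesses, synthesize tasks for them, train, and repeat on whatever remains unresolved. The toy simulation below shows only that control flow. Everything in it is an assumption for illustration: skills are represented as plain strings, `run_rounds` is a hypothetical name, and "training" is simulated by a coin flip rather than by any real model update.

```python
import random


def run_rounds(student_skills: set[str], teacher_skills: set[str],
               num_rounds: int = 3, tasks_per_weakness: int = 2,
               seed: int = 0) -> list[dict]:
    """Toy loop: detect weaknesses, synthesize tasks for them, then 'train'.

    Training is simulated: each synthesized task resolves its target
    weakness with probability 0.5 (an arbitrary illustrative value).
    """
    rng = random.Random(seed)
    skills = set(student_skills)
    history = []
    for round_idx in range(1, num_rounds + 1):
        # 1) Weakness detection: what the teacher can do but the student cannot.
        weaknesses = sorted(teacher_skills - skills)
        if not weaknesses:
            break  # nothing left to fix
        # 2) Task synthesis: several tasks aimed at each unresolved weakness.
        tasks = [(w, i) for w in weaknesses for i in range(tasks_per_weakness)]
        # 3) Training (simulated): a task may or may not resolve its weakness.
        fixed = {w for w, _ in tasks if rng.random() < 0.5}
        skills |= fixed
        history.append({"round": round_idx, "weaknesses": weaknesses, "fixed": sorted(fixed)})
    return history


history = run_rounds(
    student_skills={"open_file"},
    teacher_skills={"open_file", "crop_image", "export_pdf"},
)
for entry in history:
    print(entry)
```

Because each round only targets weaknesses that are still open, the set of weaknesses can only shrink from one round to the next, which is the property the multi-round design relies on.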
Experiments
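- Dataset: OSWorld, covering office software, system tools, visual editing, and coding tasks, with training and testing on 8 software domains.
- Baselines: large models (Claude, Kimi), small models (EvoCUA, OpenCUA), and their fine-tuned versions.
- Training strategies: compare traditional supervised fine-tuning (SFT), preference optimization (DPO), and the proposed weakness-driven multi-round generation method.
- Metric: success rate (the fraction of tasks completed successfully), compared across domains and models.
- Hyperparameters: the number of generation rounds N is set to 3-5; the preference temperature β controls the strength of correction; the fraction of parameters updated by LoRA controls training cost.
- Ablations: test how the source of the weakness reports, the number of generation rounds, and the training objective affect performance.
- Analysis: repeated experiments confirm that multi-round iteration combined with the weakness-driven strategy clearly outperforms single-round generation, blind exploration, and traditional fine-tuning.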
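The success-rate metric named above can be computed as shown below. The sketch reports a per-domain rate and an unweighted average over domains; whether the paper averages over domains or over individual tasks is not stated in this summary, so the macro average is an assumption. The outcomes in `demo` are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def success_rates(results: list[tuple[str, bool]]) -> tuple[dict[str, float], float]:
    """Return per-domain success rates (%) and their unweighted (macro) average."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for domain, ok in results:
        totals[domain] += 1
        wins[domain] += int(ok)
    per_domain = {d: 100.0 * wins[d] / totals[d] for d in totals}
    macro = sum(per_domain.values()) / len(per_domain)
    return per_domain, macro


# Made-up (domain, success) outcomes for illustration only.
demo = [("vscode", True), ("vscode", False), ("gimp", True), ("gimp", True)]
per_domain, macro = success_rates(demo)
print(per_domain, macro)
```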
Results
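- On the OSWorld test set, EvoCUA-8B fine-tuned with LearnWeak raises its average success rate across 8 software domains from 50.69% to 62.24%, a gain of 11.6 percentage points; on some domains, such as VSCode and Gimp, it even surpasses the teacher model.
- OpenCUA-7B also improves, from 37.65% to 48.72%, showing good cross-domain adaptation.
- Multi-round iterative data generation clearly outperforms single-round and non-weakness-driven methods, with stronger correction on complex tasks in particular.
- Error-aware preference optimization (DPO) outperforms traditional SFT and other offline strategies for behavior correction, improving both planning and execution.
- Ablation studies show that weakness reports drawn from the model's own failure cases work best, and that performance peaks at an intermediate number of generation rounds N, supporting the multi-round design.
- Overall, the method delivers clear gains in multi-task, multi-domain settings, supporting its potential for practical use.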
Applications
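- Immediate applications: The technique can be used to build smart assistants for edge devices that adapt automatically to each user's habits. Enterprises can use it to customize industry-specific software assistants (for finance, design, or customer service, for example) without large-scale manual annotation. In education, it could be used to generate personalized learning assistants that adapt to different students' learning styles.
- Long-term vision: Weakness-driven automated fine-tuning could give agents stronger autonomous learning and let them adapt to changing task environments. As models scale and algorithms improve, this may enable zero-annotation transfer in more complex real-world scenarios and generalization across modalities and tasks, moving step by step toward general-purpose intelligent systems.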
Limitations & Outlook
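- The method depends on the performance of the reference agent. If the reference model is not capable enough, weakness detection may be inaccurate, which degrades data synthesis and limits adaptation to extremely complex or novel domains.
- Multi-round iterative generation makes the data more targeted but also raises computational cost; training time and resource consumption are high, especially in large-scale multi-domain settings.
- Fine-tuning with LoRA modules is efficient, but with very many domains or very large models it may hit a parameter-update bottleneck that limits scalability.
- The method has so far been validated mainly in static environments; the model's capacity for continual learning and adaptation under changing task requirements and user behavior needs further study.
- Future work will need to combine meta-learning and self-supervision mechanisms to improve generalization and adaptability.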
Plain Language Accessible to non-experts
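Imagine a very bright student who is studying several subjects, such as math, English, and science. After each study session you notice that he struggles in certain areas: he keeps getting math calculations wrong, or his English listening is weak. To help him improve, you look at which kinds of questions he gets wrong and design practice exercises aimed at exactly those weak spots. Over time the weak spots shrink and his grades improve. The process amounts to finding the problems and then practicing them in a targeted way. The same idea works for computer models: let the model find out where it does badly, then have it practice those areas until it gets better. Nobody has to spell out every detail by hand; the model learns on its own and keeps getting stronger.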
ELI14 Explained like you're 14
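Imagine you have a really capable robot assistant that handles lots of computer tasks for you, like opening files, writing emails, and organizing pictures. But it keeps making mistakes in certain programs: its layouts in Word never come out right, or it is clumsy at building tables in Excel. To make it better, you watch where it goes wrong and design practice drills for exactly those error-prone operations. Each time it makes a mistake, you tell it what went wrong so it can fix it. After enough practice it becomes very skilled in those programs. It is like teaching a friend a new skill: find the problem first, then focus practice on it. The paper's method works the same way: the computer finds its own weak spots and practices on them automatically, without anyone labeling every detail by hand.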
Glossary
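Computer-Use Agent (CUA)
An agent policy that completes tasks in a software environment by perceiving the screen and operating the interface; the setting is framed as a partially observable Markov decision process (POMDP).
The core agent described in the paper, used to automate software operation tasks.
Weakness Detection
Automatically detecting a model's deficiencies or error types in a specific domain by comparing how the teacher agent and the student model perform on the same tasks.
The key step that guides targeted task synthesis and model fine-tuning.
Preference Optimization (DPO)
A preference-learning-based training method; as used here, it distinguishes planning errors from execution errors to correct behavior at a fine granularity.
Used in training to strengthen correction on specific weaknesses.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that adapts a model by inserting low-rank matrices, preserving pretrained capabilities while adapting quickly to new tasks.
Used in the paper for efficient parameter updates in multi-domain fine-tuning.
Multi-round Iteration
A loop that repeats weakness detection, task synthesis, and model training to progressively approach the performance ceiling of the target domain.
The core strategy for data generation and model fine-tuning.
OSWorld
An evaluation benchmark covering a range of desktop applications and operating-system tools, used to validate CUA performance across domains.
The main dataset for the paper's experiments.
Behavioral Correction
Adjusting the model's behavior in a targeted way, by distinguishing planning errors from execution errors, to fix mistakes on specific tasks.
A key training objective.
Automated Task Synthesis
Automatically generating targeted training tasks, guided by weakness reports and screenshots, without manual annotation.
The core data generation technique.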
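The LoRA entry in the glossary above describes fine-tuning through low-rank matrices. The sketch below counts trainable parameters for one weight matrix to show why this is cheap: LoRA keeps the original matrix frozen and learns two thin factors instead. The layer size and rank used here are hypothetical values chosen for illustration; the paper's actual LoRA configuration is not given in this summary.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full_update_params, lora_params) for one weight matrix.

    LoRA keeps W (d_out x d_in) frozen and learns B (d_out x rank) and
    A (rank x d_in), so the effective weight is W + (alpha / rank) * B @ A.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical layer size and rank, not the paper's settings.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For a square 4096-wide layer at rank 16, the trainable parameters drop from about 16.8 million to about 131 thousand, under one percent of the full matrix.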
Open Questions Unanswered questions from this research
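- 1 How can weakness detection stay accurate in extremely complex or novel domains? A reference agent with limited ability may constrain detection; future work may need self-supervision or meta-learning to improve generalization.
- 2 Multi-round iterative generation is computationally expensive. How can training time and resource use be reduced without losing effectiveness? More efficient generation strategies or model compression are possible directions.
- 3 Scalability of fine-tuning: with very large models or many domains, parameter updates may become a bottleneck. How can a more flexible fine-tuning mechanism be designed?
- 4 Adapting to dynamic environments: the method has mainly been validated in static settings. How can continual learning and self-optimization be achieved as task requirements and user behavior change?
- 5 Cross-modal, multi-task generalization: how well does the approach transfer across multimodal inputs and multi-task settings, and can it support more general-purpose systems?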
Applications
Immediate Applications
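Smart assistants on edge devices
LearnWeak could be used to build personalized, automated software-operation assistants on edge devices, reducing manual tuning and improving the user experience. Only a small number of examples is needed to adapt quickly to different user habits and software environments.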
Long-term Vision
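Autonomous learning and continual adaptation
In the future, models may be able to identify new weaknesses on their own in changing environments and generate training tasks automatically, achieving continual learning and self-optimization.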
Abstract
Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.