Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

TL;DR

The LearnWeak framework uses a stronger reference agent to identify a small model's weaknesses, synthesizes targeted tasks, and improves small CUAs on OSWorld by an average of 11.6 percentage points (EvoCUA-8B) and 11.1 percentage points (OpenCUA-7B) across 8 domains.

cs.LG Advanced 2026-05-28

Suji Kim Kangsan Kim Sung Ju Hwang


AI Reinforcement Learning Domain Adaptation Automated Data Synthesis Model Fine-tuning

Key Findings

Methodology

This paper introduces the LearnWeak framework, which combines weakness detection, targeted task synthesis, and behavior correction. A stronger teacher agent is run alongside the student model, and weaknesses are detected automatically by comparing their trajectories and verifying success or failure. From the resulting weakness reports, the system synthesizes domain-specific tasks using both weakness-focused and exploration strategies, guided by screenshots and environment metadata. During training, an error-aware preference optimization objective (DPO) distinguishes planning errors from execution errors, enabling fine-grained behavioral updates. Data generation proceeds in multiple rounds, each focusing on weaknesses that remain unresolved, and the model is fine-tuned with LoRA modules for efficient adaptation. On the OSWorld benchmark, the approach improves the average success rate by 11.6 percentage points for EvoCUA-8B and 11.1 percentage points for OpenCUA-7B, with models surpassing their teachers in several domains. The method requires no manual annotation and improves domain-specific performance.
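The weakness-detection step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Rollout` structure, the string-valued actions, and the rule "teacher succeeded and student failed" are all assumptions made for the example, and real GUI actions would need a fuzzier comparison than string equality.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rollout:
    """One agent attempt at a task (hypothetical structure)."""
    task_id: str
    actions: List[str]
    success: bool  # verdict from a verifier, assumed to be given


def first_divergence(teacher_actions: List[str], student_actions: List[str]) -> int:
    """Index of the first step where the two action sequences differ."""
    for i, (t, s) in enumerate(zip(teacher_actions, student_actions)):
        if t != s:
            return i
    # One sequence is a prefix of the other: they diverge where the shorter ends.
    return min(len(teacher_actions), len(student_actions))


def weakness_reports(teacher: Dict[str, Rollout], student: Dict[str, Rollout]) -> List[dict]:
    """Report tasks the teacher solved but the student did not."""
    reports = []
    for task_id, t in teacher.items():
        s = student.get(task_id)
        if s is None or not t.success or s.success:
            continue  # only teacher-success / student-failure pairs are informative here
        step = first_divergence(t.actions, s.actions)
        reports.append({
            "task_id": task_id,
            "diverges_at_step": step,
            "teacher_action": t.actions[step] if step < len(t.actions) else None,
            "student_action": s.actions[step] if step < len(s.actions) else None,
        })
    return reports
```

Each report pairs a failed task with the step where the student left the teacher's path, which is the kind of signal a task synthesizer could condition on.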

Key Results

On the OSWorld dataset, the specialized EvoCUA-8B model trained with LearnWeak achieved an average success rate of 62.24%, up from 50.69%, representing an 11.6 percentage point improvement. Similarly, OpenCUA-7B improved from 37.65% to 48.72%. The gains are consistent across diverse domains such as office applications, system utilities, visual editing, and coding tasks. Multi-round iterative data synthesis outperformed single-pass and weakly supervised baselines, especially in complex tasks. The error-aware DPO training further enhanced behavioral correction, outperforming standard supervised fine-tuning and other offline strategies. Ablation studies confirmed that generating data from the model’s own failure cases yields the best results, with the number of generation rounds optimized around 3-4. Overall, the approach demonstrates robust domain adaptation and significant performance improvements.
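The reported gains are absolute differences in success rate (percentage points), not relative percentages. The short check below recomputes both from the figures quoted above; the helper name `gains` is just for illustration.

```python
def gains(before: float, after: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = after - before
    relative_pct = 100.0 * absolute_pp / before
    return absolute_pp, relative_pct


for name, before, after in [("EvoCUA-8B", 50.69, 62.24), ("OpenCUA-7B", 37.65, 48.72)]:
    pp, rel = gains(before, after)
    print(f"{name}: +{pp:.2f} pp absolute, +{rel:.1f}% relative")
```

The absolute gains come out at about 11.55 and 11.07 points, matching the rounded 11.6 and 11.1 reported in the abstract, while the relative improvements are roughly twice as large.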
The experimental results highlight that targeted, weak point-driven data synthesis combined with fine-grained behavioral correction substantially outperforms traditional data augmentation and naive fine-tuning. The success rates across multiple domains show consistent improvements, with some tasks like VSCode and Gimp even surpassing the teacher models. The ablation studies reveal that the source of weakness reports (from the model itself) and iterative generation are critical for success. The approach's ability to adapt small models efficiently to specific domains with minimal human annotation marks a significant step forward in scalable AI deployment. These findings suggest that focusing training on unresolved weaknesses is a promising direction for future research.
The results also demonstrate that error-aware preference optimization (DPO) effectively targets specific failure modes, leading to more precise behavior correction than traditional methods. The multi-round iterative process enables the model to progressively close the performance gap in various software environments. The combination of automated data synthesis and behavioral fine-tuning reduces reliance on manual annotation, making domain specialization more scalable and cost-effective. The experimental validation across multiple software domains confirms the method’s robustness and generalizability, paving the way for deploying small, efficient, domain-adapted agents in real-world applications.

Significance

This research addresses a critical bottleneck in deploying small, cost-efficient computer-use agents across diverse software environments. By automating the detection of weaknesses and targeted task synthesis, it significantly reduces reliance on manual annotation, enabling scalable domain adaptation. The integration of error-aware behavioral correction further refines model performance, making small models competitive with larger, proprietary systems. This approach has profound implications for practical AI deployment in edge devices, enterprise automation, and personalized assistants, where resource constraints and privacy considerations demand efficient, adaptable solutions. The methodology also advances the theoretical understanding of self-supervised, weakness-driven learning, contributing to the broader field of autonomous model improvement. Overall, it paves the way for more intelligent, autonomous, and scalable AI systems capable of continuous self-improvement in real-world settings.

Technical Contribution

The paper’s key technical innovations include: 1) an automated weakness detection mechanism based on teacher-student trajectory comparison, eliminating the need for manual annotations; 2) a targeted task synthesis pipeline guided by weakness reports and screenshots, employing both exploitation and exploration strategies; 3) the introduction of an error-aware preference optimization (DPO), which distinguishes between planning and execution errors, enabling more precise behavioral updates; 4) a multi-round iterative data expansion process that progressively focuses on unresolved weaknesses; 5) the use of LoRA modules for parameter-efficient fine-tuning, allowing rapid domain adaptation without catastrophic forgetting. These contributions collectively enable small models to efficiently learn domain-specific skills with minimal supervision, outperforming existing baselines in multi-domain settings.
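This summary does not give the exact form of the error-aware objective. As a point of reference, the sketch below implements the standard DPO loss for one preferred/rejected pair and then aggregates pairs with a per-error-type weight. The weighting scheme, the `error_type` labels, and the `ERROR_WEIGHTS` values are purely illustrative assumptions, not the paper's formulation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return softplus(-margin)  # -log(sigmoid(m)) == softplus(-m)


# Hypothetical weights: how strongly each error category contributes.
ERROR_WEIGHTS = {"planning": 1.0, "execution": 1.0}


def error_aware_loss(pairs, beta=0.1, weights=ERROR_WEIGHTS):
    """Weighted mean of DPO losses over pairs tagged with an assumed error type."""
    total, norm = 0.0, 0.0
    for p in pairs:
        w = weights[p["error_type"]]
        total += w * dpo_pair_loss(p["logp_w"], p["logp_l"],
                                   p["ref_logp_w"], p["ref_logp_l"], beta)
        norm += w
    return total / norm
```

With equal weights this reduces to plain DPO averaged over pairs; separating the two categories simply makes it possible to treat planning and execution failures differently.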

Novelty

This work is novel in its integration of automated, annotation-free data synthesis with fine-grained behavioral correction for domain adaptation of small CUAs. Unlike prior methods that rely on manual annotation or broad data augmentation, this approach leverages the model’s own failure signals to generate targeted training data. The introduction of error-aware preference optimization (DPO) for behavior correction at the step level is a significant advance over traditional imitation or reinforcement learning strategies. Additionally, the multi-round iterative process, guided by weakness reports derived solely from model comparisons, offers a new route to scalable domain specialization without human labels. Together, these innovations advance autonomous, data-efficient model adaptation in complex software environments.

Limitations

The effectiveness of weak point detection heavily depends on the strength of the reference teacher agent; if the teacher’s performance is limited, the weak point identification and subsequent data synthesis may be suboptimal, especially in highly complex or novel domains.
Multi-round iterative generation incurs substantial computational costs, which could hinder scalability in large-scale multi-domain applications or real-time scenarios.
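Parameter-efficient LoRA fine-tuning, while effective, may hit parameter-update efficiency limits when the model is extremely large or the number of domains is very high, which restricts its use in very-large-model or many-domain settings.
The method has so far been validated mainly in static environments; continual learning and adaptation under changing task requirements and user behavior still need further study.
Future work will need to combine techniques such as meta-learning and self-supervision to improve generalization and adaptivity and to address continual learning in dynamic environments.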

Future Work

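Future research directions include: introducing stronger self-supervision mechanisms to improve the accuracy and generalization of weakness detection; exploring more efficient multi-round generation strategies to reduce computational cost; incorporating meta-learning for rapid adaptation to new domains; extending the approach to multimodal, multi-task settings; and validating robustness and continual learning in real, dynamic environments. These efforts would support broader use of automated, personalized agents.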

AI Executive Summary

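As artificial intelligence evolves rapidly, intelligent agents play a growing role in task automation. In multi-domain, multi-task settings in particular, how to make small models adapt efficiently to different software and usage scenarios has become a research focus. Traditional methods rely on large amounts of human annotation, which is costly and hard to scale, limiting real-world adoption. To address this, the paper proposes the LearnWeak framework, which aims at automated domain adaptation without human intervention. Its core idea is to use a stronger reference agent and compare teacher and student trajectories to automatically detect the model's weaknesses in a given domain. It then combines screenshots and environment information with a multi-round iterative strategy to synthesize targeted tasks and progressively expand the training set. Training introduces error-aware preference optimization (DPO), which distinguishes planning errors from execution errors to correct behavior at a fine granularity. This pipeline raises success rates across 8 software domains by 11.6 percentage points on average, and on some tasks the student even surpasses the teacher model. The experiments indicate that weakness-driven training outperforms traditional data augmentation and untargeted generation while greatly reducing human cost. Looking ahead, stronger reference models and more efficient generation could extend the approach to more complex environments and provide a foundation for personalized, self-improving agents.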

Deep Analysis

Background

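With the rapid development of deep learning and reinforcement learning, intelligent agents are being applied to an ever wider range of automation tasks. Early work centered on large pretrained models (such as the GPT series and Claude), which perform well on general tasks but still hit performance bottlenecks in specific software domains. More recently, small task-tuned models (such as EvoCUA and OpenCUA) have become a research focus because they offer fast inference and low deployment cost and suit edge devices. However, existing methods generally depend on large amounts of human-annotated data and adapt poorly across domains, which makes efficient transfer difficult. Some studies try reinforcement learning or transfer learning, but still face insufficient data and weak generalization. Automated data synthesis and annotation-free learning have emerged as ways to address high annotation cost and data scarcity. Even so, generating data and correcting behavior with respect to a model's specific weaknesses remains an open difficulty. Against this background, the paper proposes a weakness-identification-based framework for automated domain adaptation that combines multi-round iteration with error-aware optimization, filling a gap in annotation-free, targeted training.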

Core Problem

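Small computer-use agents (CUAs) show marked performance differences across domains and tasks, with large weaknesses in particular software applications. Traditional fine-tuning struggles to identify and correct these weaknesses efficiently and depends on large amounts of human annotation, which is costly and hard to scale. Existing automated data-generation strategies mostly explore blindly and do not optimize for the model's specific deficiencies, so the training data is poorly targeted and performance gains suffer. In addition, a model can fail at both the planning level and the execution level, and telling these apart so that each can be corrected specifically is key to improving performance. Solving the problem therefore requires both an efficient weakness-detection mechanism and a fine-grained behavior-correction strategy that together allow fast adaptation and continued improvement. These challenges limit the practical adoption of small CUAs and call for a solution that is both automated and efficient.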

Innovation

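The paper's main innovations are: 1) an annotation-free weakness-identification mechanism that detects model deficiencies automatically by comparing teacher and student trajectories; 2) a targeted task-synthesis strategy based on weakness reports, combining screenshot guidance with multi-round iteration to expand the training data with better targeting and diversity; 3) error-aware preference optimization (DPO), which distinguishes planning from execution errors during training for fine-grained behavior correction and outperforms traditional imitation learning and reinforcement learning approaches; 4) a multi-round, weakness-driven data-expansion process that progressively approaches the performance ceiling of the target domain; 5) LoRA modules for parameter-efficient fine-tuning, which preserve pretrained capabilities while adapting quickly to new domains. The overall framework draws on recent techniques from reinforcement learning, transfer learning, and automated data generation, and substantially improves the domain-adaptation ability of small models.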
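For readers unfamiliar with LoRA, the idea is to keep a pretrained weight matrix frozen and learn only a low-rank update, so the effective weight is W0 + (alpha / r) * B * A. The sketch below shows that computation in plain Python on tiny matrices; the shapes, the `alpha` scaling convention, and the helper names are assumptions for illustration and say nothing about the configuration used in the paper.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(w0, x)             # W0 x, with W0 of shape d x k (frozen)
    update = matvec(b, matvec(a, x)) # A is r x k, B is d x r (both trainable)
    scale = alpha / r
    return [h + scale * u for h, u in zip(base, update)]


def param_counts(d, k, r):
    """(full fine-tuning parameters, LoRA parameters) for one d x k layer."""
    return d * k, r * (d + k)


full, lora = param_counts(4096, 4096, 8)
print(f"LoRA trains {lora} of {full} weights ({100 * lora / full:.2f}%)")
```

Because B is typically initialized to zero, the adapted layer starts out identical to the pretrained one, which is one reason the technique tends to preserve pretrained behavior early in training.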

Methodology

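• Goal: use teacher-student trajectory comparison to automatically detect the model's weaknesses in a given domain, generate targeted tasks, and improve performance.
• Weakness detection: compare the trajectories of the (stronger) teacher agent and the student model in the same environment, use a verifier (V) to judge success or failure, and extract the failed tasks together with reports on the causes of failure.
• Task synthesis: based on the weakness reports, and using screenshots and environment metadata, apply two strategies: weakness-driven synthesis (tasks that target known weaknesses) and exploration-driven synthesis (tasks that cover unexplored areas), expanding the training set over multiple rounds.
• Data filtering: filter over multiple rounds to focus on weaknesses that remain unresolved, keeping the data both targeted and diverse.
• Training: use preference optimization (DPO) to distinguish planning errors from execution errors at the behavior level, adjusting the training objective accordingly to strengthen correction on weak points.
• Parameter-efficient fine-tuning: use LoRA modules to update only a small set of parameters, preserving pretrained capabilities and improving training efficiency.
• Evaluation: on OSWorld, compare the effect of different numbers of generation rounds, teacher strategies, and training objectives to validate the method.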
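The multi-round loop in the list above can be sketched as a small driver that keeps a set of unresolved weaknesses and stops once it is empty or the round budget is spent. Everything here is a toy stand-in: `synthesize` and `train_and_eval` are hypothetical callbacks, and the rule that a weakness is resolved after two practice tasks exists only to make the example runnable.

```python
from collections import Counter


def run_rounds(initial_weaknesses, synthesize, train_and_eval, max_rounds=3):
    """Iteratively synthesize tasks for weaknesses that are still unresolved."""
    unresolved = set(initial_weaknesses)
    dataset = []
    for _ in range(max_rounds):
        if not unresolved:
            break  # nothing left to target
        dataset.extend(synthesize(sorted(unresolved)))
        resolved = train_and_eval(dataset)  # weaknesses the retrained student now passes
        unresolved -= resolved
    return dataset, unresolved


def toy_synthesize(weaknesses):
    """Stand-in task generator: one practice task per open weakness."""
    return [f"practice:{w}" for w in weaknesses]


def toy_train_and_eval(dataset):
    """Stand-in trainer: a weakness counts as resolved after two practice tasks."""
    counts = Counter(task.split(":", 1)[1] for task in dataset)
    return {w for w, c in counts.items() if c >= 2}


data, remaining = run_rounds({"pivot_table", "layer_mask"}, toy_synthesize, toy_train_and_eval)
print(len(data), remaining)
```

The point of the structure is that later rounds spend their generation budget only on what earlier rounds failed to fix, which is the behavior the list attributes to the filtering step.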

Experiments

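• Dataset: OSWorld, covering office software, system tools, visual editing, and coding tasks, with training and testing on 8 software domains.
• Baseline models: large models (Claude, Kimi), small models (EvoCUA, OpenCUA), and their fine-tuned variants.
• Training strategies: standard supervised fine-tuning (SFT), preference optimization (DPO), and the proposed weakness-driven multi-round generation method are compared.
• Evaluation metric: success rate (the fraction of tasks completed successfully), compared across domains and models.
• Hyperparameters: the number of generation rounds N is set to 3-5; the preference temperature β controls the strength of correction; the fraction of parameters updated by LoRA controls training cost.
• Ablations: the effects of the weakness-report source, the number of generation rounds, and the training objective are tested.
• Analysis: repeated experiments confirm that multi-round iteration combined with the weakness-driven strategy clearly outperforms single-round generation, blind exploration, and standard fine-tuning.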
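The success-rate metric above is simple, but how per-domain numbers are averaged matters when domains have different task counts. The sketch below computes both a per-domain (macro) average and a pooled (micro) average on made-up outcomes; which of the two the paper reports is not stated in this summary, and the domain names and results are invented for the example.

```python
def success_rate(outcomes):
    """Percentage of tasks completed successfully."""
    return 100.0 * sum(outcomes) / len(outcomes)


def macro_and_micro(per_domain):
    """Return (per-domain rates, mean of domain rates, pooled rate over all tasks)."""
    rates = {domain: success_rate(o) for domain, o in per_domain.items()}
    macro = sum(rates.values()) / len(rates)
    pooled = [ok for outcomes in per_domain.values() for ok in outcomes]
    return rates, macro, success_rate(pooled)


# Invented outcomes for two hypothetical domains of different size.
toy = {"editor": [True, False], "spreadsheet": [True, True, True, False]}
rates, macro, micro = macro_and_micro(toy)
print(rates, round(macro, 2), round(micro, 2))
```

On this toy input the macro average is 62.5 while the pooled rate is about 66.67, so the two conventions should not be mixed when comparing reported numbers.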

Results

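• On the OSWorld test set, EvoCUA-8B fine-tuned with LearnWeak raises its average success rate across 8 software domains from 50.69% to 62.24%, a gain of 11.6 percentage points; in some domains, such as VSCode and Gimp, it even surpasses the teacher model.
• OpenCUA-7B also improves, from 37.65% to 48.72%, showing good cross-domain adaptability.
• Multi-round iterative data generation clearly outperforms single-round and non-weakness-driven methods, with stronger correction on complex tasks in particular.
• Error-aware preference optimization (DPO) outperforms standard SFT and other offline strategies for behavior correction, improving performance at both the planning and execution levels.
• Ablation studies show that weakness reports derived from the model's own failure cases work best, and that performance peaks at an intermediate number of generation rounds N, supporting the value of multi-round optimization.
• Overall, the method delivers substantial gains in multi-task, multi-domain settings, indicating its potential for practical use.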

Applications

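• Immediate applications: the technique can be used to build assistants for edge devices that adapt automatically to different users' habits. Enterprises can use it to customize industry-specific software assistants, for example for finance, design, or customer service, without large amounts of human annotation. In education, it could be used to generate personalized learning assistants that adapt to different students' learning styles.
• Long-term vision: weakness-driven automated fine-tuning could give agents stronger self-learning abilities and let them adapt to changing task environments. As models scale and algorithms improve, this may enable annotation-free transfer in more complex real-world scenarios and generalization across modalities and tasks, moving toward more general-purpose systems.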

Limitations & Outlook

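• The method depends on the performance of the reference agent. If the reference model is not capable enough, weakness identification may be inaccurate, which degrades data synthesis and limits adaptation in highly complex or novel domains.
• Multi-round iterative generation improves data targeting but increases computational cost; training time and resource consumption are high in large-scale, multi-domain settings.
• Fine-tuning with LoRA modules is efficient, but may face parameter-update bottlenecks with very many domains or very large models, which limits scalability.
• The method has so far been validated mainly in static environments; continual learning and adaptation under changing task requirements and user behavior still need further study.
• Future work will need to combine meta-learning and self-supervision mechanisms to improve generalization and adaptivity.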

Plain Language Accessible to non-experts

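Imagine you have a very bright student who is studying different subjects, such as math, English, and science. After each study session you notice that he does poorly in certain areas: he keeps miscalculating math problems, or struggles with English listening. To help him improve, you look at which kinds of questions he gets wrong and then design practice exercises aimed specifically at those weak spots. Over time the weak spots shrink and his grades improve. The process amounts to finding the problems and then practicing them in a targeted way. The same idea can be applied to a computer model: find out where it does badly, then have it practice exactly those things so it steadily improves. No human has to spell out every detail; the model learns on its own and keeps getting better.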

ELI14 Explained like you're 14

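Imagine you have a really capable robot assistant that can do lots of computer tasks for you, like opening files, writing emails, and organizing pictures. But the robot often makes mistakes in certain programs: it can never get the layout right in Word, or it is clumsy at building tables in Excel. To make it better, you watch where it goes wrong and then design exercises so it can practice those error-prone operations again and again. Each time it makes a mistake, you tell it what went wrong so it can fix it. After enough practice, it becomes very skilled in those programs, even better than before. It is like teaching a friend a new skill: first find the problem, then practice it intensively, and gradually improve. The paper's method works the same way: the computer finds its own weaknesses and practices automatically, so it gets smarter without people having to label every detail.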

Glossary

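Computer-Use Agent (CUA)

An agent policy that completes tasks in software environments by perceiving the screen and operating the interface; it is modeled as a partially observable Markov decision process (POMDP).

The core agent described in the paper, used to automate software operation tasks.

Weakness Detection

Automatically detecting a model's deficiencies or error types in a given domain by comparing the performance of a teacher agent and the student model on the same tasks.

A key step that guides targeted task synthesis and model fine-tuning.

Preference Optimization (DPO)

A training method based on preference learning that distinguishes planning errors from execution errors to correct behavior at a fine granularity.

Used during training to strengthen the model's correction on specific weaknesses.

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning technique that adapts a model by inserting low-rank matrices, preserving pretrained capabilities while adapting quickly to new tasks.

Used in the paper for efficient parameter updates in multi-domain fine-tuning.

Multi-round Iteration

A loop that repeats weakness detection, task synthesis, and model training to progressively approach the performance ceiling of the target domain.

The core strategy for data generation and model fine-tuning.

OSWorld

An evaluation benchmark covering a range of desktop applications and operating-system tools, used to assess CUA performance in multi-domain environments.

The main dataset used in the paper's experiments.

Behavioral Correction

Adjusting the model's behavior in a targeted way, by distinguishing planning errors from execution errors, to fix mistakes on specific tasks.

A key objective of training.

Automated Task Synthesis

Automatically generating targeted training tasks from weakness reports and screenshot guidance, avoiding human annotation.

The core technique for data generation.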

Open Questions Unanswered questions from this research

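1 How can weakness detection stay accurate in extremely complex or novel domains? A reference agent that is not capable enough may limit weakness identification; future work may need self-supervision or meta-learning mechanisms to improve generalization.
2 Multi-round iterative generation is computationally expensive. How can training time and resource consumption be reduced without losing effectiveness? More efficient generation strategies or model compression could be explored.
3 Scalability of fine-tuning: with very large models or many domains, parameter updates may become a bottleneck. How can a more flexible fine-tuning mechanism be designed?
4 Adaptation to dynamic environments: the method has so far been validated mainly in static environments. How can continual learning and self-optimization be achieved under changing task requirements and user behavior?
5 Cross-modal, multi-task generalization: future research should examine how well the approach transfers to multimodal inputs and multi-task settings.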

Applications

Immediate Applications

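Assistants on edge devices

LearnWeak could be used to build personalized, automated software-operation assistants on edge devices, reducing manual tuning and improving user experience. Only a small number of examples would be needed to adapt quickly to different user habits and software environments.

Long-term Vision

Autonomous learning and continual adaptation

In the future, models could identify new weaknesses on their own in changing environments and generate training tasks automatically, achieving continual learning and self-optimization.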

Abstract

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

cs.LG cs.AI cs.CL
