Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation
The proposed PACT framework uses preference signals and task progress modeling to correct overestimated Q-values, raising the average success rate by 24.5 percentage points and converging 1.3× faster on real-robot tasks.
Key Findings
Methodology
The PACT framework integrates implicit human preference signals, a task progress model, and segment-level Q-value correction to improve sample efficiency in human-in-the-loop reinforcement learning. First, a self-supervised progress estimator trained on demonstration data identifies suboptimal segments within trajectories. Next, at intervention points, human corrective actions are compared with policy actions to generate preference pairs, which define a counterfactual advantage. This advantage is used to penalize overestimated Bellman targets in the identified suboptimal segments, which calibrates the critic's estimates. In addition, preference signals are incorporated directly into policy optimization by aligning the actor's mean actions with human preferences in a bounded action space. Together, these steps yield more accurate credit assignment, less Q-value overestimation, and faster policy convergence on five real-robot manipulation tasks.
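As a rough illustration of the preference-pair step, the sketch below assumes the counterfactual advantage is the critic's value gap between the human action and the policy action at the intervention state, clipped at zero. The summary does not give the exact formula, so this definition, the toy critic `toy_q`, and the function names are assumptions made for illustration only.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def toy_q(state: Vector, action: Vector) -> float:
    """Hypothetical critic: the value is highest when the action matches the state."""
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def counterfactual_advantage(
    q: Callable[[Vector, Vector], float],
    state: Vector,
    human_action: Vector,
    policy_action: Vector,
) -> float:
    """Assumed definition: how much better the critic rates the human action.

    The gap is clipped at zero so that a pair in which the policy action
    already scores at least as well contributes no penalty.
    """
    gap = q(state, human_action) - q(state, policy_action)
    return max(0.0, gap)


if __name__ == "__main__":
    state = [0.5, 0.0]
    human = [0.5, 0.0]   # corrective action supplied by the operator
    policy = [0.0, 0.0]  # action resampled from the current policy
    print(counterfactual_advantage(toy_q, state, human, policy))
```

A larger value means the policy action violated the implied preference more strongly, which is the quantity the critic correction would scale its penalty by under this assumption.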
Key Results
- Across five real-world manipulation tasks, PACT raised the average success rate from 58.0% to 82.5%, a gain of 24.5 percentage points, while lowering the intervention rate from 47.1% to 32.3%. Training time fell to 63 minutes, a 1.3× speedup over the HIL-SERL baseline. In the complex Assembly task, the success rate rose from 10% to 62.5% with markedly fewer human interventions, indicating better sample efficiency and robustness.
- Analysis of Q-value bias showed that PACT effectively suppressed overestimation in suboptimal segments, with critic bias shifting from positive (overestimation) to near-zero or negative values. The task progress model successfully localized failure behaviors such as hesitation and misgrasp, validating its utility for segment-level correction. Ablation studies confirmed that both critic correction and preference-guided policy optimization contributed synergistically to performance gains.
- The method’s ability to leverage implicit preferences from human interventions for fine-grained credit reassignment marks a key advancement over traditional reward shaping or global value correction strategies. This approach significantly improves the stability and efficiency of real-world robot learning, especially in complex, long-horizon tasks.
- Overall, the experimental results demonstrate that PACT not only accelerates learning but also reduces human workload, making it a promising approach for deploying autonomous robots in practical settings.
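The headline numbers are differences in percentage points, not relative changes, and the two are easy to confuse. The short calculation below works through both readings using the figures reported above. The baseline training time is derived here by assuming that the 1.3× figure is a ratio of training times, which the summary does not state explicitly.

```python
def percentage_point_change(before: float, after: float) -> float:
    """Absolute change between two percentages."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Average success rate: 58.0% -> 82.5%
    print(percentage_point_change(58.0, 82.5))       # 24.5 points
    print(round(relative_change(58.0, 82.5), 1))     # 42.2 percent relative

    # Average intervention rate: 47.1% -> 32.3%
    print(round(percentage_point_change(47.1, 32.3), 1))  # -14.8 points
    print(round(relative_change(47.1, 32.3), 1))          # -31.4 percent relative

    # Assumption: 1.3x is the ratio baseline_time / pact_time.
    print(round(63 * 1.3, 1))  # implied baseline of about 81.9 minutes
```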
Significance
This work addresses a fundamental challenge in real-world reinforcement learning: the overestimation bias caused by trajectory heterogeneity and sparse rewards. By integrating human preferences into the credit assignment process, PACT offers a novel solution that enhances both sample efficiency and policy robustness. Its ability to localize and correct suboptimal behaviors at the segment level paves the way for more reliable autonomous robotic systems capable of learning complex tasks with minimal human supervision. The approach bridges the gap between imitation learning and reinforcement learning, providing a scalable framework for human-robot collaboration. Its successful deployment on five diverse manipulation tasks underscores its practical relevance, potentially transforming industrial automation, service robotics, and assistive technologies.
Technical Contribution
The primary technical innovation lies in the integration of implicit human preference signals into a segment-level Q-value correction mechanism within an actor-critic framework. The method introduces a self-supervised progress model that localizes suboptimal segments without requiring additional annotations. It then constructs preference pairs at intervention points to define a counterfactual advantage, which is exponentially weighted and used to penalize overestimated Bellman targets. This correction is seamlessly incorporated into critic training, reducing bias. Furthermore, the approach directly aligns the policy with human preferences in the bounded mean-action space, enhancing policy convergence and stability. The combination of these mechanisms constitutes a significant advancement over existing methods like Double DQN, TD3, and CQL, which lack fine-grained credit reassignment capabilities.
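The position-aware penalty can be sketched as follows, assuming weights of the form exp(-beta * (L - 1 - i)) so that the last step of a segment of length L gets weight 1, and a penalty that is subtracted from a one-step Bellman target. The parameters `beta` and `lam` and both function names are hypothetical; the paper's exact weighting and scaling may differ.

```python
import math
from typing import List, Sequence


def position_weights(length: int, beta: float = 1.0) -> List[float]:
    """Exponential weights that grow toward the end of the segment.

    The final step (closest to the intervention) has weight 1.0 and
    earlier steps decay by a factor of exp(-beta) per step.
    """
    return [math.exp(-beta * (length - 1 - i)) for i in range(length)]


def penalized_targets(
    rewards: Sequence[float],
    next_q: Sequence[float],
    advantage: float,
    gamma: float = 0.99,
    lam: float = 1.0,
    beta: float = 1.0,
) -> List[float]:
    """One-step Bellman targets for a suboptimal segment, minus a weighted penalty.

    rewards[i] and next_q[i] belong to the i-th transition of the segment;
    advantage is the non-negative counterfactual advantage at the intervention.
    """
    weights = position_weights(len(rewards), beta)
    return [
        r + gamma * q - lam * w * advantage
        for r, q, w in zip(rewards, next_q, weights)
    ]


if __name__ == "__main__":
    targets = penalized_targets(
        rewards=[0.0, 0.0, 0.0],
        next_q=[0.9, 0.9, 0.9],
        advantage=0.4,
    )
    print(targets)  # later steps are pushed down more than earlier ones
```

Transitions outside the identified segment would keep the ordinary target, so the correction stays local to the behavior that preceded the intervention.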
Novelty
Unlike prior works that rely on global reward shaping or uniform value backups, this study pioneers the use of intervention-induced implicit preferences for segment-level Q-value calibration. The core novelty is the construction of preference pairs and the exponential weighting scheme to perform directional credit correction, effectively mitigating Q-value overestimation in suboptimal segments. Additionally, the integration of a self-supervised progress model for automatic localization of suboptimal behaviors is a key contribution, enabling the framework to operate without external annotations. This approach introduces a new paradigm for human-in-the-loop reinforcement learning, emphasizing fine-grained, preference-guided credit assignment in real robot tasks.
Limitations
- The effectiveness of the progress model depends on the quality and representativeness of demonstration data; in highly non-monotonic or ambiguous tasks, localization accuracy may decline, affecting correction quality.
- The current method primarily addresses overestimation bias at the segment level; it does not fully solve the problem of per-step value calibration, especially in highly stochastic or complex environments.
- While the approach reduces human intervention, it still relies on human corrective actions at intervention points, which may limit scalability in scenarios requiring continuous supervision. Future work should explore automatic preference extraction and multi-agent extensions.
Future Work
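Future research will combine semantic understanding with pre-trained models to make task progress estimation more robust and more general. The authors also plan to explore automatic extraction and fusion of multimodal preference signals to improve adaptability in complex environments. In addition, the preference correction mechanism will be extended to multi-agent collaboration and long-horizon tasks to support a broader range of autonomous robot applications. Pre-trained task representations and semantic reasoning are further expected to make preference-guided credit reassignment more efficient and more robust in more complex and diverse scenarios.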
AI Executive Summary
In recent years, robotic manipulation has seen remarkable progress through deep reinforcement learning (DRL), yet deploying these methods in real-world settings remains challenging due to sample inefficiency and safety concerns. Traditional RL approaches often require extensive interaction data, which is costly and risky in physical environments. Human-in-the-loop reinforcement learning (HIL-RL) offers a promising solution by incorporating human interventions during training, thereby improving sample efficiency. However, existing HIL-RL methods treat all trajectory transitions uniformly, ignoring the heterogeneity within successful trajectories that contain both beneficial and suboptimal actions. This oversight leads to Q-value overestimation, biased policy updates, and slower convergence.
Addressing this critical issue, the paper introduces PACT, a Preference-Calibrated Actor-Critic Training framework that leverages implicit human preferences to perform fine-grained credit reassignment. The core idea is to identify suboptimal behavior segments within trajectories using a self-supervised task progress model trained on demonstration data. At intervention points, human corrective actions are compared with policy actions to generate preference pairs, which define a counterfactual advantage. This advantage quantifies the degree of preference violation and is used to penalize overestimated Q-values in the identified segments. The penalization is position-aware, exponentially emphasizing later actions within the segment, which are more causally linked to interventions.
Simultaneously, the framework incorporates human preferences directly into policy optimization by aligning the actor's mean actions with human corrective actions in a bounded action space. This dual approach, critic correction plus actor preference alignment, gives more accurate credit assignment, reduces bias, and accelerates learning. Experiments on five real-robot manipulation tasks show that PACT raises the average success rate by 24.5 percentage points, lowers the intervention rate by about 15 percentage points (47.1% to 32.3%), and shortens training time by approximately 20 minutes compared to the state-of-the-art HIL-SERL. In the most complex Assembly task, the success rate improved from 10% to 62.5%, showing the method's robustness.
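The actor-side alignment can be pictured with a small example. The sketch assumes the bounded mean is a tanh-squashed pre-activation and that alignment is a mean squared error between that bounded mean and the human action, minimized by plain gradient descent. The squashing function, the loss, and all names here are assumptions; the summary only says that alignment happens in a bounded mean-action space.

```python
import math
from typing import List, Sequence


def bounded_mean(pre_activation: Sequence[float]) -> List[float]:
    """Squash the actor's raw mean into (-1, 1), one value per action dimension."""
    return [math.tanh(u) for u in pre_activation]


def alignment_loss(pre_activation: Sequence[float], human_action: Sequence[float]) -> float:
    """Mean squared error between the bounded mean and the human action."""
    mean = bounded_mean(pre_activation)
    return sum((m - a) ** 2 for m, a in zip(mean, human_action)) / len(mean)


def alignment_grad(pre_activation: Sequence[float], human_action: Sequence[float]) -> List[float]:
    """Analytic gradient of alignment_loss with respect to the pre-activation."""
    n = len(pre_activation)
    grads = []
    for u, a in zip(pre_activation, human_action):
        m = math.tanh(u)
        # d/du (tanh(u) - a)^2 / n = 2 (m - a) (1 - m^2) / n
        grads.append(2.0 * (m - a) * (1.0 - m * m) / n)
    return grads


def align(pre_activation: Sequence[float], human_action: Sequence[float],
          lr: float = 0.5, steps: int = 200) -> List[float]:
    """Run gradient descent on the pre-activation to approach the human action."""
    u = list(pre_activation)
    for _ in range(steps):
        g = alignment_grad(u, human_action)
        u = [ui - lr * gi for ui, gi in zip(u, g)]
    return u


if __name__ == "__main__":
    human = [0.5, -0.3]
    start = [0.0, 0.0]
    end = align(start, human)
    print(alignment_loss(start, human), alignment_loss(end, human))
    print(bounded_mean(end))
```

Working in the squashed space keeps the target and the prediction inside the same action limits, which is one plausible reason for aligning bounded means rather than raw network outputs.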
The significance of this work lies in its novel integration of human preferences for segment-level credit correction, a departure from traditional reward shaping or global value adjustments. By localizing suboptimal behaviors and calibrating their Q-values, PACT enhances the stability and efficiency of real-world robot learning. Its ability to reduce human workload while maintaining high performance paves the way for more autonomous, adaptable robotic systems in industrial, service, and medical domains.
Future directions include incorporating semantic-rich progress models, leveraging large pre-trained models for better behavior understanding, and extending the framework to multi-agent and long-horizon tasks. These advancements will further bridge the gap between imitation and reinforcement learning, fostering the development of truly autonomous robots capable of learning complex tasks with minimal human supervision.
Deep Analysis
Background
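Deep reinforcement learning (DRL) for robotic manipulation has evolved rapidly from basic Q-learning to sophisticated continuous-action policies. Early representative algorithms such as DQN (Deep Q-Network) and DDPG (Deep Deterministic Policy Gradient) achieved notable success in simulation, but on real robots the high cost and risk of collecting samples have limited their use. In recent years, imitation learning, transfer learning, and pre-trained models, for example pre-trained vision models such as ResNet and task representation learning, have greatly improved sample efficiency. Human-in-the-loop reinforcement learning (HIL-RL), which brings human expert knowledge into training, has become an active research topic. Methods such as HIL-SERL improve performance on complex tasks by combining offline demonstrations with online interaction, but they still treat all transitions in a trajectory alike, which biases both the Q-values and the learned policy. Conventional remedies for Q-value overestimation, such as Double DQN, TD3, and CQL, rely mainly on critic modifications or reward shaping and do not address fine-grained credit assignment within a trajectory. This paper uses the preference signals produced by human interventions, together with a task progress model, to correct Q-values at the segment level, extending what HIL-RL can do on real robots.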
Core Problem
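In real-robot manipulation, successful trajectories often contain suboptimal behavior segments. Conventional reinforcement learning treats every transition as a homogeneous sample during training, so Q-values in those suboptimal segments are overestimated. This bias harms policy stability, steers the policy toward reinforcing wrong behaviors, and lengthens convergence. Human intervention can correct the bias, but frequent intervention raises labor cost and reduces autonomy. The core problem is therefore how to exploit intervention signals while avoiding Q-value inflation and improving sample efficiency. Existing methods lack fine-grained handling of heterogeneity within a trajectory and struggle to learn efficiently and robustly.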
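A short worked equation makes the overestimation mechanism concrete. It assumes a sparse terminal reward, that is, zero reward before the final step T and a reward R at success; this setup and the symbols below are illustrative assumptions rather than notation taken from the paper.

```latex
% Assumed setup: r_t = 0 for t < T, r_T = R, discount factor \gamma.
\begin{aligned}
y_t &= r_t + \gamma\, Q(s_{t+1}, a_{t+1}), \\
G_t &= \sum_{k=t}^{T} \gamma^{\,k-t} r_k \;=\; \gamma^{\,T-t} R .
\end{aligned}
```

Under this setup the return depends only on the distance T - t to the end of the episode. With \gamma = 0.99 and R = 1, a transition 20 steps before success receives about 0.818 whether it made progress or was a hesitation that triggered an intervention, which is the uniform credit that segment-level correction is meant to undo.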
Innovation
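The main innovation of this study is the use of preference signals for segment-level Q-value correction. Specifically: 1) a task progress model automatically identifies suboptimal segments, avoiding manual annotation; 2) the preference contrast between the human intervention action and the policy action defines a counterfactual advantage that suppresses Q-value inflation; 3) preference signals are brought into policy optimization in the continuous action space, directly steering the policy toward human-preferred behavior. This mechanism goes beyond whole-trajectory reward shaping by reassigning credit at a fine granularity, which mitigates Q-value bias and improves sample efficiency and training speed.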
Methodology
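- Task formulation: the robotic manipulation problem is defined as a Markov decision process (MDP) whose objective is to maximize the cumulative discounted return.
- Task progress model: a multimodal encoder (images and proprioception) is used to train a self-supervised task progress estimator that identifies suboptimal segments in a trajectory.
- Suboptimal segment identification: using the predicted task progress values, segments where progress drops markedly or fails to recover are flagged as candidate suboptimal intervals.
- Preference pair construction: at each intervention point, the human intervention action and the policy action form a preference pair, which defines a counterfactual advantage reflecting the degree of deviation.
- Q-value correction: the preference advantage is weighted by position and used to adjust the Bellman targets of the corresponding segment, suppressing Q-value inflation.
- Policy guidance: in the continuous action space, preference signals are incorporated into policy optimization to steer the policy directly toward human preferences.
- Training procedure: the critic's Q-value correction and the actor's preference guidance are combined in end-to-end training, with policy parameters updated online.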
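The segment identification step in the list above can be sketched with a simple rule on the predicted progress curve. The rule below, which opens a segment once progress falls more than `drop` below its running peak and closes it when progress recovers to that peak, is an assumed heuristic for illustration; the function name and the threshold are hypothetical and the paper's actual criterion may differ.

```python
from typing import List, Sequence, Tuple


def find_suboptimal_segments(
    progress: Sequence[float], drop: float = 0.05
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of candidate suboptimal segments.

    A segment starts when progress falls more than `drop` below the best
    value seen so far, and ends at the step before progress recovers to
    that value. A segment that never recovers runs to the end of the trajectory.
    """
    if not progress:
        return []
    segments: List[Tuple[int, int]] = []
    peak = progress[0]
    start = None
    for t, p in enumerate(progress):
        if start is None:
            if peak - p > drop:
                start = t          # progress regressed: open a segment
            else:
                peak = max(peak, p)
        elif p >= peak:
            segments.append((start, t - 1))  # recovered: close the segment
            start = None
            peak = p
    if start is not None:
        segments.append((start, len(progress) - 1))
    return segments


if __name__ == "__main__":
    curve = [0.0, 0.2, 0.4, 0.3, 0.25, 0.45, 0.6, 0.5]
    print(find_suboptimal_segments(curve))
```

In this example the dip at steps 3 to 4 is recovered by step 5, while the final drop at step 7 never recovers and is reported as a segment that reaches the end of the trajectory.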
Experiments
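- Dataset: 20 demonstration trajectories were collected on five real-robot tasks (in order of increasing difficulty: Press, Insertion, Pick, Pick & Place, Assembly) and used to train the task progress model and to initialize reinforcement learning.
- Evaluation metrics: success rate, intervention rate, and training time, compared between HIL-SERL and PACT.
- Experimental setup: experiments ran on a Galaxea A1X 6-DoF robot arm; controlled variables include the learning rate and the preference correction strength.
- Ablation analysis: either the critic correction or the actor preference guidance is removed to verify each component's contribution.
- Statistical analysis: experiments were repeated multiple times, and the average success rate, intervention rate, and training time were computed to ensure the results are robust.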
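The two rate metrics listed above can be computed from per-episode logs roughly as follows. The sketch assumes the intervention rate is the share of timesteps executed under human control and that each episode is logged as a dictionary with a `success` flag and a per-step `intervened` list; both the definition and the log format are assumptions, not details given in the summary.

```python
from typing import Dict, List


def summarize(episodes: List[dict]) -> Dict[str, float]:
    """Aggregate success rate and intervention rate, both in percent.

    Each episode is a dict with:
      "success":    bool, whether the task was completed
      "intervened": list of bools, True where the human controlled the step
    """
    total_steps = sum(len(ep["intervened"]) for ep in episodes)
    human_steps = sum(sum(ep["intervened"]) for ep in episodes)
    successes = sum(1 for ep in episodes if ep["success"])
    return {
        "success_rate": 100.0 * successes / len(episodes),
        "intervention_rate": 100.0 * human_steps / total_steps,
    }


if __name__ == "__main__":
    logs = [
        {"success": True, "intervened": [False, False, True, True]},
        {"success": False, "intervened": [True, True, True, True]},
    ]
    print(summarize(logs))
```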
Results
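- Across the five tasks, PACT raised the average success rate from 58.0% to 82.5%, lowered the intervention rate from 47.1% to 32.3%, and cut training time to 63 minutes. In the complex Assembly task in particular, the success rate rose from 10% to 62.5% and training time was shortened by about 17 minutes.
- The Q-value bias analysis shows that PACT suppresses Q-value overestimation in suboptimal segments, with the critic bias changing from positive to negative, which supports the effectiveness of the preference correction.
- The task progress model located several kinds of failure behavior, supporting the accuracy of segment identification. The ablation study shows that critic correction and actor preference guidance are complementary and jointly improve performance.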
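One common way to quantify the critic bias discussed in the results is to compare predicted Q-values with the discounted Monte Carlo return actually observed along the same trajectory. The sketch below uses that estimator over a chosen segment; whether the paper measures bias this way is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Monte Carlo return from each step to the end of the trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def segment_bias(
    q_values: Sequence[float],
    rewards: Sequence[float],
    segment: Tuple[int, int],
    gamma: float = 0.99,
) -> float:
    """Mean of (predicted Q - observed return) over an inclusive segment.

    Positive values indicate overestimation, negative values underestimation.
    """
    returns = discounted_returns(rewards, gamma)
    start, end = segment
    diffs = [q_values[t] - returns[t] for t in range(start, end + 1)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    rewards = [0.0, 0.0, 0.0, 1.0]   # sparse terminal reward
    q_pred = [0.5, 0.5, 0.5, 1.0]    # critic predictions along the trajectory
    print(segment_bias(q_pred, rewards, segment=(0, 1), gamma=0.5))
```

With these toy numbers the critic is too optimistic on the first two steps, so the reported bias is positive; a correction that works as described would move this value toward zero or below.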
Applications
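- Immediate applications: the method can be applied to industrial robot assembly, warehouse automation, and similar settings, improving robots' ability to learn autonomously and reducing human intervention.
- Long-term vision: combined with semantic understanding and pre-trained models, the approach could extend to multi-agent collaboration and complex long-horizon tasks, leading to more intelligent and more autonomous robotic systems.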
Limitations & Outlook
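- The task progress model relies on imitation learning and may perform poorly on non-monotonic or complex long-horizon tasks.
- Q-value correction is applied only at the segment level; per-step calibration is not achieved, so some risk of bias remains.
- The model's robustness in multimodal perception settings has not been fully validated; future work needs richer perceptual information and semantic understanding.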
Abstract
Human-in-the-loop reinforcement learning (HIL-RL) improves sample efficiency in real-robot manipulation through online human intervention. However, successful trajectories may include suboptimal actions that deviate from the desired task-execution path and force human intervention. Existing HIL-RL methods typically apply the consistent credit assignment principle to all transitions, uniformly propagating discounted terminal rewards through suboptimal segments, ignoring the actual contribution of each transition to task success. This overestimates Q-values for critic learning and indirectly misguides actor updates toward suboptimal behavior patterns. To this end, we propose PACT, a Preference-calibrated Actor-Critic Training framework that leverages the implicit preference signals induced by intervention to perform credit reassignment on identified suboptimal segments while directly guiding policy training for unbiased critic-actor learning. Specifically, we first design a progress model that learns from human demonstration and identifies suboptimal segments for credit correction. Then, from the human action and resampled policy action at the intervention state, we build preference pairs to define a counterfactual advantage that penalizes Bellman targets of the identified suboptimal segment, enabling directional credit calibration. Moreover, we directly align the policy with human corrective actions in the bounded mean space, providing an additional signal beyond critic-guided updates. Across five real-robot manipulation tasks, PACT improves the average success rate by 24.5% and achieves 1.3 times faster convergence, thereby improving both RL sample efficiency and performance. Code is available at https://anonymous.4open.science/r/HILRL-A1X-BC05.