From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

TL;DR

PRIMO R1 transforms video MLLMs into active 'Critics' using reinforcement learning, achieving 67.0% accuracy on the RoboFail benchmark.

cs.RO · Advanced · 2026-03-17
Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu
Reinforcement Learning · Video MLLMs · Robotic Manipulation · Process Reasoning · Zero-Shot Generalization

Key Findings

Methodology

This paper introduces PRIMO R1, a 7B framework that transforms video MLLMs into active 'Critics' through outcome-based reinforcement learning, promoting explicit Chain-of-Thought generation for progress estimation. The architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Extensive experiments supported by the proposed PRIMO Dataset and Benchmark demonstrate state-of-the-art performance across diverse in-domain environments and out-of-domain real-world humanoid scenarios.
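The anchored input described above can be sketched as a prompt-assembly routine. This is a minimal illustration of the idea, not the paper's actual implementation: the function name, message layout, and final instruction text are all assumptions.

```python
def build_temporal_input(initial_frame, video_frames, current_frame, instruction):
    """Assemble a multimodal prompt that anchors the sampled video
    sequence between the initial-state and current-state images
    (hypothetical layout; the paper's exact template may differ)."""
    content = [{"type": "text", "text": f"Task: {instruction}"}]
    # Anchor 1: the initial state, before any manipulation occurred.
    content.append({"type": "text", "text": "Initial state:"})
    content.append({"type": "image", "image": initial_frame})
    # The intermediate execution video sits between the two anchors.
    content.append({"type": "text", "text": "Execution video:"})
    for frame in video_frames:
        content.append({"type": "image", "image": frame})
    # Anchor 2: the current state whose progress is to be estimated.
    content.append({"type": "text", "text": "Current state:"})
    content.append({"type": "image", "image": current_frame})
    content.append({"type": "text",
                    "text": "Reason step by step, then output progress in [0, 100]."})
    return [{"role": "user", "content": content}]
```

Bracketing the clip with the two state images gives the model fixed reference points, so progress can be judged relative to where the task started and where it currently stands rather than from the clip alone.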

Key Results

  • PRIMO R1 reduces mean absolute error by 50% relative to specialized reasoning baselines and delivers significant relative accuracy gains over 72B-scale general MLLMs.
  • On the RoboFail benchmark, PRIMO R1 establishes state-of-the-art performance with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
  • PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks, indicating its adaptability across different scenarios.

Significance

This research addresses the critical challenge of accurate process supervision in long-horizon robotic manipulation by transforming video MLLMs from passive observers into active critics. This transformation not only enhances model accuracy in specific tasks but also demonstrates robust generalization capabilities across different domains and scenarios. The successful application of PRIMO R1 indicates that outcome-based reinforcement learning can significantly improve process reasoning capabilities in robotic manipulation, providing new perspectives and methodologies for future robotic technology development.

Technical Contribution

PRIMO R1's technical contribution lies in its departure from existing state-of-the-art methods through outcome-based reinforcement learning and explicit Chain-of-Thought generation. The framework opens up new engineering possibilities, particularly in making video MLLMs more active and strengthening their process reasoning. By anchoring video sequences between initial and current states, PRIMO R1 achieves more accurate progress estimation and failure detection.

Novelty

PRIMO R1 is novel in transforming video MLLMs into active critics rather than merely passive observers. This transformation, facilitated by outcome-based reinforcement learning and explicit Chain-of-Thought generation, significantly enhances process reasoning accuracy and generalization capabilities, contrasting sharply with existing supervised fine-tuning methods.

Limitations

  • PRIMO R1 may face challenges in handling extremely complex scenarios, which might require higher computational resources and more sophisticated model architectures.
  • The method may require further fine-tuning for applications in certain out-of-domain scenarios to adapt to different task requirements and environmental changes.
  • While PRIMO R1 performs well in many tasks, its computational cost in real-time applications remains a challenge to be addressed.

Future Work

Future research directions include optimizing PRIMO R1's computational efficiency for broader real-time applications, exploring its adaptability and performance in additional out-of-domain scenarios, and integrating other advanced techniques to further strengthen its process reasoning.

AI Executive Summary

Accurate process supervision in long-horizon robotic manipulation has been a critical challenge. Current video Multimodal Large Language Models (MLLMs), primarily trained under a Supervised Fine-Tuning paradigm, function as passive 'Observers' that recognize ongoing events rather than evaluating the current state relative to the final task goal. This paper introduces a 7B framework named PRIMO R1, which transforms video MLLMs into active 'Critics' using outcome-based Reinforcement Learning. The framework incentivizes explicit Chain-of-Thought generation for progress estimation and constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images.

Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments demonstrate that PRIMO R1 achieves state-of-the-art performance across diverse in-domain environments and out-of-domain real-world humanoid scenarios. Specifically, PRIMO R1 achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, showing significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks.

On the RoboFail benchmark, PRIMO R1 establishes state-of-the-art performance with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%. These results indicate that PRIMO R1 not only excels in specific tasks but also possesses broad adaptability and generalization capabilities.

The successful application of PRIMO R1 demonstrates that outcome-based reinforcement learning can significantly improve process reasoning capabilities in robotic manipulation, providing new perspectives and methodologies for future robotic technology development. By transforming video MLLMs from passive observers into active critics, PRIMO R1 addresses the critical challenge of accurate process supervision in long-horizon robotic manipulation.

However, PRIMO R1 may face challenges in handling extremely complex scenarios, which might require higher computational resources and more sophisticated model architectures. Future research directions include further optimizing PRIMO R1's computational efficiency for broader real-time applications.

Deep Analysis

Background

In long-horizon robotic manipulation, accurate process supervision has been a critical challenge. Existing video Multimodal Large Language Models (MLLMs), primarily trained under a Supervised Fine-Tuning paradigm, typically function as passive 'Observers' that recognize ongoing events rather than evaluating the current state relative to the final task goal. This passive observation leaves models without effective progress estimation and failure detection capabilities in complex tasks. In recent years, with the development of deep learning and reinforcement learning, researchers have begun to explore how these techniques can be applied to process reasoning in robotic manipulation to make models more active and more accurate.

Core Problem

Existing video MLLMs face several core problems in long-horizon robotic manipulation. Firstly, these models typically function as passive observers, lacking the ability to evaluate the current state relative to the final task goal. Secondly, existing methods have limited progress estimation and failure detection capabilities in complex tasks, making it difficult to meet diverse task requirements and environmental changes. Finally, traditional supervised fine-tuning methods often require large amounts of labeled data and computational resources when handling long-horizon tasks, making efficient process supervision challenging.

Innovation

The core innovations of PRIMO R1 lie in transforming video MLLMs into active critics through outcome-based reinforcement learning. Specifically:


  • Outcome-Based Reinforcement Learning: by incentivizing explicit Chain-of-Thought generation, PRIMO R1 can estimate progress more accurately.

  • Structured Temporal Input: by explicitly anchoring video sequences between initial and current state images, PRIMO R1 can better capture task progress.

  • Zero-Shot Generalization: PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks, indicating its adaptability across different scenarios.

Methodology

The implementation of PRIMO R1 includes the following key steps:


  • Outcome-Based Reinforcement Learning: incentivizing the model to generate explicit Chain-of-Thought for progress estimation through a reward mechanism.

  • Video Sequence Anchoring: explicitly anchoring video sequences between initial and current state images to construct structured temporal input.

  • Chain-of-Thought Generation: using reinforcement learning, PRIMO R1 generates explicit Chain-of-Thought, improving progress estimation accuracy.

  • Zero-Shot Generalization: extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios validate PRIMO R1's zero-shot generalization capabilities.
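The first step above, an outcome-based reward, can be sketched as a function that scores only the model's final progress answer, leaving the intermediate Chain-of-Thought unsupervised. This is a guess at the shape of such a reward; the answer template ("Progress: N"), the format penalty, and the linear decay are assumptions, not the paper's actual design.

```python
import re

def outcome_reward(response: str, true_progress: float) -> float:
    """Outcome-based reward: only the final answer is scored, so the
    Chain-of-Thought that precedes it is shaped indirectly by RL.

    Assumed response format: '<think>...</think> Progress: 62'
    (a hypothetical template, not the paper's exact one)."""
    match = re.search(r"Progress:\s*(\d+(?:\.\d+)?)", response)
    if match is None:
        return -1.0  # penalize responses with no parsable final answer
    predicted = float(match.group(1))
    # Reward decays linearly with absolute progress error (assumption).
    return max(0.0, 1.0 - abs(predicted - true_progress) / 100.0)
```

Because the reward touches only the outcome, the policy is free to discover whatever reasoning steps best support an accurate final estimate, which is the mechanism the paper credits for eliciting process reasoning.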

Experiments

The experimental design includes extensive testing using the PRIMO Dataset and Benchmark. Specifically, experiments are conducted across diverse in-domain environments and out-of-domain real-world humanoid scenarios to validate PRIMO R1's performance. Baselines include existing 72B-scale general MLLMs, with evaluation metrics such as mean absolute error and failure detection accuracy. Ablation studies are also conducted to analyze the contributions and impacts of each component in PRIMO R1.
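The two evaluation metrics named above are standard and can be sketched directly; these are generic reference implementations, not the paper's evaluation code.

```python
def mean_absolute_error(preds, targets):
    """Average absolute gap between predicted and true progress values."""
    assert len(preds) == len(targets) and preds
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def failure_detection_accuracy(preds, labels):
    """Fraction of clips whose predicted success/failure label is correct."""
    assert len(preds) == len(labels) and labels
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
```

Under these definitions, the paper's headline numbers read as: PRIMO R1's progress estimates halve the MAE of specialized reasoning baselines, and 67.0% of RoboFail clips receive the correct success/failure judgment.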

Results

Experimental results show that PRIMO R1 achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 establishes state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%. These results indicate that PRIMO R1 not only excels in specific tasks but also possesses broad adaptability and generalization capabilities.

Applications

Application scenarios for PRIMO R1 include process supervision and failure detection in long-horizon robotic manipulation. The framework can be directly applied to tasks requiring high-precision process reasoning, such as industrial automation and smart manufacturing. Additionally, PRIMO R1's zero-shot generalization capabilities make it widely adaptable across different domains and scenarios, capable of meeting diverse task requirements and environmental changes.

Limitations & Outlook

Despite PRIMO R1's excellent performance in many tasks, it may face challenges in handling extremely complex scenarios. Specifically, these scenarios might require higher computational resources and more sophisticated model architectures. Additionally, PRIMO R1 may require further fine-tuning for applications in certain out-of-domain scenarios to adapt to different task requirements and environmental changes. Future research directions include further optimizing PRIMO R1's computational efficiency for broader real-time applications.

Plain Language (accessible to non-experts)

Imagine you're cooking in a kitchen. You have an assistant who usually just watches what you're doing without telling you how you're doing. Now, imagine this assistant becomes smarter and not only sees what you're doing but also tells you how far you are from completing the dish and even points out potential mistakes. That's what PRIMO R1 does. It's like a smart assistant that actively evaluates a robot's progress in a task by watching a video and providing feedback. PRIMO R1 achieves this through reinforcement learning, making robots not just passive observers but active participants and evaluators of task progress. This transformation makes robots more efficient and accurate in handling long-horizon tasks, much like an experienced chef can manage each step of a complex dish with ease.

ELI14 (explained like you're 14)

Hey there! You know how robots work, right? Imagine a robot is helping clean your room, but it has no idea how well it's doing. It's like a friend who only watches but doesn't talk. Now, PRIMO R1 is like a super smart robot assistant that not only sees what it's doing but also tells you how far it is from finishing the task and even points out where it might be going wrong. It's like when you're playing a game, and your character not only sees the path ahead but also knows how far it is from the finish line. PRIMO R1 does this using a technique called reinforcement learning, making the robot smarter and more active, so it handles complex tasks more efficiently.

Glossary

Reinforcement Learning

A machine learning method that trains models through a reward mechanism to make optimal decisions in specific tasks.

Used in PRIMO R1 to incentivize explicit Chain-of-Thought generation.

Video Multimodal Large Language Models (MLLMs)

Models capable of processing and analyzing video data, typically used for recognizing and understanding events in videos.

PRIMO R1 transforms them into active critics through reinforcement learning.

Chain-of-Thought

A reasoning process that estimates task progress or solves problems through a series of logical steps.

Used in PRIMO R1 for progress estimation.

Zero-Shot Generalization

The ability of a model to perform well on unseen tasks or scenarios.

PRIMO R1 demonstrates this capability in failure detection tasks.

Process Reasoning

The ability to evaluate the relationship between the current state and the goal during task execution.

Enhanced in PRIMO R1 through reinforcement learning.

Supervised Fine-Tuning

Fine-tuning a pre-trained model with labeled data to improve its performance on specific tasks.

Existing video MLLMs are primarily trained under this paradigm.

RoboFail Benchmark

A benchmark test used to evaluate robotic failure detection performance.

PRIMO R1 achieves state-of-the-art performance on this benchmark.

Active Critic

A model role capable of actively evaluating task progress and providing feedback.

Achieved in PRIMO R1 through reinforcement learning.

Structured Temporal Input

Temporal input constructed by explicitly anchoring video sequences between initial and current states.

Used in PRIMO R1 to improve progress estimation accuracy.

Mean Absolute Error

A metric for evaluating model prediction accuracy, representing the average difference between predicted and true values.

Used to evaluate PRIMO R1's performance on reasoning baselines.

Open Questions (unanswered questions from this research)

  1. While PRIMO R1 performs well in many tasks, it may face challenges in handling extremely complex scenarios, which might require higher computational resources and more sophisticated model architectures. How to handle such scenarios without increasing computational costs is an open question.
  2. PRIMO R1 may require further fine-tuning for applications in certain out-of-domain scenarios. How to achieve broader domain adaptability without extensive fine-tuning remains a challenge.
  3. Despite PRIMO R1's strong performance in failure detection, its computational cost in real-time applications remains unresolved. How to reduce this cost without sacrificing performance is an important direction for future research.
  4. How PRIMO R1's zero-shot generalization holds up across different scenarios, and how to further enhance it, is a research direction worth exploring.
  5. How to apply PRIMO R1's technology to more practical settings and verify its adaptability and performance across domains requires further research.

Applications

Immediate Applications

Industrial Automation

PRIMO R1 can be used for process supervision and failure detection in industrial automation, improving production line efficiency and accuracy.

Smart Manufacturing

In smart manufacturing, PRIMO R1 can optimize production processes and reduce error rates by actively evaluating task progress.

Robotic-Assisted Surgery

PRIMO R1 can be used in robotic-assisted surgery for process supervision, ensuring precision and safety in surgical procedures.

Long-term Vision

Smart Home

PRIMO R1 can be used in smart home robotic assistants, providing smarter household management and security monitoring.

Autonomous Driving

In autonomous driving, PRIMO R1 can improve the safety and reliability of autonomous systems by actively evaluating driving environments.

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

cs.RO · cs.AI · cs.CL · cs.CV