DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

TL;DR

This paper introduces DIRECT, a multimodal scene-aware routing framework that dynamically allocates test-time compute among embodied planners. On a physical Franka arm, the router matches or exceeds a stronger model's success rate at up to 65% lower average latency.

cs.RO · Advanced · 2026-06-11

Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone


embodied AI, multimodal routing, test-time compute, hierarchical planning, robotic manipulation

Key Findings

Methodology

DIRECT encodes each scene image with a frozen SigLIP visual encoder and each instruction with a frozen BGE-M3 text encoder, then combines the two into a multimodal feature. Conditioned on these scene and instruction embeddings, a lightweight regression-based router predicts the quality and resource cost (FLOPs or latency) of each candidate model in a fixed pool. Training data is generated synthetically by sampling scenes, prompting large VLMs to generate instructions, executing the candidate models, and scoring success and latency. The router then learns to maximize a utility function that balances success rate against inference cost, which enables per-task model selection. Experiments on VLABench, RoboMME, and real robot hardware indicate that this dynamic scheduling achieves comparable or superior success rates at significantly reduced latency across three scaling axes: chain-of-thought depth, model size, and memory history.
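To make the selection step concrete, here is a minimal sketch of utility-based model selection. The summary only says that the utility balances success rate and inference cost, so the linear form `quality - cost_weight * cost`, the `Prediction` type, the model names, and all numbers below are assumptions for illustration, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Router output for one candidate model (hypothetical structure)."""
    name: str
    quality: float  # predicted success probability in [0, 1]
    cost: float     # predicted latency in seconds


def select_model(predictions, cost_weight):
    """Pick the candidate with the highest assumed utility.

    Utility is taken to be quality - cost_weight * cost; the paper's
    actual utility function may differ.
    """
    return max(predictions, key=lambda p: p.quality - cost_weight * p.cost)


# Made-up predictions for an easy and a hard task.
easy_task = [
    Prediction("small_no_cot", quality=0.90, cost=2.0),
    Prediction("large_with_cot", quality=0.93, cost=20.0),
]
hard_task = [
    Prediction("small_no_cot", quality=0.20, cost=2.0),
    Prediction("large_with_cot", quality=0.85, cost=20.0),
]

chosen_easy = select_model(easy_task, cost_weight=0.01)
chosen_hard = select_model(hard_task, cost_weight=0.01)
print(chosen_easy.name, chosen_hard.name)
```

With these numbers the cheap model wins on the easy task because the larger model's small quality gain does not cover its extra latency, while the hard task is routed to the larger model. Setting `cost_weight` to zero reduces the rule to always picking the highest predicted quality.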

Key Results

On VLABench, the routing strategy achieved a 15% success rate increase and reduced average latency by over 30 seconds compared to static model deployment, reaching an efficiency score (η) above 75%.
In RoboMME, the method effectively distinguished task difficulty levels, optimizing recall strategies and memory architectures, leading to higher success with less resource consumption.
On the Franka robot platform, the dynamic routing matched or exceeded the success of the strongest models, with latency reductions up to 65%. These results demonstrate the method’s robustness and practical value in real-world robotic tasks, including long-horizon chaining and zero-shot manipulation.

Significance

This work addresses the critical challenge of balancing computational cost and performance in embodied AI systems. By intelligently allocating test-time compute based on scene and instruction context, it overcomes the inefficiencies of uniform scaling strategies. The approach enhances the deployment feasibility of high-capability models in real-world robotics, paving the way for more responsive, resource-efficient autonomous agents. Its principles are applicable beyond robotics, potentially benefiting large-scale multimodal models in natural language understanding, decision-making, and adaptive inference systems, thus contributing significantly to the field of intelligent systems and AI scalability.

Technical Contribution

The paper's core technical contributions include: (1) a multimodal scene-aware routing mechanism that leverages visual and textual features for model selection; (2) a regression-based lightweight router architecture that predicts quality and cost metrics with minimal overhead; (3) a comprehensive analysis of how different axes—reasoning depth, model size, and memory—interact non-linearly in capability gains; (4) a scalable training pipeline combining synthetic and real data for robust router learning; and (5) extensive validation on both simulation and physical robots, demonstrating significant improvements in efficiency and success rate. These innovations enable adaptive, context-aware model invocation, advancing the state-of-the-art in embodied AI planning.

Novelty

This research is the first to incorporate multimodal scene context into a dynamic model routing framework for embodied agents, moving beyond prior text-only routing methods like FrugalGPT and RouteLLM. It systematically analyzes the non-uniform benefits of scaling different model axes, providing a nuanced understanding of capability-cost tradeoffs. Integrating scene perception with real-time model selection in a hierarchical planning setting enables more efficient and flexible deployment of large vision-language models in robotics. Its multi-axial, context-sensitive approach distinguishes it from existing static or single-modality routing strategies.

Limitations

The effectiveness of the routing depends heavily on the quality of scene and instruction embeddings; noisy or ambiguous scene data can impair decision accuracy, especially in highly dynamic or cluttered environments.
Training relies on synthetic data generation, which may not fully capture the complexity of real-world scenarios, potentially affecting generalization to unseen environments.
The current router architecture, while lightweight, may struggle with highly complex tasks requiring long-term planning or multi-step reasoning beyond current capacity, necessitating further model enhancements.

Future Work

Future research will focus on enhancing the robustness of multimodal scene encoding, integrating reinforcement learning to improve decision-making under uncertainty, and scaling the framework to handle more diverse and complex tasks. Additionally, exploring adaptive memory management and continual learning strategies could further optimize resource allocation. Extending the approach to multi-robot systems and multi-agent coordination, as well as integrating hardware-aware optimization, will be key directions. Ultimately, the goal is to develop fully autonomous, resource-efficient embodied agents capable of operating seamlessly in real-world, unstructured environments.

AI Executive Summary

The deployment of high-capability vision-language models (VLMs) as high-level planners in embodied agents has revolutionized robotic autonomy, enabling semantic understanding and complex task decomposition. However, scaling test-time compute—such as increasing reasoning depth, model size, or memory—inevitably leads to higher latency, token consumption, and FLOPs, which hampers real-world applicability. This challenge is particularly acute in robotics, where real-time response is critical, and excessive latency can render systems impractical.

To address this, the authors introduce DIRECT, a novel routing framework that leverages multimodal scene context to dynamically allocate compute resources among a pool of diverse VLM planners. By encoding scene images and instructions into a joint feature space, the lightweight router predicts each model’s expected success and resource cost, enabling per-task model selection that balances performance and efficiency. This approach effectively tailors the inference process to the specific demands of each task, avoiding the wastefulness of uniform scaling.

The core innovation lies in the recognition that different axes of scaling—chain-of-thought reasoning depth, model size, and memory history—offer distinct capability gains that are non-uniform across tasks. For example, deeper reasoning benefits semantic and spatially constrained tasks, larger models command a broader skill set, and memory strategies improve long-horizon planning. The framework’s ability to adaptively route based on scene and instruction features allows it to harness these nuances, leading to substantial efficiency gains.

Experiments on VLABench and RoboMME show that routing improves the success-cost trade-off over fixed model selection, and that the three scaling axes yield qualitatively distinct capability gains. In real robot experiments with a Franka arm, the router matches or exceeds a stronger model's success rate at up to 65% lower average latency, confirming its practical viability. These results highlight the importance of intelligent, context-aware compute allocation in embodied AI, paving the way for more scalable and resource-efficient robotic systems.

Overall, this work addresses a fundamental bottleneck in embodied AI—balancing model capability with real-time constraints—by introducing a scalable, multimodal routing strategy. Its implications extend beyond robotics, offering a blueprint for efficient deployment of large multimodal models across diverse AI applications. Future directions include enhancing robustness, generalization, and multi-agent coordination, aiming toward autonomous systems that are both highly capable and resource-conscious.

Deep Analysis

Background

The evolution of embodied AI has seen a shift from rule-based systems to deep learning-driven hierarchical models capable of complex reasoning and scene understanding. Early robotic systems relied on predefined behaviors, but recent advances leverage large language models such as GPT-4 and PaLM, along with multimodal vision-language models (VLMs) such as LLaVA and MiniGPT-4, which integrate visual perception with natural language understanding. These models enable robots to interpret instructions, reason about spatial and semantic constraints, and plan actions in unstructured environments. However, as model sizes grow from hundreds of millions to hundreds of billions of parameters, inference latency and computational cost increase dramatically, limiting real-time deployment. Existing approaches often employ static model selection or simple heuristics, which do not adapt to task complexity or scene context. This wastes resources and yields suboptimal performance, especially in multi-task scenarios requiring diverse capabilities. The paper builds on prior work in hierarchical planning, multimodal perception, and model routing, aiming to develop a context-aware, dynamic scheduling framework that allocates compute resources based on scene understanding and task demands.

Core Problem

The core challenge addressed is the inefficiency of uniform scaling strategies in embodied AI systems. While larger models and deeper reasoning chains can improve capabilities, they also incur higher latency and resource consumption, which are prohibitive in real-world robotic applications. Static deployment of a single high-capability model leads to unnecessary computational costs on simple tasks, while insufficient capacity on complex tasks causes failures. The fundamental problem is how to intelligently allocate test-time compute resources dynamically, considering the scene context, instruction complexity, and task-specific demands. This requires developing a predictive mechanism that can assess the difficulty of each task and select the most appropriate model accordingly. The difficulty is compounded by the non-linear and non-uniform benefits of scaling different axes—reasoning depth, model size, and memory—necessitating a nuanced, adaptive approach rather than a one-size-fits-all solution.

Innovation

The paper introduces several key innovations: (1) a multimodal scene-aware routing framework that encodes visual and textual information into a joint feature space, enabling context-sensitive model selection; (2) a lightweight regression-based router that predicts model quality and resource cost with minimal overhead, facilitating real-time decision-making; (3) a comprehensive analysis revealing that different scaling axes confer distinct capabilities, which vary non-linearly across tasks; (4) a training pipeline combining synthetic scene generation and real robot data to learn effective routing policies; and (5) extensive validation demonstrating significant efficiency gains in simulation and physical robots. This approach fundamentally shifts from static, uniform model deployment to a dynamic, scene-adaptive scheduling paradigm, optimizing resource utilization while maintaining high success rates.

Methodology

- Scene and instruction encoding: Scene images are processed through a frozen SigLIP visual encoder, extracting visual features; instructions are encoded via a frozen BGE-M3 text encoder. The two embeddings are concatenated into a unified feature vector.
- Quality and cost prediction: A regression head predicts the success probability (quality) and inference resource consumption (cost) for each candidate model, forming matrices Q and C.
- Model selection: The lightweight router r(·), based on models like multilayer perceptrons or KNN, takes the fused features ϕ(x) and outputs a model index k̂, optimizing a utility function that balances success and cost.
- Training: Synthetic data is generated by sampling scenes, prompting large VLMs for instruction generation, executing candidate models, and scoring success and latency. The router learns to predict the optimal model per task.
- Deployment: During inference, the router processes real scene and instruction inputs, predicts the best model, and dispatches accordingly. Multi-stage tasks trigger re-evaluation and re-routing based on updated scene states.
- Multi-scale scheduling: The framework considers reasoning depth, model size, and memory strategies as separate axes, enabling nuanced, task-specific resource allocation.
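The encoding, prediction, and selection steps above can be sketched end to end with a toy KNN router. This is an illustration under stated assumptions rather than the authors' implementation: the short hand-written vectors stand in for SigLIP and BGE-M3 embeddings, and the function names (`fuse`, `knn_predict`, `route`), the linear utility, and the training records are all hypothetical.

```python
import math


def fuse(image_emb, text_emb):
    """Concatenate image and text embeddings into one feature vector."""
    return list(image_emb) + list(text_emb)


def knn_predict(train, query, k):
    """Estimate per-model quality and cost from the k nearest records.

    train: list of (features, quality_per_model, cost_per_model) tuples,
           one quality and one cost entry per candidate model.
    """
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    n_models = len(nearest[0][1])
    quality = [sum(rec[1][m] for rec in nearest) / len(nearest)
               for m in range(n_models)]
    cost = [sum(rec[2][m] for rec in nearest) / len(nearest)
            for m in range(n_models)]
    return quality, cost


def route(train, image_emb, text_emb, k=2, cost_weight=0.01):
    """Return the index of the model with the highest assumed utility."""
    quality, cost = knn_predict(train, fuse(image_emb, text_emb), k)
    utilities = [q - cost_weight * c for q, c in zip(quality, cost)]
    return max(range(len(utilities)), key=utilities.__getitem__)


# Toy records: model 0 is small and fast, model 1 is large and slow.
# Quality is observed success (0 or 1); cost is latency in seconds.
train = [
    (fuse([0.0, 0.1], [0.0, 0.0]), [1.0, 1.0], [2.0, 20.0]),  # simple scene
    (fuse([0.1, 0.0], [0.1, 0.0]), [1.0, 1.0], [2.0, 20.0]),  # simple scene
    (fuse([1.0, 0.9], [1.0, 1.0]), [0.0, 1.0], [2.0, 20.0]),  # cluttered scene
    (fuse([0.9, 1.0], [0.9, 1.0]), [0.0, 1.0], [2.0, 20.0]),  # cluttered scene
]

simple_choice = route(train, [0.05, 0.05], [0.0, 0.0])
cluttered_choice = route(train, [1.0, 1.0], [1.0, 1.0])
print(simple_choice, cluttered_choice)
```

For a multi-stage task, the same `route` call would simply be repeated with the updated scene embedding at each stage, which mirrors the re-routing behavior described in the deployment step.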

Experiments

- Datasets: Experiments utilize VLABench for benchmark tasks, RoboMME for robotic manipulation, and real Franka DROID hardware for physical validation.
- Baselines: Comparisons include static low/high-cost models, random routing, and out-of-distribution detection-based routing.
- Metrics: Success rate, average latency, and efficiency score (η) are primary metrics, with additional ablation studies on feature fusion, model axes, and utility functions.
- Protocols: Large-scale simulation involves over 270,000 routing decisions; hardware tests include 245 trajectories across diverse tasks.
- Ablation: Variations in scene encoding, model pool size, and routing objectives are systematically evaluated to identify optimal configurations.
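As a rough illustration of how success rate and average latency could be compared across static baselines and an oracle, the sketch below evaluates policies over logged episodes. The episode format, model names, outcomes, and the `oracle` definition (cheapest model that succeeds) are assumptions made for this example; the efficiency score η is not defined in this summary, so it is left out.

```python
def evaluate(policy, episodes):
    """Return (success_rate, mean_latency) for a policy over logged episodes.

    episodes: list of dicts mapping model name -> (succeeded, latency_seconds)
    policy:   function that takes one episode dict and returns a model name
    """
    picks = [episode[policy(episode)] for episode in episodes]
    success_rate = sum(1 for ok, _ in picks if ok) / len(picks)
    mean_latency = sum(latency for _, latency in picks) / len(picks)
    return success_rate, mean_latency


def oracle(episode):
    """Cheapest model that succeeds, or the cheapest model if none does.

    This peeks at outcomes, so it is an upper bound, not a deployable router.
    """
    winners = [name for name, (ok, _) in episode.items() if ok]
    pool = winners or list(episode)
    return min(pool, key=lambda name: episode[name][1])


# Made-up outcomes for four tasks and two candidate models.
episodes = [
    {"small": (True, 2.0), "large": (True, 20.0)},
    {"small": (True, 2.0), "large": (True, 20.0)},
    {"small": (False, 2.0), "large": (True, 20.0)},
    {"small": (False, 2.0), "large": (False, 20.0)},
]

static_small = evaluate(lambda episode: "small", episodes)
static_large = evaluate(lambda episode: "large", episodes)
oracle_result = evaluate(oracle, episodes)
print(static_small, static_large, oracle_result)
```

In this toy log the oracle reaches the large model's success rate at roughly a third of its mean latency, which is the kind of gap a learned router tries to close using only the scene and instruction.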

Results

- The routing framework consistently outperforms static and random baselines, achieving success rates within 1-2% of oracle models while reducing latency by up to 65%. For example, on VLABench, success rate improvements of 15% and latency reductions of 30 seconds were observed.
- In robotic experiments, the system successfully handled long-horizon chaining and zero-shot manipulation, matching or exceeding the performance of the strongest models with significantly lower latency.
- The analysis of axes—reasoning depth, model size, memory—revealed that each contributes differently to capability gains, and the routing strategy effectively exploits these differences, leading to a more efficient and adaptable system.

Applications

- Autonomous robotic manipulation: enabling robots to adaptively select models based on scene complexity for efficient task execution.
- Multi-modal AI systems: optimizing resource allocation in large-scale multimodal models for natural language understanding and decision-making.
- Industrial automation: dynamic scheduling of robotic tasks in manufacturing lines, reducing energy and computational costs.
- Assistive robots: improving real-time responsiveness and task success in household or service environments by scene-aware model selection.

Limitations & Outlook

- The approach relies on high-quality scene and instruction embeddings; noisy or ambiguous inputs can impair routing accuracy.
- Synthetic data generation for training may not fully capture real-world variability, affecting generalization.
- The current router architecture may face challenges with highly complex, multi-step, or long-horizon tasks requiring deeper planning, necessitating further model enhancements.
- Hardware resource constraints and real-time processing demands could limit scalability in large, dynamic environments.

Plain Language Accessible to non-experts

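Imagine you are cooking in a kitchen. Different dishes need different tools and different amounts of preparation time. Some dishes are simple and a microwave is enough; others are complex and need a stove, pans, seasonings, and a lot of time. You choose your tools sensibly based on how hard the dish is and what the kitchen looks like, so that you finish quickly and the food still tastes good. A robot faces the same choice when it carries out a task: based on the scene and the instruction, it has to decide which model to use. A simple task can go to a fast model; a complex one goes to a slower, more capable model. DIRECT, the framework proposed in this paper, acts like a smart kitchen assistant that looks at the scene, understands the task, and then decides which tool (model) to use, saving time without sacrificing results. That way the kitchen (the robot system) can complete all kinds of dishes (tasks) faster and better, without reaching for the biggest, slowest tool every time.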

ELI14 Explained like you're 14

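Imagine you are at school, where many teachers teach different subjects. Some teachers move quickly, which suits simple questions; others go slowly but can help you understand hard problems. You would not ask the most capable, slowest teacher for help every single time, because that would waste time. Instead, you pick the right teacher for how hard the question is. A robot makes the same kind of choice when it carries out a task: it has to decide which model to use. A simple task only needs a fast model, and only a complex task calls for a slower, more capable one. This paper proposes a smart "scheduler" that looks at the scene and the task and decides which model fits best. That way the robot finishes tasks faster while still getting good results, just like matching the right teacher to the right subject. This makes robots both smarter and more practical!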

Abstract

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success-cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at jadee-dao.github.io/direct/.
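The success-cost Pareto frontier mentioned in the abstract can be illustrated with a short sketch. A configuration is on the frontier when no other configuration has both lower (or equal) cost and higher (or equal) success, with at least one of the two strictly better. The configuration names and numbers below are made up for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated points, ordered by increasing cost.

    points: list of (name, cost, success) tuples, where lower cost and
            higher success are better. Exact duplicates are kept once.
    """
    frontier = []
    best_success = float("-inf")
    # Sweep from cheapest to most expensive; among equal costs, visit the
    # most successful point first so the others are recognized as dominated.
    for name, cost, success in sorted(points, key=lambda p: (p[1], -p[2])):
        if success > best_success:
            frontier.append((name, cost, success))
            best_success = success
    return frontier


# Hypothetical (mean latency in seconds, success rate) pairs.
points = [
    ("static_small", 2.0, 0.50),
    ("static_large", 20.0, 0.75),
    ("router", 8.0, 0.75),
    ("random", 11.0, 0.60),
]

frontier_names = [name for name, _, _ in pareto_frontier(points)]
print(frontier_names)
```

In this toy data the router dominates the large static model because it reaches the same success rate at lower latency, which is one way a routing policy can improve the frontier over fixed model selection.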

cs.RO cs.AI cs.CV

