Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

TL;DR

Proposes graph-bound execution-state capsules for low-latency, small-batch on-device AI, enabling byte-exact snapshot and restore of the full execution state, with sub-millisecond GPU-resident snapshot and restore.

cs.LG 2026-06-19
Liang Su
Deep Learning, Model Compression, Edge Computing, State Management, GPU Acceleration

Key Findings

Methodology

This paper introduces a graph-bound execution-state checkpoint and restore mechanism built on FlashRT, a white-box kernel runtime evaluated with an NVIDIA CUDA backend. FlashRT runs captured graph plans over contiguous static buffers with no block-table indirection, so the live state is a closed set of named buffers. A capsule packages that set (KV, recurrent, convolutional, and MTP state, plus metadata) at a committed boundary. CUDA Graphs record the execution plan once and replay it afterwards. The capsule supports snapshot, restore, fork, and rollback, which turns session switching into a buffer copy rather than a recomputation. Experiments on an NVIDIA RTX 5090 show byte-exact restore and token-identical greedy decoding, with sub-millisecond GPU-resident snapshot and restore. Validation on Jetson AGX Thor and DGX Spark confirms that the same correctness and structural properties hold. Moving the unit of reuse from token-addressed KV fragments to graph-bound execution boundaries reduces latency most in low-batch, interactive scenarios.
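The four capsule operations can be illustrated with a minimal host-side sketch. It is not FlashRT's API: the class names `StaticState` and `Capsule`, the buffer names, and the sizes are all hypothetical, and plain `bytearray` objects stand in for GPU buffers. The point it shows is that once the live state is a closed set of named, fixed-size buffers, snapshot, restore, fork, and rollback all reduce to whole-buffer copies.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Capsule:
    """Copy of every named buffer at one committed boundary (hypothetical type)."""
    buffers: Dict[str, bytes]


class StaticState:
    """Toy stand-in for a closed set of fixed-size, fixed-address buffers."""

    def __init__(self, sizes: Dict[str, int]) -> None:
        # Allocated once; later operations only overwrite contents in place.
        self.buffers = {name: bytearray(size) for name, size in sizes.items()}

    def snapshot(self) -> Capsule:
        return Capsule({name: bytes(buf) for name, buf in self.buffers.items()})

    def restore(self, capsule: Capsule) -> None:
        if capsule.buffers.keys() != self.buffers.keys():
            raise ValueError("capsule does not match this buffer layout")
        for name, data in capsule.buffers.items():
            if len(data) != len(self.buffers[name]):
                raise ValueError(f"size mismatch for buffer {name!r}")
            self.buffers[name][:] = data  # in-place copy into the same object

    def fork(self, capsule: Capsule) -> "StaticState":
        # A fork is a second buffer set initialised from the same capsule.
        child = StaticState({n: len(b) for n, b in capsule.buffers.items()})
        child.restore(capsule)
        return child


state = StaticState({"kv": 8, "recurrent": 4, "conv": 4, "mtp": 2, "meta": 2})
state.buffers["kv"][0] = 7
state.buffers["recurrent"][0] = 3
boundary = state.snapshot()          # committed boundary

state.buffers["kv"][0] = 99          # work past the boundary
state.restore(boundary)              # rollback to the boundary
branch = state.fork(boundary)        # independent branch from the same boundary
branch.buffers["kv"][1] = 42         # does not touch the parent state
```

Rollback here is simply a restore of an earlier capsule, and the byte-exact check is an equality comparison between two snapshots.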

Key Results

  • On RTX 5090, capsule restore is byte-exact, and time-to-first-token (TTFT) is 2.6-2.8× lower than with vLLM’s warm prefix cache baseline.
  • Against cold prefill, the TTFT speedup grows with prefix length, from 3.9× at 2k tokens to 27× at 16k tokens.
  • Outputs are token-identical under greedy decoding, and action replay in vision-language tasks reaches a cosine similarity of 1.0.
  • On Jetson AGX Thor, speedups over cold prefill range from 9× to 76×, which supports the approach on resource-constrained devices.
  • In the ablation, a KV-only restore diverges at the first token, so recurrent state must be part of the capsule.
  • The experiments use prefix lengths from 2k to 16k tokens and report latency as TTFT and TTFA, with vLLM’s prefix cache as the baseline. The evaluated models carry recurrent, convolutional, and MTP state in addition to KV. The measurements cover byte-level restore fidelity, session-switching speed, and the effect of including recurrent state.
  • Across these measurements the graph-bound capsule outperforms KV caching in low-latency scenarios, and the advantage grows with prefix length. The same properties hold across the tested hardware platforms.
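The KV-only ablation can be mimicked with a toy decoder. This is an illustrative sketch only: the update rule, the constants, and the function names are invented, and nothing here models the paper's actual networks. It shows why a restore that brings back the KV history but resets the recurrent state cannot reproduce the original continuation.

```python
from typing import List, Tuple


def step(kv: List[int], h: int, token: int) -> Tuple[int, int]:
    """One toy decode step: append to KV, update recurrent state, pick a token."""
    kv.append(token)
    h = (h * 31 + token + 1) % 257      # recurrent state carried across steps
    return h, (sum(kv) + h) % 50        # deterministic "greedy" next token


def prefill(prompt: List[int]) -> Tuple[List[int], int, int]:
    """Run the prompt and return (KV history, recurrent state, pending token)."""
    kv: List[int] = []
    h, token = 0, 0
    for t in prompt:
        h, token = step(kv, h, t)
    return kv, h, token


def decode(kv: List[int], h: int, token: int, n: int) -> List[int]:
    """Decode n tokens from a restored state without mutating the saved KV."""
    kv = list(kv)
    out = []
    for _ in range(n):
        h, token = step(kv, h, token)
        out.append(token)
    return out


kv, h, pending = prefill([3, 1, 4, 1, 5])
reference = decode(kv, h, pending, 4)        # continue from the live state
full_restore = decode(kv, h, pending, 4)     # capsule with KV and recurrent state
kv_only = decode(kv, 0, pending, 4)          # KV restored, recurrent state reset

print("full restore matches:", full_restore == reference)
print("KV-only matches:", kv_only == reference)
```

In this toy the KV-only continuation already differs at its first token, which mirrors the qualitative behaviour reported for the ablation.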

Significance

This work targets a bottleneck in on-device AI deployment, where low latency and rapid session switching matter more than aggregate throughput. High-throughput serving architectures optimize for concurrency and often neglect interactive, latency-sensitive applications. The graph-bound execution-state capsule instead takes complete, byte-exact snapshots of model state at committed boundaries. This allows fast recovery, session fork, and rollback, and so shortens response times in real-time systems such as robots, voice assistants, and multimodal interfaces. The approach complements high-throughput server-side inference rather than replacing it, and it offers a starting point for future multimodal, multi-task, and multi-device deployments.

Technical Contribution

The paper’s main technical contribution is a graph-bound execution-state capsule that holds the complete model state in static contiguous buffers, avoiding the block-table indirection of paged KV caches. The CUDA graph plan is captured once over these fixed buffers, which enables byte-exact snapshots and sub-millisecond GPU-resident restore. Because the buffers keep their addresses, the captured graph remains valid after a restore, so fork and rollback need no graph recapture. The paper also introduces a minimal C ABI for managing capsules, intended to ease integration. Correctness is validated with byte-level and token-level tests, which show that restore produces identical outputs under greedy decoding. The same properties are demonstrated on multiple hardware platforms. Shifting the unit of reuse from token-addressed fragments to graph-bound state gives latency-sensitive inference a new engineering option.
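Why fixed buffers matter for graph replay can be sketched with a small analogy. A captured CUDA graph records kernel launch parameters, including device pointers, so it keeps operating on the buffers it was captured with. The `Plan` class below is a hypothetical stand-in for such a graph, with Python object references playing the role of pointers. It contrasts restoring in place with restoring into a new allocation.

```python
class Plan:
    """Toy analogue of a captured graph: buffer references are fixed at capture."""

    def __init__(self, src: bytearray, dst: bytearray) -> None:
        self._src = src  # stands in for a device pointer baked into the graph
        self._dst = dst

    def replay(self) -> None:
        # The recorded "kernel": dst[i] = src[i] + 1 (mod 256).
        for i, value in enumerate(self._src):
            self._dst[i] = (value + 1) % 256


live = bytearray([1, 2, 3])      # static state buffer
out = bytearray(3)               # static output buffer
plan = Plan(live, out)           # captured once

saved = bytes(live)              # snapshot at the boundary
live[:] = bytes([10, 11, 12])    # state advances after the boundary

live[:] = saved                  # restore in place: same buffer object
plan.replay()
in_place_result = bytes(out)     # the plan sees the restored state

live[:] = bytes([10, 11, 12])    # state advances again
relocated = bytearray(saved)     # "restore" into a new allocation instead
plan.replay()
relocated_result = bytes(out)    # the plan still reads the old buffer
```

After the in-place restore the replay reflects the saved state. After the relocated restore it does not, because the plan never learns about the new buffer. In a real runtime the second case would require updating the graph's node parameters or recapturing.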

Novelty

The paper claims to be the first to encapsulate a model’s entire execution state as a graph-bound, self-contained buffer set rather than a token-addressed KV cache. Prior work relies on indirect addressing and prefix reuse. This method instead captures the full model state, including recurrent and convolutional state, over static buffers, which enables byte-exact restore and rapid session management. Using CUDA Graphs to capture and replay the entire forward pass over contiguous buffers is the engineering choice that delivers low latency and high fidelity. The unit of reuse changes from token fragments to complete execution boundaries.

Limitations

  • The current design relies on static, contiguous buffers, which may not be suitable for models with highly dynamic or frequently changing states, limiting flexibility in such scenarios.
  • Large models with extensive state information may lead to substantial storage and transmission overhead for capsules, impacting scalability and deployment in resource-constrained environments.
  • The implementation is validated only on NVIDIA hardware with CUDA; support for other backends (e.g., AMD GPUs) remains an open challenge requiring further research.

Future Work

Future research will focus on extending the capsule mechanism to support dynamic models, multi-modal states, and more flexible buffer management strategies. Additionally, optimizing storage and transmission through compression techniques will be explored to handle larger models efficiently. Cross-platform compatibility and integration into broader AI serving frameworks are also key directions. Finally, scaling the approach to multi-GPU and distributed settings, as well as automating capsule management for complex multi-turn interactions, will be pursued to facilitate widespread adoption in edge and embedded AI systems.

AI Executive Summary

The rapid evolution of large language models (LLMs) and their deployment in real-time, interactive applications have created a pressing need for ultra-low latency state management. Traditional high-throughput serving systems like vLLM and SGLang excel at handling massive concurrent requests through sophisticated KV cache mechanisms, but they are inherently optimized for batch processing rather than single-request responsiveness. In scenarios such as robotic control, conversational AI, and multimodal interfaces, response time—specifically the time-to-first-token (TTFT)—must be minimized to ensure seamless user experience and real-time decision-making.

Existing solutions rely heavily on token-addressed KV caches, which, while effective for throughput, pose limitations in rapid session resets, forks, and rollbacks. These mechanisms depend on indirect addressing and partial state reuse, which introduce latency and complexity when switching contexts or recovering from interruptions. Recognizing this gap, the authors propose a fundamentally different approach: the execution-state capsule. This mechanism encapsulates the entire model state at a boundary point into a fixed set of contiguous, self-contained buffers, enabling byte-exact snapshot and restore operations.

The core innovation lies in capturing the complete forward pass as a CUDA graph plan over static buffers, avoiding the block-table indirection of paged KV caches. The model’s execution state, including KV, recurrent, convolutional, and multi-token prediction (MTP) state, lives in a self-contained set of fixed buffers. When a session is resumed or forked, the system copies this buffer set without recapturing the graph or recomputing the prefix, which yields sub-millisecond GPU-resident restore. On an NVIDIA RTX 5090, TTFT drops by 2.6-2.8× relative to vLLM’s warm prefix cache, and the speedup over cold prefill grows with prefix length.
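How quickly the speedup grows with prefix length can be estimated from the two figures quoted in the abstract (3.9x at 2k tokens and 27x at 16k tokens, both over cold prefill on RTX 5090). The sketch below fits a power law through those two points. This is a back-of-the-envelope estimate under the assumption that a power law is a reasonable shape, not an analysis from the paper.

```python
import math

# Figures quoted in the abstract: TTFT speedup over cold prefill on RTX 5090.
# Whether "2k" means 2000 or 2048 does not matter; only the ratio is used.
short_prefix, short_speedup = 2_000, 3.9
long_prefix, long_speedup = 16_000, 27.0

prefix_ratio = long_prefix / short_prefix        # how much longer the prefix is
speedup_ratio = long_speedup / short_speedup     # how much the speedup grew

# Exponent p in speedup ~ prefix_length ** p, fitted through the two points.
exponent = math.log(speedup_ratio) / math.log(prefix_ratio)

print(f"speedup grew {speedup_ratio:.2f}x for a {prefix_ratio:.0f}x longer prefix")
print(f"implied scaling exponent: {exponent:.2f}")
```

The exponent comes out at roughly 0.93, so over this range the speedup grows almost linearly with prefix length. That is consistent with cold prefill cost rising with the prefix while restore cost stays comparatively small.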

Validation across multiple hardware platforms, including Jetson AGX Thor and DGX Spark, confirms the robustness and portability of the mechanism. The results show that the capsule approach not only achieves byte-exact restore but also maintains token-level fidelity under greedy decoding, ensuring output consistency. The ablation studies highlight the importance of including recurrent states, which are critical for complex models with hybrid attention mechanisms.

This work significantly advances the state of low-latency AI serving, providing a practical solution for real-time applications that require rapid context switching, session management, and interruption handling. By shifting the unit of reuse from token fragments to complete execution boundaries, the authors open new pathways for deploying responsive AI systems on resource-constrained edge devices. The proposed framework lays a solid foundation for future research into multi-modal, multi-task, and multi-device AI, promising a new era of intelligent, autonomous systems capable of instant adaptation and interaction.

While the results are promising, challenges remain in scaling the approach to highly dynamic models, reducing storage overhead for large states, and extending cross-platform support. Nonetheless, this research is a step toward responsive on-device AI: it complements high-throughput serving with a latency-first option for explicit execution-state reuse.

Deep Dive

Abstract

Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.

cs.LG cs.DC