DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

TL;DR

DeltaBox achieves millisecond-level incremental checkpoint/rollback for AI agents via DeltaFS and DeltaCR, averaging 14 ms per checkpoint and 5 ms per rollback on SWE-bench.

cs.OS | Advanced | 2026-05-22

Yunpeng Dong, Jingkai He, Yuze Hou, Dong Du, Zhonghu Xu, Si Yu, Yubin Xia, Haibo Chen

AI Agents, State Management, Incremental Checkpointing, Operating Systems, MicroVM

Key Findings

Methodology

This paper introduces DeltaBox, a stateful AI agent sandbox system that uses the DeltaState OS abstraction to incrementally checkpoint and roll back coupled filesystem and process memory states. DeltaBox comprises two co-designed OS mechanisms. DeltaFS is an OverlayFS-based filesystem layer that freezes the writable layer and inserts a new one at runtime, which reduces file updates to copy-on-write and makes layer switching fast. DeltaCR is a process state management system built on CRIU that performs incremental dumps and accelerates rollback by calling fork() on a template process, complemented by asynchronous page warm-up. Together, they keep filesystem and memory states atomically consistent and deliver the millisecond-level checkpoint/restore that high-frequency state exploration in AI agents requires.

Key Results

On the SWE-bench benchmark, DeltaBox achieves an average checkpoint latency of 14 ms and an average rollback latency of 5 ms, reducing state management overhead by over 90% compared to Firecracker VM snapshots, which benefits deep tree search and reinforcement learning workloads.
DeltaBox reduces the proportion of state management time from 47–77% in baseline coupled filesystem approaches to 3–6%, enabling AI agents to explore substantially more nodes within fixed time budgets and improving task success rates.
Ablation studies demonstrate that DeltaFS's dynamic layer switching and DeltaCR's template fork mechanism contribute approximately 40% and 35% of the performance gains respectively, while asynchronous warm-up further smooths copy-on-write induced latency spikes, ensuring stable low-latency operation.

Significance

This work addresses a critical bottleneck in LLM-powered AI agents: the high latency of checkpoint and rollback operations caused by full state duplication. By introducing an OS-level abstraction and mechanisms for incremental state management, DeltaBox provides efficient, atomic, low-latency checkpoint/restore of coupled filesystem and process memory states. This enables deeper and broader state exploration in AI agents, benefiting automated software engineering, code repair, and complex task automation. It also connects AI agent research with systems infrastructure, with both academic and practical relevance.

Technical Contribution

DeltaBox's technical contributions lie in pioneering the DeltaState OS abstraction for transactional, incremental state management, and designing two synergistic mechanisms: DeltaFS, which extends OverlayFS with runtime reconfigurable writable layers for efficient filesystem checkpoint/restore, and DeltaCR, which couples CRIU incremental dumps with template process fork-based fast restore and asynchronous page warm-up. This architecture achieves millisecond-level checkpoint/rollback latency, a significant departure from prior full-copy or VM snapshot approaches, enabling new possibilities for high-frequency stateful AI agent workflows.

Novelty

DeltaBox is the first system to apply incremental, change-based checkpoint/restore at the OS level specifically tailored for stateful AI agents. Unlike prior work relying on full state duplication or VM-level snapshots, it innovates with dynamic writable layer switching in OverlayFS and template process fork recovery integrated with CRIU. This fundamental shift from bulk copying to delta-based transactional state management fills a critical infrastructure gap, enabling efficient high-frequency state exploration previously unattainable.

Limitations

Dependency on underlying filesystems supporting reflink (e.g., XFS) limits deployment flexibility in environments lacking such features.
The bounded template process pool can lead to eviction under frequent checkpoint/rollback, causing fallback to slower CRIU restore paths and increased latency.
Current design targets single-node microVM environments and does not address distributed multi-node state synchronization and consistency challenges.

Future Work

Future directions include extending DeltaBox to support distributed incremental state synchronization across nodes, enhancing template pool management strategies to maintain performance under high-frequency switching, broadening compatibility with diverse filesystems and container technologies, and integrating with advanced AI reasoning and training frameworks to further leverage incremental state management for complex agent workflows.

AI Executive Summary

Large Language Model (LLM)-powered AI agents have become pivotal in automating complex tasks such as software engineering and code repair. These agents rely heavily on high-frequency state exploration techniques like Monte Carlo Tree Search (MCTS) and reinforcement learning, which necessitate rapid checkpoint and rollback (C/R) of their entire sandbox environments, including both filesystem and process memory states. Traditional methods employ full duplication of these states, incurring latencies ranging from hundreds of milliseconds to seconds per checkpoint or rollback, severely limiting the depth and scale of search and training.

To overcome these limitations, the authors propose DeltaBox, a novel agent sandbox system that introduces the DeltaState OS abstraction. This abstraction leverages the insight that consecutive checkpoints differ only slightly, enabling the system to store and restore only the incremental changes between states. DeltaBox implements this through two co-designed mechanisms: DeltaFS and DeltaCR. DeltaFS extends Linux OverlayFS with dynamic writable layer freezing and insertion, facilitating copy-on-write semantics and enabling fast, unmount-free layer switching for filesystem state management. DeltaCR builds upon CRIU to perform incremental process state dumps and accelerates rollback via template process fork(), supplemented by asynchronous page warm-up to reduce copy-on-write overhead during recovery.

Together, these mechanisms maintain atomic consistency between filesystem and process memory states, which is crucial for deterministic AI agent behavior. DeltaBox also uses the natural wait during LLM inference to mask checkpoint overhead. Experimental evaluation on the SWE-bench benchmark and reinforcement learning micro-benchmarks shows that DeltaBox achieves average checkpoint and rollback latencies of 14 ms and 5 ms respectively, reducing state management overhead by over 90% compared to Firecracker VM snapshots. This reduction allows AI agents to explore significantly more nodes within fixed time budgets, improving search depth and training throughput.

The work addresses a fundamental bottleneck in AI agent infrastructure, bridging the gap between advanced AI algorithms and system-level state management. By enabling millisecond-level, incremental, and coupled checkpoint/restore, DeltaBox paves the way for more sophisticated and scalable AI agent applications. Its design principles and implementation provide a blueprint for future systems aiming to support high-frequency stateful computations.

Looking forward, the authors identify opportunities to extend DeltaBox’s capabilities to distributed environments, optimize template pool management, and broaden compatibility with diverse storage and container platforms. Such extensions would let AI agents perform deeper, faster, and more reliable state exploration.

Deep Analysis

Background

The field of AI agents powered by Large Language Models (LLMs) has rapidly evolved, enabling automation of complex tasks such as code repair, desktop automation, and web navigation. Benchmarks like SWE-bench, OSWorld, and AgentBench have driven advances by providing realistic testing environments. Modern AI agents employ tree-structured search strategies, including Monte Carlo Tree Search (MCTS) and Language Agent Tree Search (LATS), to systematically explore possible action sequences. Execution-guided sampling methods like Best-of-N further enhance exploration by running multiple parallel candidate trajectories. These approaches require frequent checkpoint and rollback operations to manage the agent’s state, which comprises both durable filesystem state (e.g., working directories, installed packages) and ephemeral process state (e.g., memory, interpreter heap).

Reinforcement learning training also imposes infrastructure demands, as multiple sandbox instances must be rapidly cloned and restored to a known-good state for rollouts. Existing sandbox technologies, such as Docker containers and Firecracker microVMs, rely on full state duplication for checkpoint/restore, resulting in latencies from hundreds of milliseconds to seconds. This latency bottleneck restricts the depth of search trees and the scale of parallelism achievable, limiting the effectiveness of AI agents in real-world scenarios.

Core Problem

The core problem addressed is the high latency and inefficiency of checkpoint and rollback operations in stateful AI agent environments. Traditional methods duplicate the entire sandbox state—including filesystem and process memory—at each checkpoint, incurring prohibitive overheads of hundreds of milliseconds to seconds. This latency severely constrains the agent’s ability to perform deep tree search and large-scale parallel exploration. Additionally, existing approaches often manage filesystem and process states separately, risking inconsistency and semantic divergence upon rollback. Efficient, atomic, and low-latency checkpoint/restore mechanisms that handle both states jointly are lacking. Furthermore, leveraging idle inference time to mask checkpoint overhead remains an open challenge. These bottlenecks hinder both test-time reasoning and reinforcement learning training workflows.

Innovation

DeltaBox introduces several key innovations:

- DeltaState OS abstraction: treats filesystem and process memory as a coupled transactional state, storing only incremental changes between checkpoints to minimize duplication.

- DeltaFS: an extension of Linux OverlayFS enabling runtime reconfiguration of writable layers. It dynamically freezes the current writable layer as read-only and inserts a new writable layer without unmounting, enabling copy-on-write semantics and fast checkpoint/rollback via simple layer switching.

- DeltaCR: builds on CRIU to perform asynchronous incremental dumps and creates frozen template processes via fork() to enable millisecond-level fast rollback. An asynchronous warm-up thread preemptively handles copy-on-write page faults to reduce recovery latency.

- Inference-masked checkpointing: exploits the LLM inference wait window to hide checkpoint overhead, maintaining seamless agent operation.

These innovations collectively overcome the limitations of full state duplication and enable high-frequency, low-latency, and atomic checkpoint/restore suitable for complex AI agent workloads.
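The template-fork idea behind DeltaCR rests on ordinary fork() semantics: a child starts from a copy-on-write view of the parent's memory, so the parent can serve as an untouched template. The sketch below only illustrates that property in Python on a POSIX system. The helper name `run_in_fork` is hypothetical, and nothing here reflects how DeltaCR freezes, pools, or restores real agent processes.

```python
import os

def run_in_fork(state, mutate):
    """Fork a child that mutates its private copy of `state`.

    Returns a text rendering of the state as the child saw it.
    The parent's `state` object is never modified.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: memory is a copy-on-write copy of the parent's.
        try:
            os.close(r)
            mutate(state)
            os.write(w, repr(sorted(state.items())).encode())
        finally:
            os._exit(0)  # never return into the parent's code path
    # Parent: acts as the pristine template.
    os.close(w)
    with os.fdopen(r, "rb") as f:
        child_view = f.read().decode()
    os.waitpid(pid, 0)
    return child_view

template_state = {"step": 3}
seen_by_child = run_in_fork(template_state, lambda s: s.update(step=99))
print(seen_by_child)    # the child's diverged view
print(template_state)   # the template is unchanged
```

Each rollback in this picture is one more fork() from the same template, which is why no state has to be copied up front.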

Methodology

- DeltaState abstraction: conceptualizes the agent’s sandbox state as a pair of filesystem and process memory states, managed incrementally.

- DeltaFS mechanism:
  - Based on Linux OverlayFS, supports multiple overlay layers.
  - At checkpoint, freezes the current writable layer (upper layer) into a read-only lower layer.
  - Inserts a fresh writable layer atop the stack for subsequent writes.
  - Employs per-inode generation counters for lazy redirection of open file descriptors across checkpoints.
  - Reduces file updates to copy-on-write operations, enabling rollback by switching overlay layers.
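The freeze-and-insert behavior described for DeltaFS can be pictured with a toy in-memory model. The sketch below is not DeltaFS code: the `LayeredFS` class and its method names are hypothetical, layers are plain dictionaries, and file deletion (whiteouts), open file descriptors, and branching histories are left out. It only shows why a checkpoint costs the same no matter how much state exists, and why rollback amounts to discarding layers.

```python
class LayeredFS:
    """Toy overlay stack: frozen read-only layers plus one writable upper layer."""

    def __init__(self):
        self.frozen = []  # read-only layers, oldest first
        self.upper = {}   # current writable layer: path -> bytes

    def write(self, path, data):
        # Writes only ever land in the upper layer (file-level copy-on-write).
        self.upper[path] = data

    def read(self, path):
        # Look in the newest layer first, then walk down the stack.
        for layer in [self.upper, *reversed(self.frozen)]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def checkpoint(self):
        # Freeze the upper layer and insert a fresh writable one on top.
        self.frozen.append(self.upper)
        self.upper = {}
        return len(self.frozen)  # checkpoint id = number of frozen layers

    def rollback(self, ckpt_id):
        # Drop layers created after the checkpoint and discard pending writes.
        del self.frozen[ckpt_id:]
        self.upper = {}

fs = LayeredFS()
fs.write("app.py", b"v1")
ckpt = fs.checkpoint()
fs.write("app.py", b"v2")
fs.rollback(ckpt)
print(fs.read("app.py"))  # b'v1'
```

Neither `checkpoint` nor `rollback` copies file contents in this model; both only adjust which layers are on the stack.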

- DeltaCR mechanism:
  - Performs asynchronous incremental CRIU dumps to capture process memory changes.
  - Simultaneously creates a frozen template process via fork() at quiescent points.
  - On rollback, forks the template process for rapid restoration; if the template has been evicted, falls back to CRIU restore.
  - Runs a background asynchronous warm-up thread to pre-fault writable memory pages, mitigating copy-on-write overhead.
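The fast-path and fallback decision on rollback can be sketched as a bounded pool lookup. The summary does not say which eviction policy DeltaCR uses, so the least-recently-used policy below is an assumption, as are the `TemplatePool` class, its method names, and the string labels standing in for the two restore paths.

```python
from collections import OrderedDict

class TemplatePool:
    """Bounded pool of template handles keyed by checkpoint id (LRU is assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.templates = OrderedDict()

    def add(self, ckpt_id, template):
        # Register a new template and evict the least recently used if full.
        self.templates[ckpt_id] = template
        self.templates.move_to_end(ckpt_id)
        while len(self.templates) > self.capacity:
            self.templates.popitem(last=False)

    def restore(self, ckpt_id):
        # Fast path: a template is still resident, so fork from it.
        if ckpt_id in self.templates:
            self.templates.move_to_end(ckpt_id)
            return "fork-from-template"
        # Slow path: template was evicted, rebuild from the on-disk dump.
        return "criu-restore"

pool = TemplatePool(capacity=2)
for ckpt_id in (1, 2, 3):
    pool.add(ckpt_id, template=f"proc-{ckpt_id}")
print(pool.restore(3))  # fork-from-template
print(pool.restore(1))  # criu-restore
```

A search that keeps returning to old checkpoints would hit the slow path more often in this model, which matches the limitation noted for the bounded pool.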

- StateManager coordination:
  - Host-side Sandbox Controller maintains a global snapshot tree with metadata.
  - Guest-side State Daemon executes checkpoint and restore operations locally.
  - Ensures atomic consistency between filesystem and process states.
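A snapshot tree of the kind the Sandbox Controller maintains can be sketched as nodes that each record a filesystem layer and a process image together, so a rollback always selects a matching pair. The field names, the id scheme, and the `SnapshotTree` API below are hypothetical; the summary does not describe the actual metadata layout.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class SnapshotNode:
    ckpt_id: int
    parent: Optional["SnapshotNode"]
    fs_layer: str      # stand-in for a frozen filesystem layer
    proc_image: str    # stand-in for the matching process state
    children: list = field(default_factory=list)

class SnapshotTree:
    def __init__(self):
        self.root = SnapshotNode(0, None, "layer-0", "image-0")
        self.nodes = {0: self.root}
        self.current = self.root

    def checkpoint(self):
        # Record filesystem and process state as one node, under the current one.
        ckpt_id = len(self.nodes)
        node = SnapshotNode(ckpt_id, self.current,
                            f"layer-{ckpt_id}", f"image-{ckpt_id}")
        self.current.children.append(node)
        self.nodes[ckpt_id] = node
        self.current = node
        return ckpt_id

    def rollback(self, ckpt_id):
        # Both halves of the state come from the same node.
        self.current = self.nodes[ckpt_id]
        return self.current.fs_layer, self.current.proc_image

    def lineage(self, ckpt_id):
        # Path from the root to the given checkpoint.
        chain, node = [], self.nodes[ckpt_id]
        while node is not None:
            chain.append(node.ckpt_id)
            node = node.parent
        return chain[::-1]

tree = SnapshotTree()
a = tree.checkpoint()
b = tree.checkpoint()
tree.rollback(a)
c = tree.checkpoint()   # a sibling branch of b
print(tree.lineage(c))  # [0, 1, 3]
```

Branching after a rollback is what lets a tree search keep several explored trajectories alive at once.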

- Inference-masked checkpointing:
  - Checkpoints are triggered during LLM inference wait times.
  - Network connections to the LLM are isolated via a proxy daemon to maintain uninterrupted communication during checkpoint.
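The effect of masking can be illustrated by running a checkpoint concurrently with a simulated inference wait. In the sketch below both operations are stand-ins that merely sleep, and the function names and durations are made up for illustration. It shows only that the step takes roughly as long as the inference call alone when the checkpoint is shorter.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def llm_inference(delay_s):
    # Stand-in for waiting on a remote LLM response.
    time.sleep(delay_s)
    return "next-action"

def checkpoint(delay_s):
    # Stand-in for the sandbox checkpoint work.
    time.sleep(delay_s)
    return "ckpt-done"

def step_masked(infer_s, ckpt_s):
    """Run the checkpoint while the inference request is in flight."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        ckpt = pool.submit(checkpoint, ckpt_s)
        action = pool.submit(llm_inference, infer_s)
        result = (action.result(), ckpt.result())
    return result, time.perf_counter() - start

if __name__ == "__main__":
    result, elapsed = step_masked(infer_s=0.3, ckpt_s=0.1)
    print(result, f"{elapsed:.2f}s")
```

Run sequentially, the same two calls would take about the sum of both durations; overlapped, the checkpoint cost disappears behind the longer wait.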

- Underlying storage:
  - Uses XFS with reflink support for block-level copy-on-write, reducing write amplification.
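For readers unfamiliar with reflink, the sketch below shows one way a Linux program can request a block-sharing clone of a file through the FICLONE ioctl, falling back to an ordinary copy where the filesystem does not support it. How DeltaBox itself invokes reflink is not stated in this summary, so treat the `clone_file` helper as a hypothetical illustration of the filesystem feature only.

```python
import fcntl
import shutil

# Linux ioctl request number for FICLONE, as defined in <linux/fs.h>.
FICLONE = 0x40049409

def clone_file(src, dst):
    """Clone src to dst by sharing blocks if possible, else copy bytes.

    Returns "reflink" or "copy" to report which path was taken.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            # Ask the filesystem to make dst share src's data blocks.
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return "reflink"
        except OSError:
            pass  # unsupported filesystem or cross-device clone
    shutil.copyfile(src, dst)
    return "copy"
```

On a reflink-capable filesystem the clone shares blocks with the source until one of the two files is modified, which is the block-level copy-on-write behavior the design relies on.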

Experiments

Experiments were conducted on SWE-bench, a benchmark suite for AI code repair agents, and reinforcement learning micro-benchmarks. Baselines included Firecracker VM snapshots, Docker commit, and naive file copying methods. Metrics measured checkpoint and rollback latency, state management overhead as a fraction of total trajectory time, search node exploration counts, and success rates. Ablation studies isolated the impact of DeltaFS dynamic layer switching, DeltaCR template fork recovery, and asynchronous warm-up. Key hyperparameters such as template pool size and CRIU dump frequency were tuned. All experiments ran on Linux systems with XFS reflink-enabled filesystems, ensuring efficient copy-on-write support.

Results

DeltaBox achieved an average checkpoint latency of 14 ms and an average rollback latency of 5 ms across SWE-bench workloads, outperforming Firecracker VM snapshots, which typically incur latencies from 200 ms to seconds. State management overhead was reduced from 47–77% in baseline coupled filesystem approaches to 3–6%, enabling AI agents to explore significantly more search nodes within fixed time budgets and improving task success rates. Ablation studies revealed that DeltaFS’s dynamic layer switching and DeltaCR’s template fork mechanism contributed approximately 40% and 35% of the performance gains respectively. The asynchronous warm-up thread further reduced latency spikes caused by copy-on-write page faults. The system demonstrated stable, low-latency operation suitable for high-frequency checkpoint/restore demands.
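A back-of-the-envelope calculation shows how the overhead fractions relate to node counts. Assume, purely for illustration, that the non-management work per node stays the same and that state management takes a fraction f of each node's total time. Then the number of nodes that fit in a fixed budget is proportional to 1 - f. The pairings of endpoints below are assumptions, and the printed ratios are not figures reported by the paper.

```python
def node_speedup(f_old, f_new):
    """Ratio of nodes explored in a fixed time budget.

    f_old, f_new: fraction of per-node time spent on state management
    before and after. Assumes the useful work per node is unchanged.
    """
    return (1.0 - f_new) / (1.0 - f_old)

# Illustrative pairings of the reported ranges (47-77% down to 3-6%).
print(round(node_speedup(0.47, 0.06), 2))  # 1.77
print(round(node_speedup(0.77, 0.03), 2))  # 4.22
```

Under these assumptions the same time budget covers somewhere between roughly 1.8 and 4.2 times as many nodes, with the gain growing quickly as the baseline overhead approaches the whole budget.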

Applications

DeltaBox is applicable to AI agents requiring high-frequency, fine-grained state exploration, such as automated code repair, software testing, desktop automation, and web navigation. Its millisecond-level checkpoint and rollback capabilities support deep tree search, large-scale parallel trajectory exploration, and reinforcement learning training with rapid sandbox cloning and restoration. The system is transparent to agents and requires no code modifications, which eases integration into existing AI frameworks. Future extensions to distributed multi-node environments could support scalable cloud-based services and large-scale automated systems.

Limitations & Outlook

DeltaBox relies on underlying filesystems with reflink support (e.g., XFS), limiting deployment flexibility in environments lacking such features. The bounded template process pool can lead to eviction under frequent checkpoint/rollback, causing fallback to slower CRIU restore paths and increased latency. The current design targets single-node microVM environments and does not address distributed multi-node state synchronization and consistency, restricting applicability in large-scale cluster settings.

Abstract

LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory and contexts). Existing mechanisms duplicate the entire state, causing hundreds of milliseconds to seconds of latency per C/R, which severely bottlenecks deep search and large-scale fan-outs. This paper observes that successive checkpoints in AI agents are highly similar. Therefore, instead of full duplication, a sandbox should only duplicate the changes between consecutive checkpoints (Key Insight). However, realizing this idea is non-trivial, mainly because the necessary OS support is missing. This paper proposes a new OS-level abstraction, DeltaState, to enable change-based transactional C/R for AI agents with two co-designed OS mechanisms. First, DeltaFS enables change-based filesystem C/R by organizing the file states into layers and dynamically freezing the writable layer and inserting a new one during checkpoint, reducing file updates to copy-on-write and making rollback a simple layer switch. Second, DeltaCR enables change-based process state C/R using incremental dumps, and accelerates rollback by bypassing traditional pipelines to directly fork() from a frozen template process. We then present DeltaBox, a novel agent sandbox achieving millisecond-level C/R through the two new mechanisms. Evaluations on SWE-bench and RL micro-benchmarks show DeltaBox completes checkpoint and rollback with millisecond-level latency (14 ms and 5 ms, respectively), empowering agents to explore substantially more nodes under fixed time budgets.

cs.OS cs.AI