KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

Key Findings

Methodology

KV-Fold is a training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward. The same one-step update is applied repeatedly, analogous to foldl in functional programming. KV-Fold repurposes the KV cache concatenation primitive, originally introduced for latent multi-agent communication, as a chunk-to-chunk recurrence for long-context inference.
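
The fold structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `KVCache`, `toy_forward`, and `split_into_chunks` are hypothetical names, and the "forward pass" only labels tokens instead of running a transformer.

```python
from functools import reduce
from typing import List, NamedTuple, Tuple


class KVCache(NamedTuple):
    """Toy stand-in for a KV cache: one key and one value per token."""
    keys: Tuple[str, ...]
    values: Tuple[str, ...]


def toy_forward(cache: KVCache, chunk: List[str]) -> KVCache:
    """Hypothetical one-step update: process a chunk given the cache, then append.

    A real model would run a forward pass that attends to cache.keys and
    cache.values; here each token simply yields a labelled key and value.
    """
    new_keys = tuple(f"k({tok})" for tok in chunk)
    new_values = tuple(f"v({tok})" for tok in chunk)
    return KVCache(cache.keys + new_keys, cache.values + new_values)


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


tokens = [f"t{i}" for i in range(10)]
chunks = split_into_chunks(tokens, chunk_size=4)

# The empty cache is the initial accumulator; reduce is Python's left fold.
final_cache = reduce(toy_forward, chunks, KVCache((), ()))
print(len(chunks), len(final_cache.keys))  # 3 10
```

The point of the sketch is the shape of the computation: the same update function is applied at every step, and the only state carried between steps is the cache.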

Key Results

On a needle-in-a-haystack benchmark, KV-Fold achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511.
KV-Fold maintains long-range retrieval on Llama-3.1-8B while operating within the memory limits of a single 40GB GPU.
Compared to streaming methods, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes.
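
To make the relationship between context length, chunk size, and chain depth concrete, the following sketch counts chunks and cache hand-offs. It assumes that chain depth means the number of times the cache is passed from one chunk to the next; this summary does not state how the paper defines depth or which chunk sizes it used, so `fold_steps` and the chunk sizes below are purely illustrative.

```python
from typing import Tuple


def fold_steps(context_tokens: int, chunk_size: int) -> Tuple[int, int]:
    """Return (number of chunks, number of cache hand-offs between chunks)."""
    num_chunks = -(-context_tokens // chunk_size)  # ceiling division
    handoffs = max(num_chunks - 1, 0)
    return num_chunks, handoffs


# Hypothetical chunk sizes, chosen only to show how depth scales.
for chunk_size in (256, 1024, 4096):
    print(chunk_size, fold_steps(128 * 1024, chunk_size))
```

Smaller chunks mean more hand-offs for the same context, which is why deep chains are the demanding case for a recurrence of this kind.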

Significance

KV-Fold provides a practical route to long-context inference without architectural changes or training. It is most relevant to applications that need exact recall over long sequences, such as code generation, medical record analysis, and customer preference modeling. By keeping the model frozen, KV-Fold shows that pretrained transformers already support a stable form of KV-cache recurrence.

Technical Contribution

KV-Fold's technical contribution lies in its training-free long-context inference capability, utilizing the KV cache of pretrained transformers as a recurrent state. Unlike existing streaming methods and KV-cache compression techniques, it requires no model modifications or fine-tuning, nor does it introduce special memory tokens.

Novelty

KV-Fold's novelty lies in its use of the KV cache as a recurrent state. Compared to existing methods, KV-Fold provides a simple and effective long-context inference solution without requiring any model modifications or training.

Limitations

KV-Fold's cache grows linearly with sequence length, potentially increasing memory usage and per-step latency.
In extremely long sequences, more GPU memory or cache compression strategies may be required.
The robustness of this method across different model architectures and operational choices needs further validation.
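
A rough estimate shows how quickly the cache grows. The sketch below assumes the publicly documented Llama-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache; the precision and memory layout used in the paper are not stated in this summary, so treat the numbers as a back-of-the-envelope calculation. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 32,      # assumed: Llama-3.1-8B config
    num_kv_heads: int = 8,     # assumed: grouped-query attention with 8 KV heads
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # assumed: 16-bit cache entries
) -> int:
    """Bytes needed to store keys and values for num_tokens tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


GIB = 1024 ** 3
print(kv_cache_bytes(1))  # 131072 bytes, i.e. 128 KiB per token
for context in (16 * 1024, 128 * 1024):
    print(context, kv_cache_bytes(context) / GIB)  # 2.0 GiB and 16.0 GiB
```

Under these assumptions a 128K-token cache takes about 16 GiB on top of roughly 15 GiB of 16-bit weights, which is consistent with fitting on a 40GB GPU but leaves limited headroom for longer contexts.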

Future Work

Future research directions include exploring KV cache compression strategies to reduce memory usage, validating KV-Fold's robustness across different model architectures and operational choices, and testing its performance in larger-scale contexts.

AI Executive Summary

Long-context inference remains a persistent challenge in natural language processing. Existing methods often require model modifications or training, or involve trade-offs between memory and retrieval accuracy. KV-Fold offers a different approach to this problem.

KV-Fold is a training-free long-context inference protocol that treats the KV cache of pretrained transformers as a recurrent state. By performing a left fold over sequence chunks, KV-Fold achieves chunk-to-chunk recurrence for long-context inference. This method requires no model modifications or fine-tuning, nor does it introduce special memory tokens.

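The core technical principle of KV-Fold is that this recurrence is stable. At each step, the model processes the next chunk conditioned on the accumulated KV cache, appends the newly produced keys and values, and passes the enlarged cache forward. According to the abstract, per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains; the plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. The cache itself still grows linearly with sequence length.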

In experiments, KV-Fold achieves 100% exact-match retrieval on a needle-in-a-haystack benchmark, covering contexts from 16K to 128K tokens and chain depths up to 511. This result shows that KV-Fold can maintain long-range retrieval while operating as a sequence of tractable forward passes within the memory limits of a single 40GB GPU.

KV-Fold provides a practical route to long-context inference without architectural changes or training. It is most relevant to applications that need exact recall over long sequences, such as code generation, medical record analysis, and customer preference modeling.

Despite KV-Fold's potential in long-context inference, its cache grows linearly with sequence length, potentially increasing memory usage and per-step latency. Future research directions include exploring KV cache compression strategies to reduce memory usage and testing its performance in larger-scale contexts.

Deep Analysis

Background

Long-context inference is of great significance in the field of natural language processing. With the widespread application of pretrained transformers, effectively utilizing long-context information has become a key issue. Existing methods often require model modifications or training, or involve trade-offs between memory and retrieval accuracy. For example, streaming methods achieve bounded-memory inference by retaining a window of recent tokens, but this may lead to decreased retrieval accuracy. KV-cache compression methods attempt to compress the cache without losing important information, but these methods often require model modifications or fine-tuning.
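
The trade-off between a bounded window and a growing cache can be illustrated with a toy example. This is a deliberately simplified sketch: real streaming methods operate on key and value tensors and may retain more than just recent tokens, and `build_caches` is a hypothetical helper that stores plain strings.

```python
from collections import deque
from typing import List, Tuple


def build_caches(tokens: List[str], window: int) -> Tuple[List[str], List[str]]:
    """Feed tokens into a fixed-size sliding window and into a growing cache."""
    sliding = deque(maxlen=window)  # oldest entries are evicted automatically
    growing: List[str] = []
    for tok in tokens:
        sliding.append(tok)
        growing.append(tok)
    return list(sliding), growing


# A "needle" placed early, followed by enough filler to push it out of the window.
tokens = ["filler"] * 5 + ["NEEDLE"] + ["filler"] * 20
sliding, growing = build_caches(tokens, window=8)

print(len(sliding), "NEEDLE" in sliding)  # 8 False
print(len(growing), "NEEDLE" in growing)  # 26 True
```

The window keeps memory constant but loses the early item; the growing cache keeps it at the cost of memory that scales with sequence length.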

Core Problem

The core problem of long-context inference is how to use information from anywhere in a long sequence while keeping memory and compute tractable. Existing methods often require model modifications or training, or involve trade-offs between memory and retrieval accuracy. In scenarios where exact retrieval is important, such as recovering an identifier from a long log or preserving factual details introduced much earlier in a document, these trade-offs can be unacceptable.

Innovation

The core innovation of KV-Fold lies in its use of the KV cache as a recurrent state. Unlike existing methods, KV-Fold provides a simple and effective long-context inference solution without requiring any model modifications or training. Specifically, KV-Fold achieves chunk-to-chunk recurrence for long-context inference by performing a left fold over sequence chunks. This method requires no model modifications or fine-tuning, nor does it introduce special memory tokens.

Methodology

The implementation of KV-Fold includes the following key steps:

1. Divide the long sequence into chunks, each processed as a single forward pass.

2. At each step, the model processes the next chunk conditioned on the accumulated KV cache, appends the newly produced keys and values, and passes the enlarged cache forward.

3. By performing a left fold over sequence chunks, KV-Fold achieves chunk-to-chunk recurrence for long-context inference.

4. Although the cache grows linearly with sequence length, KV-Fold demonstrates robustness and stability in experiments.
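
The steps above can be sketched with a toy single-head attention layer in pure Python. This is an illustration under strong assumptions, not the paper's code: queries, keys, and values all equal the token vector (identity projections), there is a single layer and no positional encoding, and `attend` and `process_chunk` are hypothetical names. It does not attempt to reproduce the drift measurements reported in the abstract.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Cache = Tuple[List[Vector], List[Vector]]  # (keys, values)


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention of one query over all given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def process_chunk(cache: Cache, chunk: List[Vector]) -> Tuple[Cache, List[Vector]]:
    """Process one chunk conditioned on the cache and return the enlarged cache."""
    keys, values = list(cache[0]), list(cache[1])
    outputs = []
    for x in chunk:
        keys.append(x)    # toy assumption: key = value = query = token vector
        values.append(x)
        outputs.append(attend(x, keys, values))  # causal: sees cache and earlier tokens
    return (keys, values), outputs


random.seed(0)
sequence = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(12)]

# Reference: one pass over the whole sequence.
_, full_outputs = process_chunk(([], []), sequence)

# Chunked: fold process_chunk over chunks of 5 tokens, carrying the cache.
cache: Cache = ([], [])
chunked_outputs: List[Vector] = []
for start in range(0, len(sequence), 5):
    cache, outs = process_chunk(cache, sequence[start:start + 5])
    chunked_outputs.extend(outs)

max_gap = max(
    abs(a - b)
    for full, chunked in zip(full_outputs, chunked_outputs)
    for a, b in zip(full, chunked)
)
print(len(cache[0]), max_gap)
```

In this toy setting the chunked outputs match the single-pass outputs because each token sees exactly the same keys and values either way; the cache ends up holding one entry per token, which is the linear growth noted in step 4.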

Experiments

The experimental design includes evaluating the performance of KV-Fold on a needle-in-a-haystack benchmark. This benchmark covers contexts from 16K to 128K tokens and chain depths up to 511. The model used in the experiments is Llama-3.1-8B, and the experiments are conducted on a single 40GB GPU. The experimental results show that KV-Fold achieves 100% exact-match retrieval across 152 trials.
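
For readers unfamiliar with this kind of benchmark, the sketch below shows the general shape of a needle-in-a-haystack trial with exact-match scoring. The sentence templates, the `make_haystack` and `toy_retriever` helpers, and the secret value are all hypothetical; the paper's actual prompts and scoring details are not given in this summary, and a real trial would query the language model rather than scan strings.

```python
import random
from typing import List


def make_haystack(num_filler: int, needle: str, depth_fraction: float, seed: int = 0) -> List[str]:
    """Build filler sentences and insert the needle at a relative depth in [0, 1]."""
    rng = random.Random(seed)
    filler = [f"Filler sentence {rng.randint(0, 9999)}." for _ in range(num_filler)]
    position = int(depth_fraction * num_filler)
    return filler[:position] + [needle] + filler[position:]


def exact_match(prediction: str, expected: str) -> bool:
    """Score a trial as correct only if the answer matches exactly."""
    return prediction.strip() == expected


def toy_retriever(sentences: List[str], prefix: str) -> str:
    """Stand-in for the model: return whatever follows the given prefix."""
    for sentence in sentences:
        if sentence.startswith(prefix):
            return sentence[len(prefix):].rstrip(".")
    return ""


secret = "7431"
needle = f"The magic number is {secret}."
haystack = make_haystack(num_filler=200, needle=needle, depth_fraction=0.25)

answer = toy_retriever(haystack, "The magic number is ")
print(len(haystack), haystack.index(needle), exact_match(answer, secret))  # 201 50 True
```

Exact match is a strict criterion: a paraphrased or partially correct answer counts as a failure, which is what makes a 100% score informative.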

Results

The experimental results show that KV-Fold achieves 100% exact-match retrieval on a needle-in-a-haystack benchmark, covering contexts from 16K to 128K tokens and chain depths up to 511. Compared to streaming methods, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Although the cache grows linearly with sequence length, retrieval stays exact across all 152 reported trials.

Applications

KV-Fold is most relevant to applications requiring long-sequence information processing, such as code generation, medical record analysis, and customer preference modeling. By keeping the model frozen, KV-Fold shows that pretrained transformers already support a stable form of KV-cache recurrence, with no architectural changes or training.

Limitations & Outlook

Despite KV-Fold's potential in long-context inference, its cache grows linearly with sequence length, potentially increasing memory usage and per-step latency. In extremely long sequences, more GPU memory or cache compression strategies may be required. Future research directions include exploring KV cache compression strategies to reduce memory usage and testing its performance in larger-scale contexts.
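
The growth in per-step latency can be made concrete by counting query-key pairs. The sketch assumes standard causal attention in which every token of the current chunk attends to all cached positions plus the earlier tokens of its own chunk; the count is per layer and per head, and `chunked_attention_pairs` is a hypothetical helper.

```python
from typing import List


def chunked_attention_pairs(num_chunks: int, chunk_size: int) -> List[int]:
    """Query-key pairs evaluated at each fold step under causal attention."""
    per_step = []
    for step in range(num_chunks):
        cached = step * chunk_size                      # positions already in the cache
        within = chunk_size * (chunk_size + 1) // 2     # causal pairs inside the chunk
        per_step.append(chunk_size * cached + within)
    return per_step


steps = chunked_attention_pairs(num_chunks=4, chunk_size=3)
total_tokens = 4 * 3
print(steps)                                   # [6, 15, 24, 33]
print(sum(steps))                              # 78
print(total_tokens * (total_tokens + 1) // 2)  # 78
```

The per-step work rises linearly with the number of cached positions, while the total over all steps equals the count for a single causal pass over the full sequence. Under these assumptions, chunking bounds the size of each forward pass rather than reducing the overall attention work.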

Plain Language: Accessible to Non-Experts

Imagine you're in a library trying to find a specific book. Normally, you'd have to walk along each shelf until you find the one you're looking for. That is like rereading an entire long document every time you need one detail from it. KV-Fold is like a librarian who walks the shelves once, a section at a time, and keeps notes on where every book is. Each time a new section is added, the librarian extends those notes instead of starting over, so an earlier book can still be found later. KV-Fold does this by storing information in a cache that grows as more text is read. The notes do get longer as the shelves get longer, which is the main cost of the approach. The same idea applies beyond libraries, to other tasks that involve large amounts of information, such as medical record analysis and code generation.

ELI14: Explained Like You're 14

Imagine you're playing a super complex game with a huge map, and you need to remember lots of places and quests. Usually, you might have to keep opening the map to check, but that's a hassle, right? KV-Fold is like a super helper in the game who remembers every place you've been and every quest you've completed. Every time you need to check, they quickly tell you, so you don't have to find it yourself. It's like having a friend with a super memory who always helps you find the information you need. Even as the map gets bigger and the quests increase, they can handle it easily. This method not only makes the game easier to play but also helps you find answers faster in school assignments!

Glossary

KV Cache

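In a transformer, the store of attention key and value tensors already computed for earlier tokens, kept so that later tokens can attend to them without recomputation. It is not a general-purpose key-value lookup store. In this paper, it is used as a recurrent state.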

The KV cache in pretrained transformers stores layer-wise representations for later tokens to access through attention.

Long-Context Inference

The ability of a model to effectively utilize the entire context when processing long-sequence information.

The KV-Fold protocol proposed in this paper achieves training-free long-context inference.

Left Fold

An operation in functional programming that traverses a sequence from left to right, accumulating results.

KV-Fold treats the KV cache as the accumulator in a left fold over sequence chunks.
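
A small example shows what "left" means in a left fold. Python's `functools.reduce` is a left fold: it combines the accumulator with the first element, then the result with the second, and so on. The numbers here are arbitrary.

```python
from functools import reduce
import operator

# Left fold with subtraction: ((10 - 1) - 2) - 3
result = reduce(operator.sub, [1, 2, 3], 10)
print(result)  # 4

# Recording every intermediate accumulator makes the left-to-right order visible.
trace = reduce(lambda acc, x: acc + [acc[-1] - x], [1, 2, 3], [10])
print(trace)  # [10, 9, 7, 4]
```

In KV-Fold the accumulator is the KV cache and the elements are sequence chunks, so the order of combination is the order of the text.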

Needle-in-a-Haystack

A benchmark used to evaluate a model's ability to retrieve specific information in long contexts.

KV-Fold achieves 100% exact-match retrieval on this benchmark.

Streaming Methods

Methods that achieve bounded-memory inference by retaining a window of recent tokens, but may lead to decreased retrieval accuracy.

Compared to streaming methods, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes.

Pretrained Transformers

Deep learning models that have been pretrained on large datasets and can be used for various natural language processing tasks.

KV-Fold utilizes the KV cache of pretrained transformers as a recurrent state.

Recurrence

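A computation in which the state at each step is produced from the state at the previous step and the current input. This is distinct from recursion, where a function calls itself.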

KV-Fold achieves chunk-to-chunk recurrence for long-context inference by performing a left fold over sequence chunks.

RoPE (Rotary Position Embedding)

A position encoding method for transformer models that rotates query and key vectors by angles proportional to each token's position, so that attention scores depend on the relative distance between tokens.

In KV-Fold, RoPE is used to identify the position of new tokens.
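
The relative-position property of RoPE can be checked numerically. The sketch below rotates adjacent pairs of dimensions, following the original formulation; library implementations may pair dimensions differently, and the `rope` helper and the example vectors are illustrative assumptions. How KV-Fold assigns positions to the tokens of each new chunk is not detailed in this summary.

```python
import math
from typing import List


def rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Apply rotary position embedding to a vector of even dimension."""
    dim = len(vec)
    out: List[float] = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation of the pair
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, -0.5, 0.3]
key = [0.4, -1.0, 0.7, 2.0]

# Same relative offset (2), very different absolute positions.
near = dot(rope(query, 5), rope(key, 3))
far = dot(rope(query, 105), rope(key, 103))
print(math.isclose(near, far, abs_tol=1e-9))  # True
```

Because cached keys keep the rotation from their original position, a query in a later chunk still sees the correct relative distance to them as long as its own position continues the count.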

Llama-3.1-8B

A pretrained transformer language model with 8 billion parameters, part of Meta's Llama 3.1 family.

KV-Fold was evaluated on the Llama-3.1-8B model in experiments.

GPU (Graphics Processing Unit)

A processor designed for parallel computing, commonly used for training and inference of deep learning models.

KV-Fold was evaluated on a single 40GB GPU in experiments.

Open Questions: Unanswered Questions from This Research

1. How does KV-Fold perform on extremely long sequences? While the paper demonstrates its performance on 128K contexts, in longer sequences, the linear growth of the cache might lead to excessive memory usage.
2. What is the robustness of KV-Fold across different model architectures and operational choices? While experiments demonstrate its performance on Llama-3.1-8B, its effectiveness on other models remains to be verified.
3. How can the KV cache be compressed without affecting performance? The linear growth of the cache might lead to excessive memory usage, making the exploration of effective cache compression strategies an important direction for future research.
4. How does KV-Fold perform in real-time applications? While the paper demonstrates its performance in experimental settings, its performance in real-world applications, especially those requiring real-time responses, remains to be verified.
5. How does KV-Fold perform in multi-task learning? While the paper focuses on single-task long-context inference, how to effectively utilize the KV cache in multi-task learning remains to be explored.

Applications

Immediate Applications

Code Generation

KV-Fold can be used to process large codebases, helping developers achieve precise code generation and retrieval without modifying the model.

Medical Record Analysis

In the medical field, KV-Fold can be used to analyze long-span medical records, helping doctors quickly retrieve important patient information.

Customer Preference Modeling

KV-Fold can be used to analyze long-term customer interaction histories, helping businesses better understand customer preferences and provide personalized services.

Long-term Vision

Intelligent Assistants

KV-Fold can be used to develop smarter virtual assistants that can remember users' long-term interaction histories and provide precise information when needed.

Autonomous Driving

In the field of autonomous driving, KV-Fold can be used to process long-term sensor data, helping vehicles make more accurate decisions in complex environments.

Abstract

We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.

cs.LG cs.AI cs.CL

Related Papers

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

Proposes graph-bound execution-state capsules for low-latency, small-batch on-device AI, enabling byte-exact snapshot and restore with sub-millisecond GPU performance.

cs.LG 2026-06-19
