Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

TL;DR

A training-free multi-token prediction method that probes the LLM with mask tokens drawn from its own embedding space, improving acceptance length by roughly 12% on LLaMA3.

cs.CL · Advanced · 2026-03-19
Raghavv Goel Mukul Gagrani Mingu Lee Chris Lott
large language models multi-token prediction embedding space training-free efficient decoding

Key Findings

Methodology

The paper introduces a training-free multi-token prediction (MTP) method that probes a large language model (LLM) using mask tokens from its embedding space, enabling parallel future-token prediction without modifying model weights or relying on auxiliary models. The method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while significantly reducing the number of model calls and improving token throughput.
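As a concrete illustration, the tree-construction step described above, sampling top-K candidates per masked position and pruning low-probability continuations, can be sketched as follows. The token strings, logit values, and pruning threshold are hypothetical; the paper's actual tree format and pruning rule may differ.

```python
import heapq
import math

def build_token_tree(mask_logits, top_k=3, prune_threshold=0.05):
    """Build a speculative token tree from per-position mask-token logits.

    mask_logits: a list of dicts, one per masked future position, mapping
    candidate token -> logit. Paths whose cumulative probability falls
    below prune_threshold are dropped (the lightweight pruning step).
    This data layout is an assumption for illustration only.
    """
    def softmax(logits):
        m = max(logits.values())
        exps = {t: math.exp(l - m) for t, l in logits.items()}
        z = sum(exps.values())
        return {t: e / z for t, e in exps.items()}

    paths = [((), 1.0)]  # (token path, cumulative probability)
    for logits in mask_logits:
        probs = softmax(logits)
        top = heapq.nlargest(top_k, probs.items(), key=lambda kv: kv[1])
        paths = [
            (path + (tok,), p * tp)
            for path, p in paths
            for tok, tp in top
            if p * tp >= prune_threshold
        ]
    return sorted(paths, key=lambda kv: -kv[1])

# Toy example: two masked positions, three candidates each, keep top 2.
tree = build_token_tree(
    [{"the": 2.0, "a": 1.0, "an": 0.1},
     {"cat": 1.5, "dog": 1.4, "rat": 0.2}],
    top_k=2, prune_threshold=0.05,
)
```

Every surviving path is a candidate continuation that the base model can then verify in a single parallel forward pass.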

Key Results

  • Acceptance length increased by approximately 12% on LLaMA3 and 8-12% on Qwen3. The probing-based MTP method achieved throughput gains of up to 15-19%.
  • The method consistently outperformed existing training-free baselines such as Lookahead Decoding and Prompt Lookup Decoding in the SpecBench benchmark.
  • Quantitative and qualitative studies showed how token-acceptance behavior varies with mask-token design and task type; the method excels particularly in compute-limited settings.
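For intuition on why acceptance length matters, here is a back-of-envelope accounting (illustrative only, not the paper's measurement code) of mean acceptance length and the implied saving in forward passes:

```python
def decoding_stats(accepted_per_step):
    """Mean acceptance length and the implied reduction in model calls.

    With mean acceptance length tau (tokens committed per verification
    call), generating N tokens needs roughly N / tau calls instead of N,
    so a higher tau directly cuts the number of forward passes.
    """
    tau = sum(accepted_per_step) / len(accepted_per_step)
    call_reduction = 1.0 - 1.0 / tau
    return tau, call_reduction

# Hypothetical per-call acceptance counts for one generation run.
tau, saved = decoding_stats([3, 2, 4, 3])
```

The reported throughput gains are smaller than the raw call reduction because each verification call is somewhat more expensive than a single-token forward pass.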

Significance

This research is significant for both academia and industry, demonstrating how to leverage the latent capabilities of existing large language models for multi-token prediction without increasing computational burden. The method is particularly suitable for compute-constrained environments like edge devices, addressing the challenge of traditional methods requiring substantial computational resources.

Technical Contribution

Technical contributions include introducing a novel training-free MTP paradigm that uses mask-token probing in the base model's embedding space, enabling multi-token generation without retraining or external draft models. The dynamic tree expansion mechanism allows for flexible decoding, the efficient static-tree implementation improves throughput, and theoretical and empirical evidence shows alignment between mask-token and true-token representations.

Novelty

This study is the first to propose probing mask tokens in the embedding space for multi-token prediction, offering efficient, lossless decoding without the additional training or model modifications that existing methods require.

Limitations

  • In some tasks, such as retrieval, the method performs slightly worse than alternatives, possibly because those tasks place demands on token prediction that mask-token probing does not capture well.
  • The method may incur increased computational overhead when handling very long sequences due to the complexity of the tree structure.
  • In certain cases, the initialization strategy of mask tokens may affect prediction accuracy.

Future Work

Future research directions include exploring more complex tree structures to enhance prediction diversity and accuracy, optimizing mask-token initialization strategies, and validating the method's generality across more tasks and models.

AI Executive Summary

Large language models (LLMs) have made significant strides in the field of natural language processing, particularly in generation tasks. However, traditional autoregressive decoding methods typically generate one token at a time, leaving substantial computational resources underutilized. To address this issue, this paper proposes a training-free multi-token prediction method that probes mask tokens in the embedding space, enabling parallel future-token prediction.

The core of this method lies in leveraging the internal generative capacity of large language models by synthesizing mask tokens in the model's embedding space, which are then injected into the prompt to elicit predictions of multiple future tokens. These predictions are jointly verified by the base model, enabling efficient and lossless decoding.
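The injection step can be sketched with a toy embedding matrix. Using the mean of the embedding matrix as the mask vector is an assumption made for illustration; the paper derives its mask tokens from the model's embedding space, and the exact construction may differ.

```python
import numpy as np

def inject_mask_tokens(prompt_embeds, embed_matrix, num_masks=4):
    """Append synthetic mask-token embeddings to the prompt embeddings.

    A training-free choice for the mask vector is sketched here as the
    mean of the embedding matrix (a hypothetical stand-in); the model
    then predicts a future token at each masked position in parallel.
    """
    mask_vec = embed_matrix.mean(axis=0, keepdims=True)   # shape (1, d)
    masks = np.repeat(mask_vec, num_masks, axis=0)        # (num_masks, d)
    return np.concatenate([prompt_embeds, masks], axis=0)

rng = np.random.default_rng(0)
embed_matrix = rng.normal(size=(100, 16))    # toy vocab of 100, dim 16
prompt = embed_matrix[[5, 17, 42]]           # embeddings of 3 prompt tokens
probed = inject_mask_tokens(prompt, embed_matrix, num_masks=4)
```

A single forward pass over `probed` then yields logits at all four masked positions at once, which is what enables parallel future-token prediction.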

In experiments, the method demonstrated superior performance in the SpecBench benchmark, outperforming existing training-free baselines such as Lookahead Decoding and Prompt Lookup Decoding. Specifically, acceptance length increased by approximately 12% on LLaMA3 and 8-12% on Qwen3, with throughput gains of up to 15-19%.

This method is significant for both academia and industry, particularly suitable for compute-constrained environments like edge devices. It shows how to leverage the latent capabilities of existing large language models for multi-token prediction without increasing computational burden.

However, on some tasks, such as retrieval, the method performs slightly worse than alternatives, possibly because those tasks place unusual demands on token prediction. Future research directions include exploring more complex tree structures to enhance prediction diversity and accuracy, optimizing mask-token initialization strategies, and validating the method's generality across more tasks and models.

Deep Analysis

Background

In recent years, large language models (LLMs) have made significant strides in the field of natural language processing, particularly in generation tasks. However, traditional autoregressive decoding methods typically generate one token at a time, leaving substantial computational resources underutilized. To address this issue, researchers have proposed multi-token prediction (MTP) methods, aiming to predict multiple future tokens in parallel. However, existing approaches often rely on training auxiliary heads, modifying base model weights, or employing external draft models, which are impractical in compute-constrained environments.

Core Problem

The core problem is twofold: autoregressive decoding emits one token per forward pass, leaving parallel compute idle, and the existing remedies, multi-token prediction via trained auxiliary heads, modified base-model weights, or external draft models, are impractical in compute-constrained environments.

Innovation

This paper proposes a training-free multi-token prediction method that probes mask tokens in the embedding space, enabling parallel future-token prediction. The core of this method lies in leveraging the internal generative capacity of large language models by synthesizing mask tokens in the model's embedding space, which are then injected into the prompt to elicit predictions of multiple future tokens. These predictions are jointly verified by the base model, enabling efficient and lossless decoding.
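In the greedy-decoding case, the joint verification step amounts to accepting the longest prefix of the drafted tokens that the base model itself would emit. The sketch below is a generic speculative-verification routine, not the paper's exact procedure (the sampled case uses rejection sampling instead):

```python
def verify_draft(draft_tokens, base_argmax_tokens):
    """Greedy-case lossless verification: accept the longest prefix of
    the drafted tokens matching the base model's own argmax choice at
    each position, then append the base model's token at the mismatch.

    `base_argmax_tokens[i]` is assumed to be the base model's argmax
    continuation after the prompt plus the first i drafted tokens; all
    of these are obtained in one parallel forward pass over the draft,
    which is where the speedup comes from.
    """
    accepted = []
    for drafted, target in zip(draft_tokens, base_argmax_tokens):
        if drafted != target:
            break
        accepted.append(drafted)
    # The base model's own token at the first mismatch (or extension)
    # position is always valid, so every verification call commits at
    # least one token, and the output equals plain greedy decoding.
    accepted.append(base_argmax_tokens[len(accepted)])
    return accepted

accepted = verify_draft(["the", "cat", "sat"], ["the", "cat", "ran", "off"])
# -> ["the", "cat", "sat" rejected]: two tokens accepted plus "ran"
```

Because the committed tokens always match what the base model would have produced on its own, the acceleration is lossless by construction.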

Methodology

  • Leverage the internal generative capacity of large language models by synthesizing mask tokens in the model's embedding space.
  • Inject the synthesized mask tokens into the prompt to elicit predictions of multiple future tokens.
  • Jointly verify the predictions with the base model, enabling efficient and lossless decoding.
  • Use a dynamic token-tree expansion mechanism to adaptively grow token paths based on cumulative probabilities, improving efficiency while maintaining diversity.
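The dynamic expansion step above can be sketched as a best-first search that repeatedly expands the frontier path with the highest cumulative probability until a node budget runs out. The `get_topk` interface and the budget/depth parameters are hypothetical stand-ins for one mask-token probe of the model:

```python
import heapq

def expand_tree_dynamic(get_topk, max_nodes=8, top_k=2, max_depth=4):
    """Dynamic token-tree expansion: always expand the partial path with
    the highest cumulative probability. `get_topk(path)` stands in for
    one probe of the model's mask-token logits and returns a list of
    (token, prob) pairs; this interface is assumed for illustration.
    """
    # Min-heap on negative cumulative probability acts as a max-heap.
    frontier = [(-1.0, ())]
    finished = []
    nodes_used = 0
    while frontier and nodes_used < max_nodes:
        neg_p, path = heapq.heappop(frontier)
        if len(path) >= max_depth:
            finished.append((path, -neg_p))
            continue
        nodes_used += 1
        for tok, tp in get_topk(path)[:top_k]:
            heapq.heappush(frontier, (neg_p * tp, path + (tok,)))
    finished.extend((path, -neg_p) for neg_p, path in frontier)
    return sorted(finished, key=lambda kv: -kv[1])

def _toy_topk(path):  # stand-in for one model probe
    return [("a", 0.6), ("b", 0.4)]

paths = expand_tree_dynamic(_toy_topk, max_nodes=3, top_k=2, max_depth=2)
```

Unlike a static tree, this spends the probe budget where cumulative probability is highest, which is how the mechanism trades efficiency against diversity.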

Experiments

Experiments were conducted on the SpecBench benchmark, which spans summarization, translation, reasoning, and other tasks, using LLaMA3 and Qwen3 as base models and comparing against existing training-free baselines such as Lookahead Decoding and Prompt Lookup Decoding.

Results

The method outperformed all training-free baselines tested: acceptance length increased by approximately 12% on LLaMA3 and 8-12% on Qwen3, and throughput improved by up to 15-19%.

Applications

This method is significant for both academia and industry, particularly suitable for compute-constrained environments like edge devices. It shows how to leverage the latent capabilities of existing large language models for multi-token prediction without increasing computational burden.

Limitations & Outlook

On some tasks, such as retrieval, the method performs slightly worse than alternatives, possibly because those tasks place unusual demands on token prediction. Future research directions include exploring more complex tree structures to enhance prediction diversity and accuracy, optimizing mask-token initialization strategies, and validating the method's generality across more tasks and models.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking a meal. Traditionally, you might prepare one dish at a time, which is quite inefficient. The method in this paper is like prepping several dishes at once: you lay out some 'mask tokens', like labeled empty plates, that let you plan the steps of several dishes in parallel instead of starting each one from scratch. You still taste-check everything before serving, so the meal comes out exactly the same, just faster. And because you need no extra utensils or assistants, this works even in a small, resource-limited kitchen.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a game where you can only move one step at a time. That would be super slow, right? Now imagine a tool that lets you plan several steps at once. That's what this paper's method does: it places some 'mask tokens' into the input, which act like placeholders that let the model guess several upcoming steps in one go instead of stopping to think after every single one. The model then double-checks those guesses, so the game gets faster without its outcome changing. This is especially useful on devices with limited resources, like a phone, because it needs no extra hardware or helper models.

Glossary

Large Language Model (LLM)

A large language model is a language model with a large number of parameters capable of generating or understanding natural language text.

In this paper, LLMs are used for generating and predicting natural language tokens.

Multi-Token Prediction (MTP)

Multi-token prediction is a method aimed at improving generation efficiency by predicting multiple future tokens in parallel.

The paper proposes a training-free MTP method.

Embedding Space

The embedding space is the continuous vector space into which a model maps its discrete tokens.

The paper utilizes mask tokens in the embedding space for prediction.

Mask Token

A mask token is a special token used to elicit predictions of multiple future tokens from the model.

The paper uses generated mask tokens to trigger predictions of multiple future tokens.

Lossless Decoding

Lossless decoding means the accelerated method produces exactly the output (or output distribution) that standard autoregressive decoding would, so the speedup costs no quality.

The method achieves lossless decoding.

Dynamic Token Tree

A dynamic token tree is a data structure used to adaptively grow token paths.

The paper uses a dynamic token tree expansion mechanism to improve prediction efficiency.

SpecBench

SpecBench is a benchmark covering tasks such as summarization, translation, and reasoning.

The paper validates the method's effectiveness on the SpecBench benchmark.

Throughput

Throughput refers to the number of tokens processed per unit time.

The method achieves significant improvements in throughput.

Edge Device

An edge device is a computing device that operates at the edge of a network, typically with limited resources.

The method is particularly suitable for compute-constrained edge devices.

Tree Structure

A tree structure is a data structure used to represent hierarchical relationships.

The paper uses a tree structure to organize predicted token paths.

Open Questions (unanswered questions from this research)

  1. How can multi-token prediction accuracy be further improved on more complex tasks? Existing methods perform poorly on some of them, possibly because of those tasks' particular demands on token prediction; richer tree structures and mask-token initialization strategies need exploring.
  2. How can computational overhead be controlled on very long sequences? The complexity of the tree structure can inflate overhead as sequences grow; more efficient pruning strategies are needed.
  3. How well does the method generalize? Existing experiments focus mainly on the SpecBench benchmark; validation across more tasks and models is needed.
  4. How can mask-token initialization strategies be optimized? In some cases the initialization of mask tokens affects prediction accuracy; better initialization methods need exploring.
  5. How can generation efficiency improve further without adding computational burden? More efficient generation strategies remain to be explored.

Applications

Immediate Applications

Natural Language Processing on Edge Devices

The method is particularly suitable for compute-constrained edge devices, such as smartphones or IoT devices. By improving generation efficiency, more complex natural language processing tasks can be achieved on these devices.

Real-Time Translation Systems

In real-time translation systems, the method can improve translation speed and accuracy, reduce latency, and enhance user experience.

Smart Assistants

Smart assistants can use the method to improve response speed and accuracy, providing a more natural and fluent user interaction experience.

Long-term Vision

Ubiquitous Natural Language Generation

As computational resources continue to develop, the method is expected to become ubiquitous in more application scenarios, achieving more efficient natural language generation.

Automated Content Generation

In the future, the method can be used for automated content generation, such as news reports, product descriptions, etc., improving the efficiency and quality of content creation.

Abstract

Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8-12% on Qwen3, and achieving throughput gains of up to 15-19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.


References (20)

Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière et al., 2024 (255 citations)

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Feng Lin, Hanling Yi, Hongbin Li et al., 2024 (14 citations)

Speculative Streaming: Fast LLM Inference without Auxiliary Models
Nikhil Bhendawade, Irina Belousova, Qichen Fu et al., 2024 (39 citations)

SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang et al., 2023 (142 citations)

PaSS: Parallel Speculative Sampling
Giovanni Monea, Armand Joulin, Edouard Grave, 2023 (47 citations)

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
H. Chen, Wayne Luk, Ka-Fai Cedric Yiu et al., 2024 (17 citations)

Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
Koyena Pal, Jiuding Sun, Andrew Yuan et al., 2023 (102 citations)

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng et al., 2024 (593 citations)

Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias, 2022 (1321 citations)

SpecTr: Fast Speculative Decoding via Optimal Transport
Ziteng Sun, A. Suresh, Jae Hun Ro et al., 2023 (129 citations)

Multi-Token Prediction Needs Registers
Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis, 2025 (6 citations)

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Yuhui Li, Fangyun Wei, Chao Zhang et al., 2024 (226 citations)

Qwen3 Technical Report
An Yang, Anfeng Li, Baosong Yang et al., 2025 (3706 citations)

Training Verifiers to Solve Math Word Problems
K. Cobbe, Vineet Kosaraju, Mo Bavarian et al., 2021 (7780 citations)

Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang et al., 2023 (5005 citations)

Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
Mohammad Samragh, Arnav Kundu, David Harrison et al., 2025 (19 citations)

SGLang: Efficient Execution of Structured Language Model Programs
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie et al., 2023 (636 citations)

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich et al., 2024 (209 citations)

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Yichao Fu, Peter Bailis, Ion Stoica et al., 2024 (271 citations)

Simple and Effective Masked Diffusion Language Models
S. Sahoo, Marianne Arriola, Yair Schiff et al., 2024 (495 citations)