SimSD: Simple Speculative Decoding in Diffusion Language Models

TL;DR

SimSD is a training-free, plug-and-play attention-masking strategy that enables token-level speculative decoding in diffusion LLMs, reaching up to 7.46× higher decoding throughput while maintaining generation quality.

cs.CL 2026-06-02
Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo, Jinya Jiang, Haoru Li, Chaojie Ren, Yiming Huang, Kaijie Zhu, Zhongkai Yu, Kun Zhou, Jingbo Shang
Natural Language Processing, Diffusion Models, Inference Acceleration, Masking Strategy, Token-level Verification

Key Findings

Methodology

This paper introduces SimSD, a simple speculative decoding algorithm for diffusion large language models (dLLMs). The core idea is a plug-and-play masking strategy: reference tokens predicted by a draft model are added explicitly to the input layout, and an attention mask built from token-level temporal order lets each token attend only to previous tokens and reference information. Because this mimics autoregressive causal masking, drafted tokens can be verified at the token level in a single forward pass. Position encodings are aligned through a RoPE-copy strategy. The method is training-free, compatible with existing inference pipelines, and can be combined with techniques such as KV caching and blockwise decoding. Experiments on SDAR models across four benchmarks show up to 7.46× higher decoding throughput with equal or improved generation quality.
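
To make the idea of a token-level temporal mask concrete, here is a minimal sketch in plain Python. The input layout (committed context, then reference tokens, then one current-step slot per drafted token) and the exact visibility rules are assumptions chosen for illustration; this summary does not specify the layout the paper uses, and the function names are hypothetical.

```python
def temporal_mask(n_prefix, n_draft):
    """Build allow[q][k]: may position q attend to position k?

    Hypothetical layout (an assumption, not taken from the paper):
      [0, n_prefix)                   committed context tokens
      [n_prefix, n_prefix + n_draft)  reference tokens (draft predictions)
      [n_prefix + n_draft, total)     current-step slots, one per drafted token
    """
    total = n_prefix + 2 * n_draft
    ref0 = n_prefix
    cur0 = n_prefix + n_draft
    allow = [[False] * total for _ in range(total)]
    for q in range(total):
        if q < ref0:
            # Committed context only sees committed context.
            visible = list(range(n_prefix))
        elif q < cur0:
            # Reference token i sees the context and references 0..i.
            i = q - ref0
            visible = list(range(n_prefix)) + [ref0 + j for j in range(i + 1)]
        else:
            # Current-step slot i sees the context, references 0..i-1, and itself.
            # It never sees reference i, the token it is asked to verify.
            i = q - cur0
            visible = list(range(n_prefix)) + [ref0 + j for j in range(i)] + [q]
        for k in visible:
            allow[q][k] = True
    return allow


def visible_keys(mask, q):
    """List the key positions that query position q may attend to."""
    return [k for k, ok in enumerate(mask[q]) if ok]


mask = temporal_mask(n_prefix=2, n_draft=3)
for q in range(len(mask)):
    print(q, visible_keys(mask, q))
```

Under these assumptions, each current-step slot sees exactly the context an autoregressive model would have when predicting that token, so all drafted positions can be scored in one pass.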

Key Results

  • On four benchmarks—GSM8K, TriviaQA, MBPP, and MMLU—SimSD achieves an average decoding speed of 71.6 tokens/sec for block length 4 and 81.8 tokens/sec for block length 8, representing 7.46× and 5.40× speedups over vanilla decoding, respectively. The speedup is consistent across different model sizes and tasks.
  • Generation quality remains comparable to vanilla decoding, with accuracy on GSM8K increasing from 69.6% to 71.3% for block length 4, and from 68.1% to 69.8% for block length 8, indicating that the approximation introduced by the masking strategy does not harm, and may slightly improve, output quality.
  • Ablation studies confirm the importance of position encoding alignment; unaligned RoPE causes a dramatic accuracy drop to near zero, highlighting the critical role of position consistency in token verification.
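
As a quick sanity check on the throughput figures above, the reported speeds and speedups can be combined to back out the implied vanilla decoding speed. This sketch assumes each speedup is the ratio of the averaged throughputs; if the paper instead averages per-benchmark ratios, the implied baselines would differ. The helper name is hypothetical.

```python
def implied_baseline(throughput_tok_per_s, speedup):
    """Vanilla throughput implied by a reported throughput and speedup factor."""
    return throughput_tok_per_s / speedup


# (block length, SimSD tokens/sec, reported speedup), as quoted in the results above
reported = [(4, 71.6, 7.46), (8, 81.8, 5.40)]

for block_len, tps, speedup in reported:
    base = implied_baseline(tps, speedup)
    print(f"block length {block_len}: implied vanilla speed is about {base:.1f} tokens/sec")
```

Under that assumption the implied baselines are roughly 9.6 and 15.1 tokens/sec. This would explain why block length 8 shows the higher absolute throughput but the smaller relative speedup: its vanilla baseline is already faster.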

Significance

This work addresses a key bottleneck in deploying diffusion-based large language models at scale: the lack of an efficient token-level verification mechanism akin to autoregressive causal masking. By enabling token-level speculative decoding without retraining, SimSD makes high-throughput, low-latency inference more attainable for diffusion models in applications such as conversational AI, content creation, and knowledge-intensive tasks. Its compatibility with existing acceleration techniques such as KV caching and blockwise decoding adds to its practical relevance. The approach connects the parallelism of diffusion models with the verification capability of autoregressive decoding.

Technical Contribution

The primary technical contribution is the design of a token-level temporal causal attention mask that enables diffusion models to perform token-level speculative verification. Unlike traditional bidirectional attention, this mask enforces a causal structure at the token level, allowing each token to attend only to previous tokens and reference context. The method involves explicit input layout design with reference tokens, RoPE position encoding copying for alignment, and a novel attention mask that encodes temporal order. This setup enables the model to compute valid logits for drafted tokens in a single forward pass, mimicking autoregressive verification. Additionally, the method is training-free, directly applicable to pretrained models, and compatible with KV cache and blockwise decoding, significantly improving inference efficiency without retraining.
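
The role of position alignment can be illustrated with a small RoPE sketch. Everything here is an assumption made for illustration: the three-segment layout (context, reference tokens, current-step slots), the rule that a slot reuses the position id of the token it stands for, and the interleaved pairing convention for the rotation. The paper's actual RoPE-copy procedure may differ in detail.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to an even-length vector.

    Adjacent elements (0,1), (2,3), ... are rotated as 2-D pairs; many
    implementations pair the two halves of the vector instead.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def copied_position_ids(n_prefix, n_draft):
    """Hypothetical 'copy' rule: reference tokens and current-step slots
    both reuse the sequence positions of the drafted tokens they represent."""
    prefix = list(range(n_prefix))
    block = [n_prefix + i for i in range(n_draft)]
    return prefix + block + block


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# RoPE attention scores depend only on the position offset between query and key.
same_offset_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
same_offset_b = dot(rope_rotate(q, 7), rope_rotate(k, 5))
print(copied_position_ids(n_prefix=2, n_draft=3))
print(abs(same_offset_a - same_offset_b) < 1e-9)
```

Because RoPE scores depend only on relative offsets, a slot that is physically appended at the end of the input but carries a copied position id sees its context at the same offsets the real token would. A slot left with its raw index would see shifted offsets, which is one plausible reading of why unaligned RoPE degrades accuracy so sharply in the ablation.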

Novelty

The work is presented as the first to introduce token-level temporal causal masking into diffusion language models, enabling token-level speculative decoding, a capability traditionally exclusive to autoregressive models. The key innovation is a plug-and-play attention mask that constructs a temporally valid context, allowing non-causal, bidirectional models to perform autoregressive-like verification in a single pass. This preserves the parallel decoding advantage of diffusion models while restoring verification, in contrast to prior work that relied on either additional training or structural modifications.

Limitations

  • SimSD's masking strategy introduces an approximation into the denoising process, and the resulting errors may accumulate in complex multi-turn tasks. Its behavior in multi-modal or otherwise highly nuanced scenarios remains to be validated.
  • The approach relies on careful position encoding alignment; improper implementation can lead to significant degradation, as shown by ablation results. Extending this to very large models or longer sequences may incur additional computational overhead.
  • The current validation is primarily on SDAR series models; generalization to other architectures or non-block diffusion models needs further exploration. Additionally, the method assumes the availability of reference predictions, which may not always be feasible in real-time settings.

Future Work

Future directions include refining the masking strategy to reduce approximation errors further, exploring adaptive or learned attention masks for better robustness, and extending the approach to multi-modal diffusion models. Investigating the integration with reinforcement learning or self-supervised fine-tuning could enhance verification accuracy. Moreover, scaling the method to larger models and longer sequences, as well as deploying in real-world applications like dialogue systems and content generation, will be critical. Finally, developing theoretical guarantees on the approximation bounds and error propagation will strengthen the method's reliability and adoption.

AI Executive Summary

The rapid evolution of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in understanding and generating human-like text. Among these, autoregressive (AR) models such as GPT have dominated due to their strong performance, but their inherently sequential decoding process limits inference speed, especially in real-time applications. To address this, diffusion large language models (dLLMs) have emerged as a promising alternative, leveraging bidirectional attention and iterative denoising to enable parallel or blockwise decoding. These models significantly narrow the performance gap with AR models and offer faster inference, making them attractive for practical deployment.

However, a fundamental challenge remains: the masked language modeling formulation of dLLMs conflicts with the token-level speculative decoding techniques that have proven highly effective in accelerating AR models. In AR decoding, causal masks preserve the temporal order of tokens, allowing multiple draft tokens to be verified simultaneously in a single forward pass. This capability is crucial for reducing latency and increasing throughput. In contrast, dLLMs rely on bidirectional attention and mask tokens, which break the temporal consistency needed for token-level verification. As a result, existing speculative decoding methods cannot be directly applied to diffusion models, limiting their speed and scalability.

To overcome this obstacle, the authors propose SimSD, a simple yet powerful plug-and-play masking strategy that restores token-level speculative decoding in diffusion models. The key idea is to explicitly incorporate reference tokens—predicted by a draft model—into the input, and design an attention mask that enforces a temporal order at the token level. This mask ensures that each token attends only to previous tokens and reference information, mimicking the causal structure of AR models. Additionally, the method employs a RoPE position encoding copying mechanism to align positional information, maintaining consistency in logits computation. Importantly, SimSD does not require retraining or fine-tuning; it can be directly integrated into existing inference pipelines, compatible with techniques like KV caching and blockwise decoding.
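
For reference, the single-pass verification step that this design aims to reproduce can be sketched as follows. This is the generic greedy (exact-match) variant of speculative verification from the autoregressive setting, not SimSD's own acceptance rule, which this summary does not specify. The function name and the toy token ids are hypothetical.

```python
def greedy_verify(draft_tokens, target_predictions):
    """Accept the longest draft prefix that the target model agrees with.

    target_predictions[i] is the target model's top token given the context
    plus draft_tokens[:i]. With a temporally valid mask, all of these come
    from one forward pass. Its length must be len(draft_tokens) + 1.
    """
    if len(target_predictions) != len(draft_tokens) + 1:
        raise ValueError("need one target prediction per draft token, plus one")
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_predictions[i]:
            break
        accepted.append(tok)
    # The target's own token at the first mismatch (or after the final draft
    # token) was computed from a fully verified context, so it is kept too.
    accepted.append(target_predictions[len(accepted)])
    return accepted


# Toy example: the target agrees with the first two drafted tokens only.
print(greedy_verify([5, 9, 2, 7], [5, 9, 4, 1, 3]))
```

Every call emits at least one token, so the output matches what greedy decoding with the target model alone would produce; the gain comes from emitting several tokens per target forward pass whenever the draft is good.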

Extensive experiments on SDAR series diffusion models across four benchmark tasks—GSM8K, TriviaQA, MBPP, and MMLU—demonstrate the effectiveness of SimSD. The results show up to 7.46× higher decoding throughput compared to vanilla diffusion decoding, with no degradation in generation quality. In some cases, the quality even slightly improves, indicating that the approximation introduced by the masking strategy does not harm the model’s output. These findings suggest that SimSD provides a practical solution for accelerating diffusion models, making them more suitable for real-world applications requiring low latency and high throughput.

Overall, this work is a meaningful step forward for diffusion language model inference. By connecting parallel diffusion decoding with token-level verification, SimSD makes these models more viable in time-sensitive scenarios. Its simplicity, training-free design, and compatibility with existing acceleration techniques make it a practical candidate for adoption in latency-sensitive systems such as AI assistants and content generators.

Deep Dive

Abstract

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.

cs.CL cs.AI
