
News
[2025 WWWJ] DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
Xiong Yunfan's paper "DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure," which addresses the optimization of speculative decoding, has been accepted by WWWJ 2025.
Speculative decoding is an inference-acceleration technique for large language models (LLMs). During text generation, many linguistically routine spans can be produced just as well by a smaller model. Generating with the smaller model alone, however, would inevitably degrade output quality. Speculative decoding instead uses the small (draft) model to propose candidate tokens, then verifies them with the large model, so that the final output distribution is provably identical to sampling from the large model directly.
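As an illustration, here is a minimal Python sketch of the standard accept/resample rule behind this verification step (following the commonly cited speculative sampling analysis; the inputs `p_target`, `q_draft`, and `drafted_tokens` are hypothetical stand-ins, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft, drafted_tokens):
    """Verify a chain of drafted tokens against the target model.

    p_target, q_draft: arrays of shape (len(drafted_tokens), vocab) holding
    the target / draft distributions at each drafted position (in a real
    system, obtained from one batched target-model forward pass).
    Returns the accepted tokens; the output distribution matches sampling
    from the target model directly.
    """
    accepted = []
    for i, tok in enumerate(drafted_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept the draft token with probability min(1, p(x)/q(x)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q);
            # this correction is what makes the scheme exact.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            return accepted
    # (A full implementation would also sample one bonus token from the
    # target model when every drafted token is accepted.)
    return accepted
```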
A critical limitation of conventional speculative decoding is its sequential verification: whether a later token is accepted depends on the verification results of all earlier tokens. This dependency limits how far speculation can usefully extend, because the marginal gain in accepted tokens shrinks exponentially as the drafted sequence grows. The speedup itself comes from raising hardware utilization: when batch capacity is underutilized, verifying many drafted tokens in a single forward pass increases inference parallelism and improves throughput without additional requests.
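For intuition, under the common simplifying assumption of a fixed per-token acceptance rate $\alpha$ (an idealized model, not a result from this paper), the expected number of tokens accepted from a draft of length $\gamma$ is

$$\mathbb{E}[\#\text{accepted}] \;=\; \frac{1-\alpha^{\gamma+1}}{1-\alpha},$$

so extending the draft from length $\gamma$ to $\gamma+1$ adds only $\alpha^{\gamma+1}$ expected tokens, a quantity that decays exponentially even though each extra drafted token still costs verification compute.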
This paper proposes DySpec, a speculative decoding method with a dynamic token tree structure: as the draft tree is constructed, the next expansion step is chosen adaptively based on the tokens generated so far. Experimental results demonstrate that DySpec achieves higher per-step token acceptance rates and faster end-to-end decoding than existing methods.
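The paper's exact expansion policy is detailed in the full text; as a rough sketch of the general idea only, one could grow the draft tree by always expanding the frontier path that is most probable under the draft model, so likely continuations receive deeper speculation while unlikely branches stop early. The `expand_fn` hook and `budget` parameter below are hypothetical:

```python
import heapq

def grow_draft_tree(root_logprobs, expand_fn, budget, top_k=4):
    """Greedily grow a draft token tree under a node budget.

    root_logprobs: list of (token, logprob) pairs for the first position.
    expand_fn(path): hypothetical hook returning the draft model's top
    (token, logprob) continuations after the token sequence `path`.
    Returns the set of drafted paths to verify in one batched target pass.
    """
    # Max-heap via negated cumulative log-probabilities.
    heap = [(-lp, (tok,)) for tok, lp in root_logprobs[:top_k]]
    heapq.heapify(heap)
    tree = []
    while heap and len(tree) < budget:
        neg_lp, path = heapq.heappop(heap)
        tree.append(path)
        # Push children with their cumulative path log-probability.
        for tok, lp in expand_fn(path)[:top_k]:
            heapq.heappush(heap, (neg_lp - lp, path + (tok,)))
    return tree
```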