SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

dc.contributor.author: Li, Xiangchen [en]
dc.contributor.author: Spatharakis, Dimitrios [en]
dc.contributor.author: Ghafouri, Saeid [en]
dc.contributor.author: Fan, Jiakun [en]
dc.contributor.author: Vandierendonck, Hans [en]
dc.contributor.author: John, Deepu [en]
dc.contributor.author: Ji, Bo [en]
dc.contributor.author: Nikolopoulos, Dimitrios [en]
dc.date.accessioned: 2026-01-09T18:27:48Z [en]
dc.date.available: 2026-01-09T18:27:48Z [en]
dc.date.issued: 2025-12-03 [en]
dc.date.updated: 2026-01-01T08:47:16Z [en]
dc.description.abstract: The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware capabilities. Existing strategies, such as aggressive quantization, pruning, or remote inference, either trade accuracy for efficiency or impose substantial costs. This position paper introduces a new framework that adapts speculative decoding, previously viewed primarily as a technique for accelerating autoregressive LLM generation, to edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server verifies the tokens using a more precise target model. To further increase the efficiency of verification, the edge server batches the verification requests from different devices. This approach supports heterogeneous devices and reduces the server-side memory footprint by sharing a single upstream target model across devices. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with four NVIDIA A100 GPUs indicate substantial benefits: 2.2× higher system throughput, 2.8× higher system capacity, and better cost efficiency, all without sacrificing model accuracy. [en]
dc.description.version: Published version [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.doi: https://doi.org/10.1145/3769102.3770608 [en]
dc.identifier.uri: https://hdl.handle.net/10919/140714 [en]
dc.language.iso: en [en]
dc.publisher: ACM [en]
dc.rights: Creative Commons Attribution-NonCommercial 4.0 International [en]
dc.rights.holder: The author(s) [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/ [en]
dc.title: SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving [en]
dc.type: Article - Refereed [en]
dc.type.dcmitype: Text [en]
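
The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `target_model` and `make_draft` functions, the greedy acceptance rule, and every other name below are assumptions made for illustration, and the "batch" is a plain loop standing in for a single batched forward pass on the server. The sketch only shows why verification by the target model leaves the output unchanged: each device's final sequence equals what greedy decoding with the target model alone would produce.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" here is a hypothetical greedy next-token function over integer tokens.
Model = Callable[[Sequence[int]], int]
VOCAB = 50


def target_model(prefix: Sequence[int]) -> int:
    """Toy stand-in for the large target model hosted on the edge server."""
    return (sum(prefix) * 7 + len(prefix)) % VOCAB


def make_draft(disagree_every: int) -> Model:
    """Toy draft model: matches the target except at every n-th position."""
    def draft(prefix: Sequence[int]) -> int:
        tok = target_model(prefix)
        if len(prefix) % disagree_every == 0:
            tok = (tok + 1) % VOCAB  # deliberate disagreement
        return tok
    return draft


def draft_tokens(draft: Model, prefix: Sequence[int], k: int) -> List[int]:
    """Device side: propose k candidate tokens autoregressively."""
    ctx, out = list(prefix), []
    for _ in range(k):
        tok = draft(ctx)
        out.append(tok)
        ctx.append(tok)
    return out


def verify(prefix: Sequence[int], proposed: Sequence[int]) -> List[int]:
    """Server side (greedy rule): accept proposals while they match the target,
    replace the first mismatch with the target's token, and add one extra
    target token if every proposal was accepted."""
    ctx, accepted = list(prefix), []
    for tok in proposed:
        expected = target_model(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_model(ctx))
    return accepted


def verify_batch(requests: List[Tuple[Sequence[int], Sequence[int]]]) -> List[List[int]]:
    """Stand-in for batched verification of requests from several devices."""
    return [verify(prefix, proposed) for prefix, proposed in requests]


def serve(prompts: List[List[int]], drafts: List[Model], k: int, max_new: int) -> List[List[int]]:
    """Run draft/verify rounds until every device has max_new new tokens."""
    seqs = [list(p) for p in prompts]
    limits = [len(p) + max_new for p in prompts]
    while any(len(s) < lim for s, lim in zip(seqs, limits)):
        active = [i for i in range(len(seqs)) if len(seqs[i]) < limits[i]]
        requests = [(seqs[i], draft_tokens(drafts[i], seqs[i], k)) for i in active]
        for i, accepted in zip(active, verify_batch(requests)):
            seqs[i].extend(accepted)
            del seqs[i][limits[i]:]  # trim any overshoot past the limit
    return seqs


def greedy_reference(prompt: Sequence[int], max_new: int) -> List[int]:
    """Baseline: decode with the target model alone, one token at a time."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target_model(seq))
    return seq


prompts = [[1, 2, 3], [4, 5], [6]]
drafts = [make_draft(2), make_draft(3), make_draft(5)]  # heterogeneous devices
outputs = serve(prompts, drafts, k=4, max_new=12)
assert outputs == [greedy_reference(p, 12) for p in prompts]
```

A better draft model (here, a larger `disagree_every`) has more tokens accepted per round and so needs fewer server round trips, but under this greedy rule the final text is the same regardless of draft quality.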

Files

Original bundle
Name: 3769102.3770608.pdf
Size: 1.4 MB
Format: Adobe Portable Document Format
Description: Published version
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: