Communication-Efficient Verifiable Attention for LLM Inference

Rui Tan

Abstract

When large language models (LLMs) are served remotely, clients cannot take the integrity of the computation for granted. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses a Trusted Execution Environment (TEE) to compute the non-linear components and to verify the integrity of the linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (VeriAttn) for accelerating verifiable LLM inference. VeriAttn offloads both the linear and the non-linear computations of attention to the GPU, while the TEE performs verification. For prefill, VeriAttn uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds the available GPU memory, VeriAttn partitions attention across the TEE and the GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that VeriAttn achieves speedups over TSDP of 2.60-3.38× during prefill with 6k-token prompts and 3.86-5.42× during decoding with 10k-token outputs.
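The abstract does not say which check the TEE runs on GPU results. As an illustration only, one well-known way to verify an offloaded matrix product without recomputing it is Freivalds' randomized check: instead of comparing C against A·B directly, the verifier compares A·(B·r) against C·r for random vectors r, which costs only matrix-vector products. The sketch below assumes exact integer matrices and uses hypothetical helper names (`matmul`, `matvec`, `freivalds_check`); it is not presented as VeriAttn's or TSDP's actual verification procedure.

```python
import random


def matmul(a, b):
    """Plain O(n^3) product; stands in for the untrusted accelerator."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matvec(m, v):
    """Matrix-vector product, the only operation the verifier needs."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def freivalds_check(a, b, c, rounds=20, rng=None):
    """Return False if c is certainly not a @ b, True if it probably is.

    Each round draws a random 0/1 vector r and compares a @ (b @ r)
    with c @ r. A wrong c passes a single round with probability at
    most 1/2, so it passes all rounds with probability <= 2**-rounds.
    """
    rng = rng or random.Random()
    n_cols = len(b[0])
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(n_cols)]
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False
    return True
```

With floating-point outputs from a real GPU, the exact comparison would have to be replaced by a tolerance or by arithmetic over a finite field; that choice is outside what the abstract describes.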
