The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing
Highlights
- Identifies the CVM–GPU "serialized bridge" as the dominant overhead
- Reduces throughput loss from 13–27% to single digits
- Worker-thread drain recovers up to 92% of lost performance
Abstract
GPU Confidential Computing (GPU-CC) preserves local GPU performance, yet LLM serving under Intel TDX plus GPU-CC loses 13–27% of its throughput and sees KV-cache restore latency double. We identify the bridge between the confidential VM (CVM) and the GPU, not GPU computation, as the primary bottleneck. GPU-CC turns host–device data movement into a serialized, high-setup-cost channel: secure copies cannot exploit CUDA-stream concurrency, asynchronous transfers block at runtime boundaries, and every small crossing pays a fixed overhead. In vLLM dense decode, the degradation stems from small allocation and copy operations that run 44× slower. A scheduling flag recovers 57% of the lost performance, and a worker-thread drain approach recovers up to 92% under high concurrency. The same bridge model explains the KV-cache restoration penalty and model-loading slowdowns.
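The fixed-overhead part of the bridge model can be illustrated with a toy cost calculation. The sketch below is not taken from the paper: the `Bridge` class, its function names, and the setup-cost and bandwidth figures are all hypothetical, chosen only to show why many small crossings over a serialized channel cost far more than one large crossing carrying the same bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    """Toy model of a serialized host-device channel (all values hypothetical)."""
    setup_s: float        # fixed cost paid by every crossing, in seconds
    bandwidth_bps: float  # sustained transfer rate, in bytes per second

    def crossing_time(self, nbytes: int) -> float:
        # One crossing pays the fixed setup cost plus the transfer time.
        return self.setup_s + nbytes / self.bandwidth_bps


def many_small_crossings(bridge: Bridge, sizes: list[int]) -> float:
    # Serialized channel: crossings cannot overlap, so their times add up.
    return sum(bridge.crossing_time(n) for n in sizes)


def one_large_crossing(bridge: Bridge, sizes: list[int]) -> float:
    # Same total bytes moved in a single crossing: setup is paid once.
    return bridge.crossing_time(sum(sizes))


# Assumed numbers for illustration only, not measurements.
bridge = Bridge(setup_s=20e-6, bandwidth_bps=10e9)
small_copies = [4096] * 256  # 256 copies of 4 KiB each

t_small = many_small_crossings(bridge, small_copies)
t_large = one_large_crossing(bridge, small_copies)
print(f"256 small crossings: {t_small * 1e3:.3f} ms")
print(f"one large crossing:  {t_large * 1e3:.3f} ms")
```

With these assumed parameters the setup cost dominates the small-copy case, which is the regime the abstract attributes to small allocation and copy operations in dense decode. The model says nothing about how the scheduling flag or the worker-thread drain are implemented.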
arXiv:2606.23969