GPU Confidential Computing · arXiv preprint · September 6, 2024

Confidential Computing on NVIDIA Hopper GPUs: A Performance Benchmark Study

Jianwei Zhu, Hang Yin, Peng Deng, Aline Almeida, Shunfan Zhou

Highlights

Under 7% overhead for typical LLM inference
Near-negligible overhead for large models / long sequences
PCIe data transfer — not GPU compute — is the bottleneck

Abstract

We evaluate how enabling Trusted Execution Environments (TEEs) on NVIDIA Hopper GPUs affects the performance of large language model (LLM) inference. We benchmark the overhead across multiple LLMs and token lengths and identify CPU–GPU data transfer over PCIe as the key bottleneck: computational overhead within the GPU itself is minimal, and the performance penalty comes primarily from data transfer. For typical LLM queries, overhead stays under 7%, and larger models with longer sequences show nearly negligible overhead, indicating that confidential GPU inference is practical at production scale.
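The arithmetic behind this trend can be illustrated with a small sketch. It is a toy model, not the paper's methodology: it assumes a request's latency splits into a GPU compute part and a PCIe transfer part, and that TEE mode slows only the transfer part by a fixed factor. The function names, the `transfer_penalty` parameter, and all numbers are hypothetical and are not measurements from the study.

```python
def relative_overhead(baseline_s: float, tee_s: float) -> float:
    """Fractional slowdown of a TEE run relative to the non-TEE baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (tee_s - baseline_s) / baseline_s


def modeled_overhead(compute_s: float, transfer_s: float,
                     transfer_penalty: float) -> float:
    """Toy model (an assumption): TEE mode slows only the transfer portion.

    baseline = compute + transfer
    tee      = compute + transfer * (1 + transfer_penalty)
    """
    baseline = compute_s + transfer_s
    tee = compute_s + transfer_s * (1.0 + transfer_penalty)
    return relative_overhead(baseline, tee)


if __name__ == "__main__":
    # Hypothetical inputs: transfer time is held fixed while compute time
    # grows, standing in for larger models and longer sequences.
    for compute_s in (0.5, 2.0, 8.0):
        overhead = modeled_overhead(compute_s, transfer_s=0.05,
                                    transfer_penalty=1.0)
        print(f"compute={compute_s:4.1f}s  overhead={overhead:.2%}")
```

Under these assumptions the added cost is a fixed amount of transfer time, so its share of total latency shrinks as compute time grows. That is one plausible reading of why larger models and longer sequences show smaller relative overhead.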

arXiv:2409.03992