GPU Confidential Computing
arXiv preprint · September 6, 2024
Confidential Computing on NVIDIA Hopper GPUs: A Performance Benchmark Study
Highlights
- Under 7% overhead for typical LLM inference
- Near-negligible overhead for large models / long sequences
- PCIe data transfer — not GPU compute — is the bottleneck
Abstract
We evaluate how enabling Trusted Execution Environments (TEEs) on NVIDIA Hopper GPUs affects the performance of large language model (LLM) inference. We benchmark the overhead across multiple LLMs and token lengths, focusing on CPU–GPU data transfer over PCIe as the key constraint. Computational overhead within the GPU itself remains minimal; the performance penalty comes primarily from data transfer. For typical LLM queries, overhead stays under 7%, and larger models with longer sequences show nearly negligible overhead, indicating that confidential GPU inference is practical at production scale.
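One way to see why a transfer-bound penalty shrinks for larger models and longer sequences is a toy latency model in which TEE mode slows only the PCIe transfer portion of a request. The sketch below is illustrative, not the study's methodology: the function names, the `transfer_slowdown` multiplier, and all timing values are hypothetical assumptions, not measurements from the paper.

```python
def relative_overhead(baseline_s: float, tee_s: float) -> float:
    """Return the TEE overhead as a fraction of the baseline latency."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (tee_s - baseline_s) / baseline_s


def modeled_overhead(compute_s: float, transfer_s: float,
                     transfer_slowdown: float) -> float:
    """Toy model: TEE mode leaves GPU compute unchanged and slows only transfers.

    compute_s         -- GPU compute time in seconds (assumed unaffected by TEE)
    transfer_s        -- CPU-GPU transfer time in seconds without TEE
    transfer_slowdown -- hypothetical multiplier applied to transfer time in TEE mode
    """
    baseline = compute_s + transfer_s
    tee = compute_s + transfer_s * transfer_slowdown
    return relative_overhead(baseline, tee)


# Hypothetical timings chosen for illustration only.
short_query = modeled_overhead(compute_s=0.9, transfer_s=0.1, transfer_slowdown=1.5)
long_query = modeled_overhead(compute_s=9.9, transfer_s=0.1, transfer_slowdown=1.5)

print(f"short query overhead: {short_query:.1%}")
print(f"long query overhead:  {long_query:.1%}")
```

With the same absolute transfer penalty in both cases, the longer-running request amortizes it over more compute time, so its relative overhead is smaller. This matches the direction of the trend described in the abstract, though the real magnitudes depend on the measured workloads.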
arXiv:2409.03992