Private AI Inference
Centralized inference can log prompts and leak intellectual property. Phala enclaves ensure that no operator, cloud provider or vendor alike, can inspect your data.
Traditional cloud infrastructure exposes sensitive information to operators and administrators.
Hardware-enforced isolation prevents unauthorized access while maintaining computational efficiency.
End-to-end encryption protects data in transit, at rest, and, critically, during computation.
Cryptographic verification ensures code integrity and proves execution in genuine TEE hardware.
GPU TEE Protection
Zero-Trust Inference
GPU TEEs with Intel TDX and AMD SEV provide hardware-level memory encryption: your model weights, user prompts, and inference outputs stay encrypted even while in use, inside attested GPU enclaves. Not even cloud admins or hypervisors can inspect runtime state.
Privacy as a human right, by design. Requests are routed over mTLS into the enclave; the service emits usage receipts and never stores plaintext. OpenAI-compatible endpoints come with verifiable attestation and zero-logging guarantees.
Access the latest frontier AI models with cryptographic privacy protection
gpt-oss-20b is an open-weight 21B parameter model released by OpenAI under the Apache 2.0 license. It uses a Mixture-of-Experts (MoE) architecture with 3.6B active parameters per forward pass, optimized for lower-latency inference and deployability on consumer or single-GPU hardware. The model is trained in OpenAI’s Harmony response format and supports reasoning level configuration, fine-tuning, and agentic capabilities including function calling, tool use, and structured outputs.
gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized to run on a single H100 GPU with native MXFP4 quantization. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.
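As a sketch of how the native tool use described above could be exercised through an OpenAI-compatible chat-completions endpoint: the `tools` schema below follows the standard OpenAI function-calling format, while the `get_weather` tool and the `phala/gpt-oss-120b` model ID are illustrative assumptions, not documented names.

```python
# Illustrative function-calling request for gpt-oss-120b.
# The get_weather tool is a made-up example; the schema shape is the
# standard OpenAI chat-completions "tools" format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Keyword arguments for client.chat.completions.create(...); the model
# decides whether to answer directly or emit a tool call.
request = {
    "model": "phala/gpt-oss-120b",  # assumed model ID
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": tools,
    "tool_choice": "auto",
}
```

If the model chooses to call the tool, the response carries a `tool_calls` entry instead of plain text, which the caller executes and feeds back as a `tool` role message.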
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open-source model and the successor to [Gemma 2](google/gemma-2-27b-it).
Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and insects. It is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
DeepSeek V3, a 685B-parameter, mixture-of-experts model, is the latest iteration of the flagship chat model family from the DeepSeek team.
Qwen2.5 7B is the latest series of Qwen large language models. Qwen2.5 brings the following improvements over Qwen2:
- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g. tables), and generating structured outputs, especially JSON. More resilient to diverse system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Usage of this model is subject to the [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
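The structured-JSON capability can be exercised through the OpenAI SDK's standard `response_format` parameter. A minimal sketch, assuming the gateway forwards this parameter to the model; the `phala/qwen2.5-7b-instruct` model ID and the `extract_json` helper are illustrative, not documented names:

```python
import json

# Sketch: ask the model for a JSON-only reply via response_format,
# then parse it. The model ID below is an assumption for illustration.
def extract_json(client, text):
    response = client.chat.completions.create(
        model="phala/qwen2.5-7b-instruct",
        messages=[
            {"role": "system", "content": "Reply with a JSON object only."},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
    )
    # json_object mode constrains the model to emit valid JSON,
    # so this parse should not raise on well-behaved replies.
    return json.loads(response.choices[0].message.content)
```

For example, `extract_json(client, "Alice, 34, lives in Berlin.")` would return a Python dict rather than free-form text.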
Discover how leading companies are leveraging Phala's confidential AI to build exceptional digital experiences while maintaining complete data privacy and regulatory compliance.
Deploy confidential AI inference with the flexibility of cloud and the security of on-premise infrastructure.
End-to-end encrypted
Hardware-attested routing
by OpenAI
by DeepSeek
by Qwen
OpenAI-compatible APIs with advanced capabilities, running in TEEs
Use OpenAI-compatible SDK to access 200+ models with hardware-enforced privacy. Drop-in replacement with zero code changes.
```python
from openai import OpenAI

client = OpenAI(
    api_key="<API_KEY>",
    base_url="https://api.redpill.ai/api/v1",
)

response = client.chat.completions.create(
    model="phala/deepseek-chat-v3-0324",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is your model name?"},
    ],
)

print(response.choices[0].message.content)
```
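The same call can also stream tokens as they are produced. With `stream=True`, the OpenAI SDK returns an iterator of chunks whose text lives under `choices[0].delta.content`; the `stream_chat` helper below is an illustrative sketch, not part of the SDK:

```python
def stream_chat(client, model, messages):
    """Stream a chat completion, printing tokens as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the final one) carry no content
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)
```

Streaming is usually preferable for chat UIs, since users see the first tokens immediately instead of waiting for the full completion.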
Every response includes cryptographic proof from NVIDIA and Intel TEE hardware. Verify attestation to ensure secure execution.
```python
import requests
import jwt

# Fetch the attestation report for a given model
response = requests.get(
    "https://api.redpill.ai/v1/attestation/report?model=phala/deepseek-v3",
    headers={"Authorization": f"Bearer {api_key}"},
)
report = response.json()

# Verify NVIDIA GPU attestation with the NVIDIA attestation service
gpu_response = requests.post(
    "https://nras.attestation.nvidia.com/v3/attest/gpu",
    headers={"Content-Type": "application/json"},
    data=report["nvidia_payload"],
)

# Check the verification result: the second element of the response
# maps each GPU ID to a JWT carrying the measurement result
gpu_tokens = gpu_response.json()[1]
for gpu_id, token in gpu_tokens.items():
    decoded = jwt.decode(token, options={"verify_signature": False})
    assert decoded.get("measres") == "success"
    print(f"{gpu_id}: Verified ✓")
```
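Note that decoding with `verify_signature: False` only reads the token's claims; it proves nothing about who issued the token, so production code should also verify the signature against NVIDIA's published keys. The stdlib-only sketch below illustrates what an unverified decode amounts to, using a locally built stand-in token (real NRAS tokens are signed by NVIDIA, not unsigned like this one):

```python
import base64
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

# Build an unsigned stand-in token carrying the claim that NRAS
# sets on a successful measurement. This is for illustration only.
header = b64url(json.dumps({"alg": "none"}).encode())
payload = b64url(json.dumps({"measres": "success"}).encode())
token = f"{header}.{payload}."

# An unverified decode is just base64url-decoding the payload segment:
segment = token.split(".")[1]
claims = json.loads(base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4)))
assert claims["measres"] == "success"
```

Anyone can forge such a token, which is why checking the claim alone is insufficient: the signature check is what ties the measurement to genuine NVIDIA attestation hardware.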
Choose the perfect privacy-first AI solution tailored to your needs
Private AI assistants for individuals who value data sovereignty and zero-logging guarantees.
OpenAI-compatible APIs with TEE protection—drop-in replacement with hardware-enforced privacy.
Scalable confidential AI infrastructure with compliance, auditability, and flexible deployment options.
Meeting the highest compliance requirements for your business
Everything you need to know about Private AI Inference
Deploy confidential LLM endpoints with hardware-enforced encryption and zero-knowledge guarantees.
Deploy on Phala