Private AI Inference

Serve LLMs without exposing prompts or weights

Why Private Inference Matters

Centralized inference can log prompts and leak intellectual property. Phala's enclaves ensure that no operator, whether cloud provider or model vendor, can peek inside.

Data security
Prompts can expose sensitive queries. Traditional cloud infrastructure exposes sensitive information to operators and administrators.

Confidential computing
Model weights are valuable IP. Hardware-enforced isolation prevents unauthorized access while maintaining computational efficiency.

Zero-trust architecture
Inference logs reveal business patterns. End-to-end encryption protects data in transit, at rest, and, critically, during computation.

Attestation
No operator access to runtime memory. Cryptographic verification ensures code integrity and proves execution on genuine TEE hardware.

GPU TEE Protection · Zero-Trust Inference · Confidential Serving

OpenAI-Compatible API with Hardware Encryption

NVIDIA GPU TEEs, combined with CPU-side confidential computing from Intel TDX and AMD SEV, provide hardware-level memory encryption: model weights, user prompts, and inference outputs stay encrypted even while in use, inside attested enclaves. Not even cloud admins or hypervisors can inspect runtime state.

Privacy as a human right, by design. Requests are routed over mTLS directly into the enclave; the service emits usage receipts and never stores plaintext. Endpoints are OpenAI-compatible, with verifiable attestation and zero-logging guarantees; a minimal request sketch follows the feature list below.

GPU memory encryption
OpenAI-compatible API
Zero-logging architecture
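
Because the endpoint speaks the standard OpenAI wire protocol, any HTTP client works and no SDK is required. A minimal sketch, assuming the usual /chat/completions route under the base URL used elsewhere on this page (swap in your own API key):

minimal_request.py
# Hedged sketch: plain HTTPS call to the OpenAI-compatible
# /chat/completions route; no SDK required.
import requests

API_KEY = "<API_KEY>"  # your API key

resp = requests.post(
    "https://api.redpill.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "phala/deepseek-chat-v3-0324",
        "messages": [{"role": "user", "content": "Say hello from inside a TEE."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])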

Available Models

Access the latest frontier AI models with cryptographic privacy protection


OpenAI: GPT OSS 20B

openai/gpt-oss-20b (encrypted)

gpt-oss-20b is an open-weight 21B parameter model released by OpenAI under the Apache 2.0 license. It uses a Mixture-of-Experts (MoE) architecture with 3.6B active parameters per forward pass, optimized for lower-latency inference and deployability on consumer or single-GPU hardware. The model is trained in OpenAI’s Harmony response format and supports reasoning level configuration, fine-tuning, and agentic capabilities including function calling, tool use, and structured outputs.

131K context | $0.10/M input tokens | $0.40/M output tokens
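
The card above lists function calling and tool use among gpt-oss-20b's capabilities. Here is a hedged sketch of a tool call through the same OpenAI-compatible SDK; the get_weather tool is hypothetical, and the exact serving slug may carry a different prefix (Step 1 below uses a phala/ prefix):

tool_call_sketch.py
# Hedged sketch: function calling via the OpenAI-compatible API.
# The get_weather tool is hypothetical, for illustration only.
from openai import OpenAI

client = OpenAI(api_key="<API_KEY>", base_url="https://api.redpill.ai/api/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# If the model chose to call the tool, the arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)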

OpenAI: GPT OSS 120B

openai/gpt-oss-120b (encrypted)

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized to run on a single H100 GPU with native MXFP4 quantization. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.

131K context | $0.10/M input tokens | $0.49/M output tokens

Google: Gemma 3 27B

google/gemma-3-27b-it (encrypted)

Gemma 3 introduces multimodality, supporting vision-language input and text output. It handles context windows up to 128K tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open model and the successor to [Gemma 2](google/gemma-2-27b-it).

54K context | $0.11/M input tokens | $0.40/M output tokens
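
Since Gemma 3 accepts vision-language input, an image can be passed using the standard OpenAI content-parts format. A hedged sketch; the image URL is a placeholder, and only the model slug comes from the card above:

vision_sketch.py
# Hedged sketch: image + text input via the OpenAI-compatible
# content-parts format; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI(api_key="<API_KEY>", base_url="https://api.redpill.ai/api/v1")

response = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chart in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)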

Qwen: Qwen2.5 VL 72B Instruct

qwen/qwen2.5-vl-72b-instruct (encrypted)

Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and insects. It is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

128K context | $0.59/M input tokens | $0.59/M output tokens

DeepSeek: DeepSeek V3 0324

deepseek/deepseek-chat-v3-0324 (encrypted)

DeepSeek V3 is a 685B-parameter mixture-of-experts model, the latest iteration of the flagship chat model family from the DeepSeek team.

164K context | $0.49/M input tokens | $1.14/M output tokens

Qwen2.5 7B Instruct

qwen/qwen-2.5-7b-instruct (encrypted)

Qwen2.5 7B is the latest series of Qwen large language models. Qwen2.5 brings the following improvements upon Qwen2:

  • Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to specialized expert models in these domains.
  • Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
  • Long-context support up to 128K tokens, with generation of up to 8K tokens.
  • Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Usage of this model is subject to the [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

33K context | $0.04/M input tokens | $0.10/M output tokens
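
The card above highlights structured JSON output. A hedged sketch using the OpenAI response_format parameter; support for json_object mode on this endpoint is an assumption based on the card's claims:

json_output_sketch.py
# Hedged sketch: request strict JSON via the OpenAI-compatible
# response_format parameter (assumed supported for this model).
import json
from openai import OpenAI

client = OpenAI(api_key="<API_KEY>", base_url="https://api.redpill.ai/api/v1")

response = client.chat.completions.create(
    model="qwen/qwen-2.5-7b-instruct",
    messages=[{
        "role": "user",
        "content": "Return a JSON object with keys 'city' and 'population' for Tokyo.",
    }],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)
print(data["city"], data["population"])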

Real-World Success Stories

Discover how leading companies are leveraging Phala's confidential AI to build exceptional digital experiences while maintaining complete data privacy and regulatory compliance.

On-prem Privacy, Cloud Simplicity

Deploy confidential AI inference with the flexibility of cloud and the security of on-premise infrastructure.

Your private request → end-to-end encrypted → Phala Cloud (hardware-attested routing) → DeepSeek V3 by DeepSeek, running in a GPU TEE

Powerful Features, Simple Integration

OpenAI-compatible APIs with advanced capabilities running in TEE

Step 1

Make Secure API Requests

Use OpenAI-compatible SDK to access 200+ models with hardware-enforced privacy. Drop-in replacement with zero code changes.

secure_request.py
from openai import OpenAI

client = OpenAI(
    api_key="<API_KEY>",
    base_url="https://api.redpill.ai/api/v1"
)

response = client.chat.completions.create(
    model="phala/deepseek-chat-v3-0324",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is your model name?"},
    ],
    stream=True,
)

# stream=True yields an iterator of chunks rather than a single
# completion object, so print each delta as it arrives.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
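
For a non-streaming call, drop stream=True and read the full reply from response.choices[0].message.content instead.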
Step 2

Verify TEE Execution

Every response includes cryptographic proof from NVIDIA and Intel TEE hardware. Verify attestation to ensure secure execution.

verify_attestation.py
import requests
import jwt  # PyJWT

api_key = "<API_KEY>"  # same key used for inference requests

# Fetch the attestation report for the model's serving enclave
response = requests.get(
    "https://api.redpill.ai/v1/attestation/report?model=phala/deepseek-v3",
    headers={"Authorization": f"Bearer {api_key}"}
)
report = response.json()

# Forward the NVIDIA payload to NVIDIA's remote attestation service (NRAS)
gpu_response = requests.post(
    "https://nras.attestation.nvidia.com/v3/attest/gpu",
    headers={"Content-Type": "application/json"},
    data=report["nvidia_payload"]
)

# NRAS returns per-GPU JWTs; check each measurement result
gpu_tokens = gpu_response.json()[1]
for gpu_id, token in gpu_tokens.items():
    decoded = jwt.decode(token, options={"verify_signature": False})
    assert decoded.get("measres") == "success"
    print(f"{gpu_id}: Verified ✓")

Solutions for Every User

Choose the perfect privacy-first AI solution tailored to your needs

Personal


Private AI assistants for individuals who value data sovereignty and zero-logging guarantees.

What's included:

  • Private chat with zero data retention
  • Encrypted journal & notes
  • Personal data analysis

Developer


OpenAI-compatible APIs with TEE protection—drop-in replacement with hardware-enforced privacy.

What's included:

  • One-line API integration
  • Same SDKs & libraries
  • Verifiable attestation

Enterprise


Scalable confidential AI infrastructure with compliance, auditability, and flexible deployment options.

What's included:

  • Private RAG & AI copilots
  • Confidential fine-tuning
  • On-prem or cloud deployment
  • HIPAA/SOC2 compliance

Industry-Leading Enterprise Compliance

Meeting the highest compliance requirements for your business

AICPA SOC 2 · ISO 27001 · CCPA · GDPR

Frequently Asked Questions

Everything you need to know about Private AI Inference

Privacy & Security Guarantees

Developer Experience

Industry Use Cases

Start Private AI Inference Today

Deploy confidential LLM endpoints with hardware-enforced encryption and zero-logging guarantees.

Deploy on Phala
  • Intel TDX & AMD SEV support
  • Remote attestation built-in
  • Zero-trust architecture
  • Enterprise-ready compliance
  • 24/7 technical support