Ready · GPU TEE

Qwen2.5 7B Instruct

Model ID: qwen/qwen-2.5-7b-instruct

Qwen2.5 7B belongs to Qwen2.5, the latest series of Qwen large language models. Qwen2.5 brings the following improvements over Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to Qwen's specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. It is also more resilient to diverse system prompts, which improves role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Usage of this model is subject to the [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).


input: $0.04/M tokens
output: $0.10/M tokens
context: 33K tokens
created: Oct 3, 2025
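The listed prices can be turned into a per-request estimate with simple arithmetic. The sketch below is illustrative only: the function name `estimate_cost_usd` is hypothetical, and it assumes billing is strictly linear in token counts at the rates shown above, with no minimums or rounding.

```python
# Prices copied from the listing above, in USD per million tokens.
INPUT_USD_PER_M = 0.04
OUTPUT_USD_PER_M = 0.10


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request from its token counts.

    Assumes linear per-token billing; actual invoices may differ.
    """
    return (
        prompt_tokens * INPUT_USD_PER_M + completion_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000


if __name__ == "__main__":
    # Example: a 2,000-token prompt with a 500-token reply.
    print(f"${estimate_cost_usd(2_000, 500):.6f}")  # prints $0.000130
```

If the endpoint returns a `usage` object in the OpenAI-compatible response format (an assumption, not confirmed by this page), `response.usage.prompt_tokens` and `response.usage.completion_tokens` could be passed in directly.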

Supported API shape

input: text
output: text
tools: Supported
json mode: Not listed
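Since tools are listed as supported, a tool-calling request might look like the following sketch. It assumes the endpoint accepts the standard OpenAI `tools` parameter and that the `openai` Python package is installed. The `get_weather` tool, its schema, and the `dispatch` helper are hypothetical names made up for illustration.

```python
import json
import os

MODEL = "qwen/qwen-2.5-7b-instruct"

# Hypothetical tool definition, written in the OpenAI function-calling schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def get_weather(city: str) -> str:
    """Stand-in implementation; a real tool would call a weather service."""
    return f"Sunny in {city}"


def dispatch(name: str, arguments_json: str) -> str:
    """Run the local function that matches a tool call returned by the model."""
    handlers = {"get_weather": get_weather}
    args = json.loads(arguments_json)
    return handlers[name](**args)


def main() -> None:
    # Imported here so the helpers above work without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://inference.phala.com/v1",
        api_key=os.environ["PHALA_API_KEY"],
    )
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "What is the weather in Paris?"}],
        tools=TOOLS,
    )
    # The model may answer directly, in which case tool_calls is None.
    for call in response.choices[0].message.tool_calls or []:
        print(call.function.name, dispatch(call.function.name, call.function.arguments))


# Only contact the API when a key is configured.
if __name__ == "__main__" and os.environ.get("PHALA_API_KEY"):
    main()
```

Keeping `dispatch` separate from the network call makes the tool logic testable offline; a complete loop would also send each tool result back to the model as a `tool` role message.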

Verification

receipt: x-receipt-id
attestation: gateway report
session: attested upstream
provider: Phala
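The receipt is listed as `x-receipt-id`. Assuming this is an HTTP response header (the page does not say so explicitly), it could be read with the `openai` SDK's raw-response interface, as in the sketch below. The helper name `extract_receipt_id` is hypothetical, and the sketch only retrieves the ID; it does not show how to verify a receipt or attestation.

```python
import os
from typing import Mapping, Optional

MODEL = "qwen/qwen-2.5-7b-instruct"
RECEIPT_HEADER = "x-receipt-id"  # name taken from the listing; assumed to be a header


def extract_receipt_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the receipt ID from response headers, matching case-insensitively."""
    for name, value in headers.items():
        if name.lower() == RECEIPT_HEADER:
            return value
    return None


def main() -> None:
    # Imported here so the helper above works without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://inference.phala.com/v1",
        api_key=os.environ["PHALA_API_KEY"],
    )
    # with_raw_response exposes the HTTP headers alongside the parsed body.
    raw = client.chat.completions.with_raw_response.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Run this privately."}],
    )
    print("receipt:", extract_receipt_id(raw.headers))
    completion = raw.parse()  # the usual ChatCompletion object
    print(completion.choices[0].message.content)


# Only contact the API when a key is configured.
if __name__ == "__main__" and os.environ.get("PHALA_API_KEY"):
    main()
```

If the header is absent, `extract_receipt_id` returns `None`, so callers can decide whether a missing receipt should be treated as an error.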

Provider

Phala (GPU TEE)
input: $0.04/M tokens
output: $0.10/M tokens
context: 33K tokens

$ pip install openai

from openai import OpenAI
import os

MODEL = "qwen/qwen-2.5-7b-instruct"

client = OpenAI(
    base_url="https://inference.phala.com/v1",
    api_key=os.environ["PHALA_API_KEY"],
)

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Run this privately."}],
)

print(response.id)
print(response.choices[0].message.content)

$ npm install openai

import OpenAI from "openai"

const MODEL = "qwen/qwen-2.5-7b-instruct"

const openai = new OpenAI({
  baseURL: "https://inference.phala.com/v1",
  apiKey: process.env.PHALA_API_KEY,
})

const response = await openai.chat.completions.create({
  model: MODEL,
  messages: [{ role: "user", content: "Run this privately." }],
})

console.log(response.id)
console.log(response.choices[0].message.content)

curl https://inference.phala.com/v1/chat/completions \
  -H "Authorization: Bearer $PHALA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen-2.5-7b-instruct",
    "messages": [
      { "role": "user", "content": "Run this privately." }
    ]
  }'

More models

Other private inference routes.


Qwen: Qwen3.6 27B (encrypted)

Qwen3.6 27B is a dense 27-billion-parameter language model from the Qwen Team at Alibaba, released in April 2026. It features hybrid multimodal capabilities accepting text and image inputs, a configurable thinking/reasoning mode, and a native 262K context window. Served as a TEE deployment via Chutes.

context: 262K tokens
input: $0.32/M tokens

Google: Gemma 4 31B (encrypted)

Gemma 4 31B Instruct is Google DeepMind's 30.7B dense model. It features a 256K-token context window, a configurable thinking/reasoning mode, native function calling, and strong multilingual performance. Served as a text-only TEE deployment via NEAR AI.

context: 262K tokens
input: $0.15/M tokens

Phala: Gemma-4 26B-A4B Uncensored (Heretic) (encrypted)

Uncensored "Heretic" variant of google/gemma-4-26B-A4B-it, created using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method and row-norm preservation. Refusals drop from 100/100 to 11/100, with a KL divergence of 0.0499 vs the base model. The base Gemma 4 26B A4B is a Mixture-of-Experts model with 25.2B total / 3.8B active parameters (8 active / 128 total experts) and a 30-layer transformer with hybrid local sliding (1024) + global attention, supporting a 256K context window. It is natively multimodal (text + images, variable aspect ratios) and strong on coding, reasoning, and function calling, with native system prompt support across 35+ languages. Served on Phala in a TDX-attested H200 enclave with end-to-end ECDSA response signing; vLLM-compatible FP8-Static quantization by cloud19 (router excluded from quantization).

context: 66K tokens
input: $0.15/M tokens

Phala: Qwen3.6 35B-A3B Uncensored (Aggressive) (encrypted)

Uncensored "Aggressive" variant of Qwen3.6-35B-A3B from Alibaba's Qwen team. The fine-tune by HauhauCS removes refusal behaviors (0/465 refusals) without modifying datasets or core capabilities. The base architecture is a 35B-parameter Mixture-of-Experts model with 256 experts routing 8 per token (~3B active params), 40 layers, and a hybrid linear + full-softmax attention mechanism (3:1 ratio). It supports a native 262K context and is natively multimodal across text, images, and video. Served on Phala in a TDX-attested H200 enclave with end-to-end ECDSA response signing; FP8 quantization by lamianlbe.

context: 131K tokens
input: $0.30/M tokens