Ready · GPU TEE

Qwen: Qwen3.5 397B A17B

Model ID: qwen/qwen3.5-397b-a17b

The Qwen3.5 series 397B-A17B native vision-language model is built on a hybrid architecture that integrates a linear attention mechanism with a sparse mixture-of-experts model, achieving higher inference efficiency. It delivers state-of-the-art performance comparable to leading-edge models across a wide range of tasks, including language understanding, logical reasoning, code generation, agent-based tasks, image understanding, video understanding, and graphical user interface (GUI) interactions. With its robust code-generation and agent capabilities, the model exhibits strong generalization across diverse agent.


Input: $0.55/M tokens
Output: $3.50/M tokens
Context: 262K
Created: Feb 28, 2026
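The listed rates translate into a per-request cost as follows. This is a minimal sketch: the function names are illustrative, "262K" is read conservatively as 262,000 tokens, and it assumes prompt and completion tokens share the same context window. Actual billing may round or count tokens differently.

```python
from decimal import Decimal

# Rates listed on this page, in USD per one million tokens.
INPUT_USD_PER_M = Decimal("0.55")
OUTPUT_USD_PER_M = Decimal("3.50")
TOKENS_PER_M = Decimal(1_000_000)

# Assumption: "262K" read conservatively as 262,000 tokens.
CONTEXT_TOKENS = 262_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request at the listed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = INPUT_USD_PER_M * input_tokens + OUTPUT_USD_PER_M * output_tokens
    return total / TOKENS_PER_M


def fits_context(input_tokens: int, output_tokens: int) -> bool:
    """Check prompt plus completion against the assumed context window."""
    return input_tokens + output_tokens <= CONTEXT_TOKENS


if __name__ == "__main__":
    # A long prompt with a short answer.
    print(estimate_cost_usd(200_000, 4_000))
    print(fits_context(200_000, 4_000))
```

Decimal arithmetic is used so that small per-token amounts do not pick up floating-point noise when summed over many requests.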

Supported API shape

Input: text · image · video
Output: text
Tools: Supported
JSON mode: Supported
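A request that uses image input, JSON mode, and tools might be assembled as sketched below. The content-part layout, `response_format`, and `tools` fields follow the OpenAI Chat Completions conventions; that this endpoint accepts them in exactly this form is an assumption. `build_request` and the `get_weather` tool are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List, Optional

MODEL = "qwen/qwen3.5-397b-a17b"


def build_request(
    prompt: str,
    image_url: Optional[str] = None,
    json_mode: bool = False,
    tools: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Build a chat-completions payload; nothing is sent over the network."""
    # Multimodal messages carry a list of typed content parts.
    content: List[Dict[str, Any]] = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})

    payload: Dict[str, Any] = {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
    }
    if json_mode:
        # Ask for a JSON object instead of free text.
        payload["response_format"] = {"type": "json_object"}
    if tools:
        payload["tools"] = tools
    return payload


# Hypothetical tool definition, for illustration only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

if __name__ == "__main__":
    payload = build_request(
        "Describe this image as JSON.",
        image_url="https://example.com/cat.png",
        json_mode=True,
    )
    print(payload)
```

With the OpenAI Python client, such a payload could be passed as `client.chat.completions.create(**payload)`.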

Verification

Signature: Response ID
Attestation: GPU TEE
Provider: Phala
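Since the signature is tied to the response ID, it can help to keep that ID together with digests of what was sent and received. The sketch below only builds such a local record. The `AuditRecord` type, the canonical-JSON hashing, and the choice of SHA-256 are assumptions for illustration; this page does not describe the provider's actual signed payload format or lookup endpoint.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AuditRecord:
    """Local bookkeeping for a later signature check (hypothetical layout)."""
    response_id: str
    model: str
    request_sha256: str
    response_sha256: str


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def make_audit_record(
    response_id: str,
    model: str,
    request_body: Dict[str, Any],
    response_text: str,
) -> AuditRecord:
    # Sort keys and drop whitespace so the digest does not depend on dict order.
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return AuditRecord(
        response_id=response_id,
        model=model,
        request_sha256=sha256_hex(canonical.encode("utf-8")),
        response_sha256=sha256_hex(response_text.encode("utf-8")),
    )


if __name__ == "__main__":
    record = make_audit_record(
        "example-response-id",  # placeholder, not a real ID
        "qwen/qwen3.5-397b-a17b",
        {"model": "qwen/qwen3.5-397b-a17b", "messages": []},
        "Hello.",
    )
    print(record)
```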

Provider: Phala (GPU TEE)
Input: $0.55/M tokens
Output: $3.50/M tokens
Context: 262K

$ pip install openai

from openai import OpenAI
import os

MODEL = "qwen/qwen3.5-397b-a17b"

# OpenAI-compatible client pointed at the api.redpill.ai endpoint.
client = OpenAI(
    base_url="https://api.redpill.ai/v1",
    api_key=os.environ["REDPILL_API_KEY"],
)

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Run this privately."}],
)

# Print the response ID first, then the generated text.
print(response.id)
print(response.choices[0].message.content)

$ npm install openai

import OpenAI from "openai"

const MODEL = "qwen/qwen3.5-397b-a17b"

const openai = new OpenAI({
  baseURL: "https://api.redpill.ai/v1",
  apiKey: process.env.REDPILL_API_KEY,
})

const response = await openai.chat.completions.create({
  model: MODEL,
  messages: [{ role: "user", content: "Run this privately." }],
})

console.log(response.id)
console.log(response.choices[0].message.content)

curl https://api.redpill.ai/v1/chat/completions \
  -H "Authorization: Bearer $REDPILL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.5-397b-a17b",
    "messages": [
      { "role": "user", "content": "Run this privately." }
    ]
  }'

More models

Other private inference routes.


Encrypted

Phala: Gemma-4 26B-A4B Uncensored (Heretic)

Uncensored "Heretic" variant of google/gemma-4-26B-A4B-it, created using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method and row-norm preservation. Refusals drop from 100/100 to 11/100, with a KL divergence of 0.0499 versus the base model. The base Gemma 4 26B A4B is a Mixture-of-Experts model with 25.2B total / 3.8B active parameters (8 active / 128 total experts) and a 30-layer transformer with hybrid local sliding (1024) + global attention, supporting a 256K context window. It is natively multimodal (text + images, variable aspect ratios) and strong on coding, reasoning, and function calling, with native system prompt support across 35+ languages. Served on Phala in a TDX-attested H200 enclave with end-to-end ECDSA response signing; vLLM-compatible FP8-Static quantization by cloud19 (router excluded from quantization).

Context: 66K
Input: $0.15/M tokens

Encrypted

Phala: Qwen3.6 35B-A3B Uncensored (Aggressive)

Uncensored "Aggressive" variant of Qwen3.6-35B-A3B from Alibaba's Qwen team. The fine-tune by HauhauCS removes refusal behaviors (0/465 refusals) without modifying datasets or core capabilities. The base architecture is a 35B-parameter Mixture-of-Experts model with 256 experts, routing 8 per token (~3B active parameters), 40 layers, and a hybrid linear + full-softmax attention mechanism (3:1 ratio). It supports a native 262K context and is natively multimodal across text, images, and video. Served on Phala in a TDX-attested H200 enclave with end-to-end ECDSA response signing; FP8 quantization by lamianlbe.

Context: 131K
Input: $0.30/M tokens

Encrypted

Qwen: Qwen3.5-27B

The Qwen3.5 27B native vision-language dense model incorporates a linear attention mechanism, delivering fast response times while balancing inference speed and performance. Its overall capabilities are comparable to those of Qwen3.5-122B-A10B.

Context: 262K
Input: $0.30/M tokens

Encrypted

Z.AI: GLM 4.7 Flash

As a 30B-class SOTA model, GLM-4.7-Flash offers a new option that balances performance and efficiency. It is further optimized for agentic coding use cases, strengthening coding capabilities, long-horizon task planning, and tool collaboration, and has achieved leading performance among open-source models of the same size on several current public benchmark leaderboards.

Context: 203K
Input: $0.10/M tokens
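The Mixture-of-Experts figures quoted in the cards above can be checked with a little arithmetic. The sketch below uses only numbers stated in the descriptions; the helper names are illustrative, and it assumes the 3:1 ratio means three linear-attention layers for every full-softmax layer, applied evenly across all layers.

```python
from fractions import Fraction
from typing import Tuple


def active_expert_fraction(active: int, total: int) -> Fraction:
    """Share of experts that process each token."""
    return Fraction(active, total)


def split_layers(total_layers: int, linear_per_full: int) -> Tuple[int, int]:
    """Split layers into (linear, full) counts for a linear:full ratio of N:1."""
    group = linear_per_full + 1
    if total_layers % group != 0:
        raise ValueError("layer count is not a multiple of the ratio group")
    full = total_layers // group
    return total_layers - full, full


if __name__ == "__main__":
    # Gemma 4 26B A4B: 8 active of 128 experts, 3.8B of 25.2B parameters.
    print(active_expert_fraction(8, 128))
    print(round(3.8 / 25.2 * 100, 1), "% of parameters active")

    # Qwen3.6-35B-A3B: 8 of 256 experts, 40 layers at an assumed 3:1 split.
    print(active_expert_fraction(8, 256))
    print(split_layers(40, 3))
```

Under these assumptions, each token touches 1/16 of the experts in the Gemma variant and 1/32 in the Qwen3.6 variant, and the 40 Qwen3.6 layers divide into 30 linear-attention and 10 full-attention layers.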