Ready · GPU TEE

Sentence Transformers: all-MiniLM-L6-v2

Model ID: sentence-transformers/all-minilm-l6-v2

The all-MiniLM-L6-v2 embedding model maps sentences and short paragraphs to 384-dimensional dense vectors. These vectors capture semantic meaning and suit downstream tasks such as information retrieval, clustering, similarity scoring, and text ranking.
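Similarity scoring and ranking over embeddings usually come down to comparing vector directions. The following is a minimal sketch of cosine-similarity ranking in plain Python. The helper names `cosine_similarity` and `rank_by_similarity` are illustrative, not part of any Phala or OpenAI API, and the 4-dimensional vectors are toy stand-ins for the 384-dimensional vectors this model returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 4-dimensional stand-ins; real vectors from this model have 384 entries.
query = [1.0, 0.0, 1.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
    [2.0, 0.0, 2.0, 0.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0, 0.0],  # partial overlap with the query
]

ranking = rank_by_similarity(query, candidates)
print([index for index, _ in ranking])  # [1, 2, 0]
```

Cosine similarity ignores vector length, which is why the second candidate ranks first even though its magnitude differs from the query's.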


input: $0.0050/M tokens
output: Free
context: 512 tokens
created: Nov 25, 2025
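As a rough budgeting aid, the listed input price can be turned into a per-batch estimate. This sketch assumes that "/M" means per million input tokens and that the 512 figure is a per-input token limit. It raises an error for over-long inputs purely as an illustration, since the listing does not say whether the service truncates or rejects them. `estimate_input_cost` is a hypothetical helper, and token counts must come from your own tokenizer.

```python
from decimal import Decimal

INPUT_PRICE_PER_MILLION = Decimal("0.0050")  # USD, taken from the listing
CONTEXT_LIMIT = 512  # tokens per input, taken from the listing


def estimate_input_cost(token_counts):
    """Estimate the USD cost of embedding inputs with the given token counts."""
    for count in token_counts:
        if count > CONTEXT_LIMIT:
            # Assumption: treat over-long inputs as an error in this sketch.
            raise ValueError(f"input of {count} tokens exceeds {CONTEXT_LIMIT}")
    total_tokens = sum(token_counts)
    return Decimal(total_tokens) * INPUT_PRICE_PER_MILLION / Decimal(1_000_000)


# 10,000 inputs of 200 tokens each is 2,000,000 tokens in total.
print(estimate_input_cost([200] * 10_000))
```

Under these assumptions, two million input tokens cost about one cent, and output is listed as free.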

Supported API shape

input: text · embeddings
output: embeddings
tools: Not listed
json mode: Not listed

Verification

receipt: x-receipt-id
attestation: gateway report
session: attested upstream
provider: Phala
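The receipt entry names `x-receipt-id`. A minimal sketch of capturing it alongside an embedding follows, using only the Python standard library. It assumes the receipt ID arrives as an HTTP response header of that name on embeddings responses, which the listing implies but does not state. `extract_receipt_id` and `embed_with_receipt` are hypothetical helper names, and what to do with the receipt afterwards is not covered here.

```python
import json
import os
import urllib.request

BASE_URL = "https://inference.phala.com/v1"
MODEL = "sentence-transformers/all-minilm-l6-v2"


def extract_receipt_id(headers):
    """Return the x-receipt-id value from a header mapping, or None if absent."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-receipt-id":
            return value
    return None


def embed_with_receipt(text):
    """POST one input to the embeddings endpoint; return (embedding, receipt_id)."""
    payload = json.dumps({"model": MODEL, "input": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=payload,
        headers={
            "Authorization": f"Bearer {os.environ['PHALA_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
        # Assumption: the receipt ID is exposed as a response header.
        receipt_id = extract_receipt_id(response.headers)
    return body["data"][0]["embedding"], receipt_id
```

With `PHALA_API_KEY` set, `embed_with_receipt("some text")` returns the vector and the receipt ID, or `None` for the ID if the header is not present. Storing the ID next to each stored vector keeps the two traceable together.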

Provider

Phala · GPU TEE
input: $0.0050/M tokens
output: Free
context: 512 tokens

$ pip install openai

from openai import OpenAI
import os

MODEL = "sentence-transformers/all-minilm-l6-v2"

client = OpenAI(
    base_url="https://inference.phala.com/v1",
    api_key=os.environ["PHALA_API_KEY"],
)

response = client.embeddings.create(
    model=MODEL,
    input="Phala keeps inference data inside a confidential runtime.",
)

print(response.data[0].embedding[:8])

$ npm install openai

import OpenAI from "openai"

const MODEL = "sentence-transformers/all-minilm-l6-v2"

const openai = new OpenAI({
  baseURL: "https://inference.phala.com/v1",
  apiKey: process.env.PHALA_API_KEY,
})

const response = await openai.embeddings.create({
  model: MODEL,
  input: "Phala keeps inference data inside a confidential runtime.",
})

console.log(response.data[0].embedding.slice(0, 8))

curl https://inference.phala.com/v1/embeddings \
  -H "Authorization: Bearer $PHALA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-minilm-l6-v2",
    "input": "Phala keeps inference data inside a confidential runtime."
  }'

More models

Other private inference routes.


encrypted

Qwen: Qwen3.6 27B

Qwen3.6 27B is a dense 27-billion-parameter language model from the Qwen Team at Alibaba, released in April 2026. It features hybrid multimodal capabilities accepting text and image inputs, a configurable thinking/reasoning mode, and a native 262K context window. Served as a TEE deployment via Chutes.

context

262K

input

$0.32/M

encrypted

Google: Gemma 4 31B

Gemma 4 31B Instruct is Google DeepMind's 30.7B dense model. Features a 256K token context window, configurable thinking/reasoning mode, native function calling, and strong multilingual performance. Served as a text-only TEE deployment via NEAR AI.

context

262K

input

$0.15/M

encrypted

Phala: Gemma-4 26B-A4B Uncensored (Heretic)

Uncensored "Heretic" variant of google/gemma-4-26B-A4B-it created using Heretic v1.2.0 with the Arbitrary-Rank Ablation (ARA) method and row-norm preservation. Refusals drop from 100/100 to 11/100 with KL divergence 0.0499 vs the base model. The base Gemma 4 26B A4B is a Mixture-of-Experts model with 25.2B total / 3.8B active parameters (8 active / 128 total experts), a 30-layer transformer with hybrid local sliding (1024) + global attention, supporting a 256K context window. Natively multimodal (text + images, variable aspect ratios). Strong on coding, reasoning, and function calling, with native system prompt support across 35+ languages. Served on Phala in a TDX-attested H200 enclave with end-to-end ECDSA response signing; vLLM-compatible FP8-Static quantization by cloud19 (router excluded from quantization).

context

66K

input

$0.15/M

encrypted

Phala: Qwen3.6 35B-A3B Uncensored (Aggressive)

Uncensored "Aggressive" variant of Qwen3.6-35B-A3B from Alibaba's Qwen team. The fine-tune by HauhauCS removes refusal behaviors (0/465 refusals) without modifying datasets or core capabilities. The base architecture is a 35B-parameter Mixture-of-Experts model with 256 experts routing 8 per token (~3B active params), 40 layers, and a hybrid linear+full-softmax attention mechanism (3:1 ratio). It supports a native 262K context and is natively multimodal across text, images, and video. Served on Phala in a TDX-attested H200 enclave with end-to-end ECDSA response signing; FP8 quantization by lamianlbe.

context

131K

input

$0.30/M