Running GLM-5.2 1M Context on a Single 8×H200 Node

Jun 18, 2026 · 5 min read

GLM-5.2 gives open-weight coding models a real 1M-token context window. The hard part is serving that full window on the hardware many teams already run in production: Hopper.

We quantized GLM-5.2-FP8 into W4AFP8 and validated it on a single 8×H200 node with SGLang. The checkpoint cuts weight memory from 755 GB to 368 GB, freeing 387 GB of HBM for the 1M-token KV cache and runtime headroom.

Why this matters

GLM-5.2 already solved the model side of long context: sparse attention, IndexShare, MTP speculative decoding, tool use, reasoning, and a 1,048,576-token window. Deployment is a separate problem. Serving a 1M-token window requires HBM for the model weights, the KV cache, CUDA graphs, runtime buffers, and serving overhead, all at the same time.

The official FP8 checkpoint is the right general serving baseline. On Hopper, that baseline leaves much less memory slack once you push toward the full context window. W4AFP8 changes the memory budget without changing the model family, tokenizer, or API shape, and the evaluations below show behavior staying close to the FP8 baseline.
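
To make the budget concrete, here is a rough back-of-the-envelope sketch in Python. It uses the 755 GB and 368 GB weight figures from this post and the 0.85 static memory fraction from the launch command further down. The 141 GB per-GPU capacity is an assumption taken from the H200 vendor spec, and the sketch treats the node as one memory pool, which ignores how tensor parallelism actually distributes weights across GPUs.

```python
# Rough HBM budget for one 8xH200 node (all figures in GB).
# Assumptions: 141 GB HBM per H200 (vendor spec), and the static
# fraction covers model weights plus the KV cache pool.

H200_HBM_GB = 141            # per-GPU capacity, assumed from the vendor spec
NUM_GPUS = 8
MEM_FRACTION_STATIC = 0.85   # value used in the launch command in this post

WEIGHTS_GB = {"FP8": 755, "W4AFP8": 368}  # weight footprints quoted in this post


def kv_budget_gb(weights_gb: float) -> float:
    """Node-wide memory left for the KV cache pool after loading weights."""
    static_pool = H200_HBM_GB * NUM_GPUS * MEM_FRACTION_STATIC
    return static_pool - weights_gb


if __name__ == "__main__":
    for name, weights in WEIGHTS_GB.items():
        print(f"{name}: ~{kv_budget_gb(weights):.0f} GB left for KV cache")
```

Under these assumptions the static pool is about 959 GB, so the FP8 weights leave roughly 204 GB for the KV cache while W4AFP8 leaves roughly 591 GB. The difference between the two is the same 387 GB quoted above.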

What we changed

GLM-5.2 is a large MoE model. Most of the weight footprint sits in the routed experts, so those experts are the highest-leverage place to compress. GLM-5.2-W4AFP8 stores the routed MoE expert weights as 4-bit integers and runs them with FP8 activations. Dense layers, shared experts, attention, the sparse-attention indexer, and non-expert MTP weights stay in FP8 or BF16.

The result is a Hopper-focused checkpoint for SGLang. It keeps the full logical GLM-5.2 parameter count and serves the same model at lower weight precision. The Hugging Face parameter count shown on the model page reflects packed storage elements, since two 4-bit weights share one byte.
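
The packed-storage point is easy to see in a small example. The sketch below packs signed 4-bit values two per byte and unpacks them again. The nibble order (low nibble first) and the signed range of -8 to 7 are illustrative assumptions, not the checkpoint's actual on-disk layout, and `pack_int4` and `unpack_int4` are hypothetical helper names.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("values must fit in a signed 4-bit range")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Mask to 4 bits (two's complement) and combine into one byte.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Recover the signed 4-bit integers from packed bytes."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


if __name__ == "__main__":
    weights = [-8, 7, 0, -1, 3, -4]
    packed = pack_int4(weights)
    print(len(weights), "logical weights stored in", len(packed), "bytes")
    assert unpack_int4(packed) == weights
```

A tool that counts stored elements sees one element per byte here, which is half the number of logical weights. That is the same effect behind the lower parameter count on the model page.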

Quality stayed aligned

We evaluated the checkpoint on 8×H200 with SGLang v0.5.13.post1 using the sampling settings recommended for GLM-5.2: temperature 1.0, top_p 0.95, reasoning enabled. The benchmark goal was simple: check that the memory savings do not cost model quality.

GPQA-Diamond came in at 90.9 versus the official 91.2 reference. IFBench strict came in at 75.0 versus 74.3. AA-LCR long-context reasoning came in at 76.0 versus 69.7. Needle-in-a-haystack at roughly 983K tokens retrieved 3 out of 3 needles. BFCL tool calling matched the reference behavior.

That sets the boundary of the claim: GLM-5.2-W4AFP8 is aligned with the official GLM-5.2 baseline on the benchmarks we ran, with long-context retrieval and tool calling intact.

MTP speculative decoding still works

GLM-5.2 includes a Multi-Token Prediction layer for EAGLE-style speculative decoding. We quantized and validated that path as part of the release. In SGLang, the main model still verifies every drafted token, so speculative decoding affects latency and throughput rather than output semantics.

On a single stream, decode throughput increased from 75 tok/s to 118 tok/s with EAGLE. In the serving test with 8K-token inputs, 1K-token outputs, and 16 concurrent requests, aggregate completion throughput increased from 658 tok/s to 733 tok/s.
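
For comparison across the two tests, the throughput numbers can be turned into speedup ratios. This is a minimal sketch that only does arithmetic on the figures quoted above. `speedup` is a hypothetical helper name.

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Ratio of throughput with EAGLE to throughput without it."""
    return after_tok_s / before_tok_s


if __name__ == "__main__":
    cases = {
        "single stream": (75, 118),
        "16 concurrent, 8K in / 1K out": (658, 733),
    }
    for name, (before, after) in cases.items():
        ratio = speedup(before, after)
        print(f"{name}: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% faster)")
```

That works out to roughly 1.57x on a single stream and 1.11x in the concurrent test, so in these measurements the gain from speculative decoding is largest at low concurrency.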

How to run it

The checkpoint is validated with SGLang v0.5.13.post1 on Hopper GPUs. Use the W4AFP8 quantization path, FP8 KV cache, GLM reasoning and tool parsers, and the full 1,048,576-token context length.

python -m sglang.launch_server \
  --model-path PhalaCloud/GLM-5.2-W4AFP8 \
  --quantization w4afp8 --disable-shared-experts-fusion \
  --tp 8 --kv-cache-dtype fp8_e4m3 \
  --reasoning-parser glm45 --tool-call-parser glm47 \
  --context-length 1048576 --mem-fraction-static 0.85 \
  --trust-remote-code

To use the 118 tok/s single-stream path, add EAGLE speculative decoding:

--speculative-algorithm EAGLE --speculative-num-steps 1 \
--speculative-eagle-topk 1 --speculative-num-draft-tokens 2
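
If you launch the server from a script, one way to keep the base command and the EAGLE fragment together is to build the argument list in Python. This sketch uses only the flags shown in this post. `build_launch_command` is a hypothetical helper, not part of SGLang.

```python
import shlex

# Base launch arguments, copied from the command in this post.
BASE_ARGS = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "PhalaCloud/GLM-5.2-W4AFP8",
    "--quantization", "w4afp8", "--disable-shared-experts-fusion",
    "--tp", "8", "--kv-cache-dtype", "fp8_e4m3",
    "--reasoning-parser", "glm45", "--tool-call-parser", "glm47",
    "--context-length", "1048576", "--mem-fraction-static", "0.85",
    "--trust-remote-code",
]

# Optional EAGLE speculative decoding flags, also from this post.
EAGLE_ARGS = [
    "--speculative-algorithm", "EAGLE", "--speculative-num-steps", "1",
    "--speculative-eagle-topk", "1", "--speculative-num-draft-tokens", "2",
]


def build_launch_command(use_eagle: bool = False) -> list:
    """Return the argv list, with the EAGLE flags appended when requested."""
    return BASE_ARGS + (EAGLE_ARGS if use_eagle else [])


if __name__ == "__main__":
    # Print a copy-pasteable, shell-quoted command line.
    print(shlex.join(build_launch_command(use_eagle=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues. Printing it with `shlex.join` gives the equivalent one-line shell command.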

Scope

  • Validated with SGLang v0.5.13.post1.
  • Designed for Hopper GPUs: H100 or H200, SM90.
  • vLLM and other engines have not been validated for this checkpoint.
  • The checkpoint inherits the capabilities, license, and limitations of the base GLM-5.2 model.

The practical takeaway

GLM-5.2 made 1M context available in an open model. GLM-5.2-W4AFP8 makes that window practical on a single 8×H200 node.

For teams with Hopper fleets, this turns GLM-5.2 from a long-context model release into a deployable 1M-context serving target.

Model: https://huggingface.co/PhalaCloud/GLM-5.2-W4AFP8
