Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted.
Qwen3-235B-A22B-GPTQ-Int4 — Two-node cluster
Date: 2026-05-03
Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87
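A minimal launch sketch for this configuration. The Ray commands are standard; the vllm serve flags are taken verbatim from the config above, and the model ID, node IPs, and ports are placeholders, so treat it as illustrative rather than a verified recipe:

```bash
# Node 1 (head): start the Ray head on the RoCE-facing address (placeholder IP)
ray start --head --port=6379 --node-ip-address=192.168.100.1

# Node 2 (worker): join the cluster over the direct link
ray start --address=192.168.100.1:6379 --node-ip-address=192.168.100.2

# Head node: serve with TP=2 / EP=2 across both nodes
# (model ID is a placeholder for whichever GPTQ-Int4 export was used)
vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --distributed-executor-backend ray \
  --attention-backend TRITON_ATTN \
  --quantization gptq_marlin \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.87
```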
| Batch | Avg completion tokens | tok/s per request | Aggregate tok/s |
|---|---|---|---|
| 1 (serial) | 256 | 17.0 | 17.0 |
| 2 (concurrent) | 256 | 12.1 | 24.1 |
| 4 (concurrent) | 256 | 9.1 | 36.4 |
Prefix cache: 97% delta hit rate on repeated system prompt.
Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile).
Resident model weights per node: 57.64 GiB.
vs NVIDIA’s officially published number
| Source | Config | tok/s (batch=1) |
|---|---|---|
| This lab | GPTQ-Int4, vLLM gptq_marlin | 17.0 |
| NVIDIA official | NVFP4, TRT-LLM | 11.73 |
Our GPTQ-Int4 + vLLM result beats NVIDIA’s own published NVFP4 + TRT-LLM number by ~45% at batch=1. The NVFP4 SM121 MoE kernel is still maturing — GPTQ-Marlin is a more optimized path on SM121 today.
gpt-oss-120b — Single node
Date: 2026-05-02
Config: TP=1, --quantization=mxfp4, --kv-cache-dtype=fp8, --attention-backend=TRITON_ATTN, --moe-backend=marlin, --gpu-memory-utilization=0.87, --max-cudagraph-capture-size=2048
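For reference, a single-node launch sketch with the flags above (model ID assumed to be the openai/gpt-oss-120b release; --moe-backend and --max-cudagraph-capture-size are copied as listed and may be specific to this vLLM build):

```bash
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --quantization mxfp4 \
  --kv-cache-dtype fp8 \
  --attention-backend TRITON_ATTN \
  --moe-backend marlin \
  --gpu-memory-utilization 0.87 \
  --max-cudagraph-capture-size 2048
```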
| Metric | Value |
|---|---|
| Generation throughput (with reasoning overhead) | ~32–35 tok/s |
| Pure decode (no reasoning) | 57–60 tok/s |
| Prefix cache hit rate | ~76% |
| Context window | 128,000 tokens |
Effect of --enforce-eager
| Config | tok/s |
|---|---|
| CUDAGraph enabled (default) | ~59 |
| --enforce-eager (CUDAGraph disabled) | ~26 |
--enforce-eager cuts throughput by ~55% on SM121. Never use it.
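To reproduce the comparison, run the same workload against both launch modes. The sketch below assumes the vllm bench throughput subcommand available in recent vLLM builds; prompt/output lengths and prompt count are arbitrary:

```bash
# Baseline: CUDA graphs enabled (default)
vllm bench throughput --model openai/gpt-oss-120b \
  --input-len 1024 --output-len 256 --num-prompts 8

# Same workload with CUDA graphs disabled
vllm bench throughput --model openai/gpt-oss-120b \
  --input-len 1024 --output-len 256 --num-prompts 8 \
  --enforce-eager
```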
Effect of --max-cudagraph-capture-size
The NVIDIA default Blackwell recipe sets this to 32, which limits graph coverage to batch sizes up to 32. Setting it to 2048 provides full coverage from batch=1 through batch=2048 with no meaningful overhead.
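In launch terms the only difference is the flag value; a sketch assuming the flag is accepted as in the config above:

```bash
# NVIDIA Blackwell recipe default: graphs captured only for batch sizes up to 32
vllm serve openai/gpt-oss-120b --max-cudagraph-capture-size 32

# Full coverage used here: graphs captured for batch sizes up to 2048
vllm serve openai/gpt-oss-120b --max-cudagraph-capture-size 2048
```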
Inter-node network (QSFP-DD RoCE direct connect)
Measured with ib_write_bw (single QP, 4096 B MTU), RDMA/RoCE confirmed:
| Channel | Bandwidth |
|---|---|
| Channel 1 | 13.35 Gb/s |
| Channel 2 | 13.26 Gb/s |
| Combined | ~26.6 Gb/s |
Note: single QP with 4096 B MTU caps results well below the 200 Gb/s theoretical maximum; this is an artifact of the benchmark configuration rather than a link limit. A multi-QP test with 9000 B MTU should approach the full ceiling. Model weight cache sync over QSFP-DD averages ~500–580 MB/s (118 GB transferred in ~3–4 minutes).
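A follow-up measurement along those lines might look like the sketch below. The RDMA device name (mlx5_0), GID index, interface name, and peer address are placeholders; perftest's -q flag sets the number of queue pairs:

```bash
# Enable jumbo frames on the RoCE interface on both nodes (interface name is a placeholder)
sudo ip link set dev enp1s0f0 mtu 9000

# Server side
ib_write_bw -d mlx5_0 -x 3 -q 8 -s 1048576 --report_gbits -F

# Client side: point at the server's RoCE address (placeholder IP)
ib_write_bw -d mlx5_0 -x 3 -q 8 -s 1048576 --report_gbits -F 192.168.100.1
```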
Community benchmarks for comparison
| Model | Source | Config | tok/s |
|---|---|---|---|
| Qwen3-235B | This lab | GPTQ-Int4, gptq_marlin, TP=2 | 17 (b=1) / 36 agg (b=4) |
| Qwen3-235B | NVIDIA official | NVFP4, TRT-LLM | 11.73 (b=1) |
| gpt-oss-120b | This lab | mxfp4, Marlin, TP=1 | 57–60 (pure decode) |
| Qwen3-30B-A3B | Community (jleighfields) | NVFP4, vLLM | 32–45 |
| Qwen3.6-27B | Community (NVIDIA forums) | FP8, stock NGC | 14–21 |
| Qwen3.6-27B | Community (mitkox fork) | FP8, DFlash+DDTree | 136–200 |