Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted.

Qwen3-235B-A22B-GPTQ-Int4 — Two-node cluster

Date: 2026-05-03
Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87
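For reference, the two-node launch looked roughly like the sketch below. The Ray address and model repo id are placeholders rather than the exact values used here; the vLLM flags simply mirror the config line above.

```bash
# Node 1 (head): start Ray over the RoCE interconnect (address/port are placeholders)
ray start --head --port=6379

# Node 2 (worker): join the Ray cluster started on node 1
ray start --address=192.168.100.1:6379

# Node 1: serve the model across both nodes (model id is a placeholder; flags mirror the config above)
vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --distributed-executor-backend ray \
  --quantization gptq_marlin \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.87 \
  --attention-backend TRITON_ATTN
```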

| Batch | Avg completion tokens | tok/s per request | Aggregate tok/s |
|---|---|---|---|
| 1 (serial) | 256 | 17.0 | 17.0 |
| 2 (concurrent) | 256 | 12.1 | 24.1 |
| 4 (concurrent) | 256 | 9.1 | 36.4 |

Prefix cache: 97% delta hit rate on the repeated system prompt.
Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile).
Resident weight memory per node: 57.64 GiB.

vs NVIDIA’s officially published number

| Source | Config | tok/s (batch=1) |
|---|---|---|
| This lab | GPTQ-Int4, vLLM gptq_marlin | 17.0 |
| NVIDIA official | NVFP4, TRT-LLM | 11.73 |

Our GPTQ-Int4 + vLLM result beats NVIDIA’s own published NVFP4 + TRT-LLM number by ~45% at batch=1. The likely reason is that the NVFP4 MoE kernel for SM121 is still maturing, while GPTQ-Marlin is the better-optimized path on SM121 today.

gpt-oss-120b — Single node

Date: 2026-05-02
Config: TP=1, --quantization=mxfp4, --kv-cache-dtype=fp8, --attention-backend=TRITON_ATTN, --moe-backend=marlin, --gpu-memory-utilization=0.87, --max-cudagraph-capture-size=2048
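A single-node launch with these settings is sketched below; the model repo id is assumed to be the public gpt-oss-120b checkpoint, and the flags are copied from the config line above.

```bash
# Single-node serve on one GB10 (model id assumed; flags as listed in the config above)
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --quantization mxfp4 \
  --kv-cache-dtype fp8 \
  --attention-backend TRITON_ATTN \
  --moe-backend marlin \
  --gpu-memory-utilization 0.87 \
  --max-cudagraph-capture-size 2048
```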

| Metric | Value |
|---|---|
| Generation throughput (with reasoning overhead) | ~32–35 tok/s |
| Pure decode (no reasoning) | 57–60 tok/s |
| Prefix cache hit rate | ~76% |
| Context window | 128,000 tokens |

Effect of --enforce-eager

| Config | tok/s |
|---|---|
| CUDAGraph enabled (default) | ~59 |
| --enforce-eager (CUDAGraph disabled) | ~26 |

--enforce-eager cuts throughput by ~55% on SM121. Never use it.

Effect of --max-cudagraph-capture-size

The NVIDIA default Blackwell recipe sets this to 32, which limits graph coverage to batch sizes up to 32. Setting it to 2048 provides full coverage from batch=1 through batch=2048 with no meaningful overhead.
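In practice this is a one-flag override of the default recipe (sketch; `<model>` stands in for whichever checkpoint is being served):

```bash
# Default Blackwell recipe: CUDA graphs only cover batch sizes up to 32
vllm serve <model> --max-cudagraph-capture-size 32

# Override used here: graph coverage from batch=1 through batch=2048
vllm serve <model> --max-cudagraph-capture-size 2048
```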

Inter-node network (QSFP-DD RoCE direct connect)

Measured with ib_write_bw (single QP, 4096 B MTU); RDMA over RoCE operation confirmed:

| Channel | Bandwidth |
|---|---|
| Channel 1 | 13.35 Gb/s |
| Channel 2 | 13.26 Gb/s |
| Combined | ~26.6 Gb/s |

Note: a single QP with a 4096 B MTU caps results well below the 200 Gb/s theoretical maximum; this is an artifact of the benchmark configuration, not a link limitation. A multi-QP run with a 9000 B (jumbo frame) MTU should approach the full ceiling. Model weight cache sync over QSFP-DD averages ~500–580 MB/s (118 GB transferred in ~3–4 minutes).
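A multi-QP re-run would look roughly like the sketch below; the interface name, RDMA device name, GID index, and peer address are placeholders for this setup.

```bash
# Raise the interface MTU to jumbo frames on both nodes (interface name is a placeholder)
sudo ip link set dev enP2p1s0 mtu 9000

# Server side: 4 queue pairs, results reported in Gb/s
ib_write_bw -d rocep1s0 -q 4 -x 3 --report_gbits -F

# Client side: same options, pointed at the server's RoCE address (placeholder IP)
ib_write_bw -d rocep1s0 -q 4 -x 3 --report_gbits -F 192.168.100.1
```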

Community benchmarks for comparison

| Model | Source | Config | tok/s |
|---|---|---|---|
| Qwen3-235B | This lab | GPTQ-Int4, gptq_marlin, TP=2 | 17 (b=1) / 36 agg (b=4) |
| Qwen3-235B | NVIDIA official | NVFP4, TRT-LLM | 11.73 (b=1) |
| gpt-oss-120b | This lab | mxfp4, Marlin, TP=1 | 57–60 (pure decode) |
| Qwen3-30B-A3B | Community (jleighfields) | NVFP4, vLLM | 32–45 |
| Qwen3.6-27B | Community (NVIDIA forums) | FP8, stock NGC | 14–21 |
| Qwen3.6-27B | Community (mitkox fork) | FP8, DFlash+DDTree | 136–200 |