Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted.
Qwen3-235B-A22B-GPTQ-Int4 — Two-node cluster
Date: 2026-05-03
Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87
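A minimal launch sketch for this configuration. The Ray commands are standard; the vllm serve flags are taken verbatim from the config above, and the model ID, node IPs, and ports are placeholders, so treat it as illustrative rather than a verified recipe:

```bash
# Node 1 (head): start the Ray head on the RoCE-facing address (placeholder IP)
ray start --head --port=6379 --node-ip-address=192.168.100.1

# Node 2 (worker): join the cluster over the direct link
ray start --address=192.168.100.1:6379 --node-ip-address=192.168.100.2

# Head node: serve with TP=2 / EP=2 across both nodes
# (model ID is a placeholder for whichever GPTQ-Int4 export was used)
vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --distributed-executor-backend ray \
  --attention-backend TRITON_ATTN \
  --quantization gptq_marlin \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.87
```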
| Batch | Avg completion tokens | tok/s per request | Aggregate tok/s |
|---|---|---|---|
| 1 (serial) | 256 | 17.0 | 17.0 |
| 2 (concurrent) | 256 | 12.1 | 24.1 |
| 4 (concurrent) | 256 | 9.1 | 36.4 |
Prefix cache: 97% delta hit rate on repeated system prompt.
Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile).
Resident model weights per node: 57.64 GiB.
vs NVIDIA’s officially published number
| Source | Config | tok/s (batch=1) |
|---|---|---|
| This lab | GPTQ-Int4, vLLM gptq_marlin | 17.0 |
| NVIDIA official | NVFP4, TRT-LLM | 11.73 |
Our GPTQ-Int4 + vLLM result beats NVIDIA’s own published NVFP4 + TRT-LLM number by ~45% at batch=1. The NVFP4 SM121 MoE kernel is still maturing — GPTQ-Marlin is a more optimized path on SM121 today.
gpt-oss-120b — Single node
Date: 2026-05-02
Config: TP=1, --quantization=mxfp4, --kv-cache-dtype=fp8, --attention-backend=TRITON_ATTN, --moe-backend=marlin, --gpu-memory-utilization=0.87, --max-cudagraph-capture-size=2048
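For reference, a single-node launch sketch with the flags above (model ID assumed to be the openai/gpt-oss-120b release; --moe-backend and --max-cudagraph-capture-size are copied as listed and may be specific to this vLLM build):

```bash
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --quantization mxfp4 \
  --kv-cache-dtype fp8 \
  --attention-backend TRITON_ATTN \
  --moe-backend marlin \
  --gpu-memory-utilization 0.87 \
  --max-cudagraph-capture-size 2048
```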
| Metric | Value |
|---|---|
| Generation throughput (with reasoning overhead) | ~32–35 tok/s |
| Pure decode (no reasoning) | 57–60 tok/s |
| Prefix cache hit rate | ~76% |
| Context window | 128,000 tokens |
Effect of --enforce-eager
| Config | tok/s |
|---|---|
| CUDAGraph enabled (default) | ~59 |
| --enforce-eager (CUDAGraph disabled) | ~26 |
--enforce-eager cuts throughput by ~55% on SM121. Never use it.
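To reproduce the comparison, run the same workload against both launch modes. The sketch below assumes the vllm bench throughput subcommand available in recent vLLM builds; prompt/output lengths and prompt count are arbitrary:

```bash
# Baseline: CUDA graphs enabled (default)
vllm bench throughput --model openai/gpt-oss-120b \
  --input-len 1024 --output-len 256 --num-prompts 8

# Same workload with CUDA graphs disabled
vllm bench throughput --model openai/gpt-oss-120b \
  --input-len 1024 --output-len 256 --num-prompts 8 \
  --enforce-eager
```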
Effect of --max-cudagraph-capture-size
The NVIDIA default Blackwell recipe sets this to 32, which limits graph coverage to batch sizes up to 32. Setting it to 2048 provides full coverage from batch=1 through batch=2048 with no meaningful overhead.
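In launch terms the only difference is the flag value; a sketch assuming the flag is accepted as in the config above:

```bash
# NVIDIA Blackwell recipe default: graphs captured only for batch sizes up to 32
vllm serve openai/gpt-oss-120b --max-cudagraph-capture-size 32

# Full coverage used here: graphs captured for batch sizes up to 2048
vllm serve openai/gpt-oss-120b --max-cudagraph-capture-size 2048
```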
Inter-node network (QSFP-DD RoCE direct connect)
Measured with ib_write_bw (single QP, 4096 B MTU), RDMA/RoCE confirmed:
| Channel | Bandwidth |
|---|---|
| Channel 1 | 13.35 Gb/s |
| Channel 2 | 13.26 Gb/s |
| Combined | ~26.6 Gb/s |
Note: single QP with 4096 B MTU caps results well below the 200 Gb/s theoretical maximum; this is an artifact of the benchmark configuration rather than a link limit. A multi-QP test with 9000 B MTU should approach the full ceiling. Model weight cache sync over QSFP-DD averages ~500–580 MB/s (118 GB transferred in ~3–4 minutes).
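A follow-up measurement along those lines might look like the sketch below. The RDMA device name (mlx5_0), GID index, interface name, and peer address are placeholders; perftest's -q flag sets the number of queue pairs:

```bash
# Enable jumbo frames on the RoCE interface on both nodes (interface name is a placeholder)
sudo ip link set dev enp1s0f0 mtu 9000

# Server side
ib_write_bw -d mlx5_0 -x 3 -q 8 -s 1048576 --report_gbits -F

# Client side: point at the server's RoCE address (placeholder IP)
ib_write_bw -d mlx5_0 -x 3 -q 8 -s 1048576 --report_gbits -F 192.168.100.1
```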
Community benchmarks for comparison
| Model | Source | Config | tok/s |
|---|---|---|---|
| Qwen3-235B | This lab | GPTQ-Int4, gptq_marlin, TP=2 | 17 (b=1) / 36 agg (b=4) |
| Qwen3-235B | NVIDIA official | NVFP4, TRT-LLM | 11.73 (b=1) |
| gpt-oss-120b | This lab | mxfp4, Marlin, TP=1 | 57–60 (pure decode) |
| Qwen3-30B-A3B | Community (jleighfields) | NVFP4, vLLM | 32–45 |
| Qwen3.6-27B | Community (NVIDIA forums) | FP8, stock NGC | 14–21 |
| Qwen3.6-27B | Community (mitkox fork) | FP8, DFlash+DDTree | 136–200 |