DGX Spark Benchmark Results: vLLM on SM121
Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted.

Qwen3-235B-A22B-GPTQ-Int4: two-node cluster

Date: 2026-05-03

Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87

| Batch          | Avg completion tokens | tok/s per request | Aggregate tok/s |
|----------------|-----------------------|-------------------|-----------------|
| 1 (serial)     | 256                   | 17.0              | 17.0            |
| 2 (concurrent) | 256                   | 12.1              | 24.1            |
| 4 (concurrent) | 256                   | 9.1               | 36.4            |

Prefix cache: 97% delta hit rate on repeated system prompt.

Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile).

Weight resident per node: 57.64 GiB.

...
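The batch figures for the two-node Qwen3-235B-A22B-GPTQ-Int4 run relate to each other by simple arithmetic: aggregate throughput is roughly the batch size times the per-request rate. The sketch below recomputes that and derives the speedup over the serial case. It is only an illustration; the `ROWS` list and the `summarize` helper are hypothetical names introduced here, and the numbers are copied from the reported results rather than measured by this script.

```python
# Rows copied from the reported batch results:
# (concurrent requests, tok/s per request, reported aggregate tok/s)
ROWS = [
    (1, 17.0, 17.0),
    (2, 12.1, 24.1),
    (4, 9.1, 36.4),
]


def summarize(rows):
    """Derive scaling figures relative to the first (serial) row."""
    serial_rate = rows[0][2]
    summary = []
    for batch, per_request, aggregate in rows:
        summary.append({
            "batch": batch,
            # batch size x per-request rate; may differ from the reported
            # aggregate by a rounding step
            "recomputed_aggregate": round(batch * per_request, 1),
            "reported_aggregate": aggregate,
            # aggregate throughput gain over the serial run
            "speedup_vs_serial": round(aggregate / serial_rate, 2),
            # fraction of the serial rate each request still receives
            "per_request_retained": round(per_request / serial_rate, 2),
        })
    return summary


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(row)
```

On these figures, four concurrent requests give about 2.14x the serial aggregate throughput, while each request keeps about 54% of the serial rate. The recomputed aggregate for batch 2 (24.2) differs from the reported 24.1 by one rounding step, presumably because the per-request value was rounded before publication.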