DGX Spark Model Comparison: What Fits and What Runs (SM121, 128 GB)

Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). Based on tested configurations and community results as of May 2026.

Model	Architecture	Quantization (approx. size)	Fits in memory	Expected tok/s	SM121 notes
Qwen3.6-35B-A3B	Pure MoE (3B active)	FP8 (~35 GB)	✅ easily	100+	Pure MoE, no GDN — fully supported
Qwen3.6-27B	Dense hybrid (GDN)	FP8 (~28 GB)	✅ easily	14–21 (stock) / 136–200 (fork)	GDN kernel gap; experimental fork needed for full speed
Qwen3-30B-A3B	Pure MoE (3.3B active)	NVFP4 / FP8 / BF16 (~16–60 GB)	✅ easily	32–50	Solid single-node option; no GDN
gpt-oss-120b	Sparse MoE (5.1B active)	MXFP4 (~61 GB)	✅	32–60	128K context; ships natively in MXFP4 (OCP microscaling format)
Qwen3.5-122B-A10B	Pure MoE (10B active)	NVFP4 only (~75 GB)	✅	up to 51	BF16 is 234 GB — does not fit; NVFP4 is the only path
Qwen3-235B-A22B	Pure MoE (22B active)	GPTQ-Int4 (~60 GB/node)	✅ (two nodes)	17–36 agg	Requires two DGX Sparks; best quality available
Qwen3.5-397B-A17B	Pure MoE (17B active)	NVFP4 (TP=2)	✅ (two nodes)	Unknown	SM121 MoE kernel not yet optimized; not recommended

Key observations

Throughput vs. quality tradeoff on a single node: Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) thanks to its pure MoE architecture. Qwen3.5-122B-A10B is the most capable model that fits on one node (10B active parameters), at up to 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation, so roughly 50 tok/s is more than sufficient.
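
To make the tool-latency point concrete, here is a minimal sketch of the wall-clock time for one agent step. The tool latency and output length are illustrative assumptions, not measurements; only the two decode rates come from the table above. Prompt processing time is ignored.

```python
# Hypothetical agent step. TOOL_LATENCY_S and OUTPUT_TOKENS are assumed
# values for illustration; the decode rates are taken from the table.

def step_seconds(tool_latency_s: float, output_tokens: int, tok_per_s: float) -> float:
    """Wall-clock time for one step: tool call plus token generation."""
    return tool_latency_s + output_tokens / tok_per_s

TOOL_LATENCY_S = 5.0   # assumed, not measured
OUTPUT_TOKENS = 300    # assumed, not measured

slow = step_seconds(TOOL_LATENCY_S, OUTPUT_TOKENS, 51.0)    # Qwen3.5-122B-A10B
fast = step_seconds(TOOL_LATENCY_S, OUTPUT_TOKENS, 100.0)   # Qwen3.6-35B-A3B

print(f"{slow:.1f} s vs {fast:.1f} s per step, ratio {slow / fast:.2f}x")
```

Under these assumed numbers, nearly doubling the decode rate shortens the step by well under 2×, because the fixed tool latency dominates. With faster tools or longer outputs the balance shifts back toward generation speed.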

The GDN trap: Qwen3.6-27B looks attractive on paper: it's small (28 GB), recent, and dense. But the GDN attention kernel has a gap on SM121 that cuts it to 14–21 tok/s with the stock NGC container; the 136–200 tok/s figure in the table requires an experimental fork. Qwen3.6-35B-A3B is larger on paper but runs 5–7× faster than stock Qwen3.6-27B in practice.
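
The 5–7× figure follows directly from the table. A quick sketch of the arithmetic, taking 100 tok/s as a lower bound for Qwen3.6-35B-A3B (the table lists "100+"):

```python
# Throughput figures copied from the comparison table.
STOCK_27B_TOK_S = (14.0, 21.0)   # Qwen3.6-27B on the stock build
MOE_35B_TOK_S = 100.0            # Qwen3.6-35B-A3B, lower bound of "100+"

def speedup_range(baseline_range, faster):
    """Return (min, max) speedup of `faster` over a (low, high) baseline range."""
    low_tps, high_tps = baseline_range
    return faster / high_tps, faster / low_tps

low, high = speedup_range(STOCK_27B_TOK_S, MOE_35B_TOK_S)
print(f"{low:.1f}x to {high:.1f}x")
```

Because 100 tok/s is a floor, the real gap may be somewhat larger than this range suggests.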

NVFP4 is the only path to 122B on one node: Qwen3.5-122B-A10B at BF16 is 234 GB, which does not fit in 128 GB. NVFP4 quantization brings it to ~75 GB. No other quantization format both fits and runs correctly on SM121. See Running Qwen3.5-122B on a Single DGX Spark for setup details.
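
The fit check is simple subtraction against the 128 GB of unified memory. The sketch below uses the sizes quoted above; the `reserve_gb` parameter is a hypothetical knob for whatever you want to hold back for the OS, KV cache, and activations, and its default is not a recommendation.

```python
UNIFIED_MEMORY_GB = 128.0  # DGX Spark unified memory, from the text

def headroom_gb(weights_gb: float, reserve_gb: float = 0.0) -> float:
    """Memory left after loading weights and holding back an assumed reserve."""
    return UNIFIED_MEMORY_GB - weights_gb - reserve_gb

def fits(weights_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if the weights plus the reserve fit in unified memory."""
    return headroom_gb(weights_gb, reserve_gb) >= 0

# Sizes for Qwen3.5-122B-A10B as quoted in this article.
for label, size_gb in [("BF16", 234.0), ("NVFP4", 75.0)]:
    print(label, fits(size_gb), headroom_gb(size_gb))
```

At ~75 GB the NVFP4 weights leave about 53 GB for everything else, which is the budget the KV cache and the rest of the system have to share.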

Two-node ceiling: Qwen3-235B-A22B over a QSFP-DD direct interconnect is the highest-quality configuration available on two Sparks. Our benchmarks show 17 tok/s at batch=1 and 36 tok/s aggregate at batch=4, beating NVIDIA's own published TRT-LLM number by ~45%.
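
When reading the aggregate number, it helps to separate total throughput from what each request sees. A small sketch using only the two benchmark figures above, and assuming the aggregate is split evenly across the four concurrent requests:

```python
# Two-node Qwen3-235B-A22B figures from the paragraph above.
BATCH1_TOK_S = 17.0        # single request
BATCH4_AGG_TOK_S = 36.0    # aggregate across 4 concurrent requests
BATCH_SIZE = 4

# Assumes an even split across requests, which real schedulers only approximate.
per_stream = BATCH4_AGG_TOK_S / BATCH_SIZE
aggregate_gain = BATCH4_AGG_TOK_S / BATCH1_TOK_S

print(f"per-stream: {per_stream:.1f} tok/s, aggregate gain: {aggregate_gain:.2f}x")
```

So batching roughly doubles total throughput while each individual request decodes at about half the batch=1 rate, which matters for interactive use.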
