The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don’t. This is a practical guide based on what we’ve tested in production.
## The key constraint: SM121 kernel compatibility
Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:
- Marlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4
- CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use
- GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed
Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121.
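To make that concrete, a minimal single-node launch for a pure-MoE GPTQ-Int4 checkpoint looks roughly like the sketch below. The repo ID is a placeholder and the context length is an assumption to tune for your workload:

```bash
# Sketch: serve a pure-MoE GPTQ-Int4 checkpoint through the Marlin kernel.
# The repo ID is a placeholder; substitute the GPTQ-Int4 checkpoint you actually use.
vllm serve Qwen/Qwen3-30B-A3B-GPTQ-Int4 \
  --quantization gptq_marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --port 8000
```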
## Quantization format decision tree
```
Is the checkpoint from openai/gpt-oss-*?
├── Yes → --quantization=mxfp4 (gpt-oss pre-quantized uint8 format)
└── No  → Is there a GPTQ-Int4 checkpoint available?
    ├── Yes → --quantization=gptq_marlin  ✅ recommended
    └── No  → FP8 checkpoint available?
        ├── Yes → --quantization=fp8 (test carefully on SM121)
        └── No  → BF16 (no flag; fits only if weights < ~100 GB)
```
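The same tree as a small shell helper, if you want to pick the flag in a launch script. A sketch only: it infers the format from the repo name, which works only when names follow the usual conventions:

```bash
# Sketch of the decision tree as a shell function. Heuristic only:
# it keys off the checkpoint name, not the actual tensor format.
pick_quant_flag() {
  case "$1" in
    openai/gpt-oss-*)    echo "--quantization mxfp4" ;;       # gpt-oss pre-quantized uint8
    *[Gg][Pp][Tt][Qq]*)  echo "--quantization gptq_marlin" ;; # recommended path
    *[Ff][Pp]8*)         echo "--quantization fp8" ;;         # test carefully on SM121
    *)                   echo "" ;;                           # BF16: no flag; needs < ~100 GB weights
  esac
}

pick_quant_flag "openai/gpt-oss-120b"   # prints: --quantization mxfp4
```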
**Do not use** `--quantization=mxfp4` on standard HuggingFace BF16 checkpoints. The NGC 26.04 mxfp4 weight loader only handles gpt-oss’s proprietary 3D uint8 tensor format; BF16 HF checkpoints will crash with an `IndexError` in `fused_moe/layer.py`.
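The one place mxfp4 is correct is the gpt-oss checkpoints themselves, e.g.:

```bash
# mxfp4 is only valid for the pre-quantized gpt-oss checkpoints.
vllm serve openai/gpt-oss-120b \
  --quantization mxfp4 \
  --gpu-memory-utilization 0.90
```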
## Model comparison for SM121 (single node, 128 GB)
| Model | Architecture | Quantization | Fits in 128 GB? | Expected tok/s | Notes |
|---|---|---|---|---|---|
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 | ✅ ~61 GB | 32–60 | 128K context; proprietary quant |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 (TP=2) | ✅ ~60 GB/node | 17–36 | Two nodes required (launch sketch below the table); best quality |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 | ✅ ~16–60 GB | 32–50 | Solid single-node option; no GDN |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 (~28 GB) | ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; fork needed for full speed |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 (~75 GB) | ✅ single node | up to 51 | Requires NVFP4 checkpoint + SM121 patches |
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 (~35 GB) | ✅ easily | 100+ | Pure MoE, no GDN; successor to Qwen3-30B-A3B |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 | ✅ (TP=2) | Unknown | Not yet recommended — SM121 MoE kernel not optimized |
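For the two-node Qwen3-235B-A22B row, the standard vLLM multi-node pattern applies: form a Ray cluster across both Sparks, then launch once with TP=2. A sketch, assuming a placeholder head-node address and checkpoint ID (the eugr/spark-vllm-docker patches listed under resources below address Ray GPU detection on SM121):

```bash
# On the head Spark (10.0.0.1 is a placeholder address):
ray start --head --port=6379

# On the second Spark:
ray start --address=10.0.0.1:6379

# Back on the head node: tensor-parallel across both GPUs via Ray.
# The repo ID is a placeholder for the GPTQ-Int4 checkpoint you use.
vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --quantization gptq_marlin \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.90
```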
## Architectures to prefer vs avoid
**Prefer: pure MoE.** Models using only standard MoE layers (no GDN, no Mamba) run fully through the Marlin kernel and are the most reliable choice on SM121. Examples: Qwen3-235B-A22B, Qwen3-30B-A3B, gpt-oss-120b, Mixtral variants.
**Avoid with stock NGC: GDN hybrids.** Models with GatedDeltaNet (GDN) linear-attention layers hit a kernel gap on SM121; stock NGC tops out at 14–21 tok/s. If you need full speed from Qwen3.6-27B (~136–200 tok/s), the mitkox/vllm-dflash-ddtree experimental fork adds DFlash + DDTree speculative decoding for GDN, but it is not yet production-stable.
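If you're unsure whether a checkpoint hides GDN or Mamba layers, a crude check is to grep its config.json before pulling the full weights. A heuristic sketch only: the marker strings vary by model family, so treat a miss as "inspect manually", not "safe":

```bash
# Heuristic: scan the model config for linear-attention layer markers.
# Absence of a match is NOT proof the model is pure MoE.
curl -s https://huggingface.co/Qwen/Qwen3.6-27B/raw/main/config.json \
  | grep -iE 'gated_delta|gdn|mamba|linear_attention' \
  && echo "likely GDN/Mamba hybrid: expect the SM121 kernel gap" \
  || echo "no hybrid markers found: inspect layer types manually"
```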
## Qwen3.6 model family
Released April 2026. Two architecturally very different open-weight variants:
| | Qwen3.6-27B | Qwen3.6-35B-A3B |
|---|---|---|
| Architecture | Dense hybrid (GDN) | Pure MoE |
| Active params per token | 27B (all) | ~3B |
| FP8 weight size | ~28 GB | ~35 GB |
| tok/s on DGX Spark | 14–21 (stock) / 136–200 (fork) | 100+ |
| GDN kernel gap | Yes | No |
| SM121 stock NGC | Underperforms | ✅ Fully supported |
No Qwen3.6-72B exists. As of May 2026, Qwen3.6 tops out at 27B dense and 35B-A3B MoE. For a 70B+ class model on a single DGX Spark, the current best option is Qwen3.5-122B-A10B NVFP4 (10B active, 51 tok/s confirmed).
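For the 35B-A3B variant, a single-node FP8 launch is the simple case (a sketch: the repo ID is a placeholder and assumes an FP8 checkpoint has been published):

```bash
# Sketch: single-node FP8 serve of the pure-MoE Qwen3.6 variant.
# Repo ID is a placeholder; point it at the FP8 checkpoint you actually have.
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --quantization fp8 \
  --gpu-memory-utilization 0.90
```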
## SM121 hard rules
```bash
# Never use --enforce-eager — disables CUDA graphs, ~55% throughput loss
# Never set --gpu-memory-utilization above 0.90 — OOM on SM121
# MoE backend must be marlin — default produces garbage tokens on SM121
# Stay on driver 580.x — 590.x has a regression on this chip
# Never use CUTLASS FP4 — silent garbage output
```
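These rules are easy to codify as a preflight wrapper around your launch script. A sketch covering the two mechanically checkable rules (driver version and --enforce-eager); the memory-utilization and MoE-backend rules are left to the caller:

```bash
#!/usr/bin/env bash
# Preflight sketch enforcing the SM121 hard rules above.
set -euo pipefail

# Rule: stay on driver 580.x (590.x regresses on this chip).
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
case "$driver" in
  580.*) echo "driver $driver: ok" ;;
  *)     echo "driver $driver: expected 580.x on SM121" >&2; exit 1 ;;
esac

# Rule: never pass --enforce-eager (disables CUDA graphs, ~55% throughput loss).
if printf '%s\n' "$@" | grep -q -- '--enforce-eager'; then
  echo "refusing --enforce-eager on SM121" >&2
  exit 1
fi

exec vllm serve "$@"
```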
## Checkpoints to avoid
| Checkpoint | Why |
|---|---|
| `nvidia/Qwen3-235B-A22B-NVFP4` | vLLM parsing bug #22906; TRT-LLM only |
| Any BF16 HF model with `--quantization=mxfp4` | mxfp4 loader only handles gpt-oss uint8 format |
| FP8 model with FlashInfer | FlashInfer crashes on SM121; use TRITON_ATTN |
| Any model requiring `--enforce-eager` | 55% throughput loss |
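The FlashInfer row has a one-line workaround: pin the attention backend through the environment. The model ID below is a placeholder, and the exact backend name (taken from the table above) may differ across vLLM versions:

```bash
# Workaround for the FlashInfer crash on SM121: pin the Triton attention backend.
VLLM_ATTENTION_BACKEND=TRITON_ATTN \
  vllm serve your-org/your-fp8-model --quantization fp8
```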
## Useful community resources
- eugr/spark-vllm-docker — Ray GPU resource fix + SM121 patches
- jleighfields/vllm-dgx-spark — Qwen3-Coder-30B-A3B confirmed on DGX Spark
- NVIDIA “Stacked Sparks” guide — build.nvidia.com/spark/vllm/stacked-sparks
- NVIDIA DGX Spark Playbooks (DeepWiki) — Ray cluster, NCCL config, UMA tuning