The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don’t. This is a practical guide based on what we’ve tested in production.

The key constraint: SM121 kernel compatibility

Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:

  • Marlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4
  • CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use
  • GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed

Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121.

Quantization format decision tree

Is the checkpoint from openai/gpt-oss-*?
├── Yes → --quantization=mxfp4  (gpt-oss pre-quantized uint8 format)
└── No  → Is there a GPTQ-Int4 checkpoint available?
           ├── Yes → --quantization=gptq_marlin  ✅ recommended
           └── No  → FP8 checkpoint available?
                      ├── Yes → --quantization=fp8 (test carefully on SM121)
                      └── No  → BF16 (no flag; fits only if weights < ~100 GB)

Do not use --quantization=mxfp4 on standard HuggingFace BF16 checkpoints. The NGC 26.04 mxfp4 weight loader only handles gpt-oss’s proprietary 3D uint8 tensor format. BF16 HF checkpoints will crash with IndexError in fused_moe/layer.py.
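
For the common Marlin path, a launch looks like the minimal sketch below. The checkpoint ID is a placeholder for whatever GPTQ-Int4 repo you actually use; the flags are standard vLLM options.

```bash
# Minimal GPTQ-Int4 launch on SM121 via the Marlin kernel.
# The model ID is a placeholder; substitute your GPTQ-Int4 checkpoint.
vllm serve your-org/your-model-GPTQ-Int4 \
  --quantization=gptq_marlin \
  --max-model-len=32768 \
  --gpu-memory-utilization=0.90   # hard ceiling on SM121; see rules below
```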

Model comparison for SM121 (128 GB unified memory per node)

| Model | Architecture | Quantization | Fits in 128 GB | Expected tok/s | Notes |
|---|---|---|---|---|---|
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 | ✅ ~61 GB | 32–60 | 128K context; proprietary quant |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 (TP=2) | ✅ ~60 GB/node | 17–36 | Two nodes required; best quality |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 | ✅ ~16–60 GB | 32–50 | Solid single-node option; no GDN |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 (~28 GB) | ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; fork needed for full speed |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 (~75 GB) | ✅ single node | up to 51 | Requires NVFP4 checkpoint + SM121 patches |
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 (~35 GB) | ✅ easily | 100+ | Pure MoE, no GDN; successor to Qwen3-30B-A3B |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 | ✅ (TP=2) | Unknown | Not yet recommended; SM121 MoE kernel not optimized |
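
The "fits" figures follow from simple arithmetic: bytes per parameter times parameter count, plus headroom for KV cache and activations. A quick sanity check, using Qwen3-235B-A22B at Int4 split across two nodes:

```bash
# Back-of-the-envelope weight footprint: params x bytes/param.
# Int4 ~ 0.5 B/param, FP8 ~ 1 B/param, BF16 ~ 2 B/param.
# Ignores KV cache and activations, so leave headroom under 128 GB.
python3 -c "print(235e9 * 0.5 / 2 / 1e9, 'GB per node at TP=2')"
# -> 58.75 GB per node, consistent with the ~60 GB/node figure in the table
```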

Architectures to prefer vs avoid

Prefer: pure MoE — models using only standard MoE layers (no GDN, no Mamba) run fully through the Marlin kernel and are the most reliable choice on SM121. Examples: Qwen3-235B-A22B, Qwen3-30B-A3B, gpt-oss-120b, Mixtral variants.
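
A quick way to screen a checkpoint before downloading is to grep its config.json for hybrid-layer markers. Treat this as a heuristic, not a guarantee: the field names vary by model family, and the patterns below are the strings seen in Qwen-style hybrid configs; the model ID is a placeholder.

```bash
# Heuristic hybrid-architecture check against a HF checkpoint's config.json.
# Model ID is a placeholder; grep patterns are family-dependent.
curl -s https://huggingface.co/Qwen/Qwen3-30B-A3B/raw/main/config.json |
  grep -Eiq 'linear_attention|gated_delta|mamba' &&
  echo "hybrid layers found: expect the SM121 GDN/Mamba kernel gap" ||
  echo "no hybrid markers: likely pure attention/MoE"
```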

Avoid with stock NGC: GDN hybrid architectures — models with GatedDeltaNet (GDN) linear attention layers hit a kernel gap on SM121. Stock NGC produces 14–21 tok/s. If you need full speed from Qwen3.6-27B (~136–200 tok/s), the mitkox/vllm-dflash-ddtree experimental fork adds DFlash + DDTree speculative decoding for GDN, but it’s not yet production-stable.

Qwen3.6 model family

Released April 2026. The family comprises two architecturally very different open-weight variants:

| | Qwen3.6-27B | Qwen3.6-35B-A3B |
|---|---|---|
| Architecture | Dense hybrid (GDN) | Pure MoE |
| Active params per token | 27B (all) | ~3B |
| FP8 weight size | ~28 GB | ~35 GB |
| tok/s on DGX Spark | 14–21 (stock) / 136–200 (fork) | 100+ |
| GDN kernel gap | Yes | No |
| SM121 stock NGC | Underperforms | ✅ Fully supported |

No Qwen3.6-72B exists. As of May 2026, Qwen3.6 tops out at 27B dense and 35B-A3B MoE. For a 70B+ class model on a single DGX Spark, the current best option is Qwen3.5-122B-A10B NVFP4 (10B active, 51 tok/s confirmed).

SM121 hard rules

# Never use --enforce-eager — disables CUDA graphs, ~55% throughput loss
# Never set --gpu-memory-utilization above 0.90 — OOM on SM121
# MoE backend must be marlin — default produces garbage tokens on SM121
# Stay on driver 580.x — 590.x has a regression on this chip
# Never use CUTLASS FP4 — silent garbage output
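
Putting the rules together, an SM121-safe gpt-oss launch looks roughly like the sketch below. CUDA graphs stay enabled because --enforce-eager is simply omitted, and VLLM_ATTENTION_BACKEND is vLLM's standard backend override, set to the TRITON_ATTN value this guide recommends over FlashInfer.

```bash
# SM121-safe launch sketch following the hard rules above:
# no --enforce-eager (CUDA graphs stay on), memory fraction capped at 0.90,
# Triton attention instead of FlashInfer.
VLLM_ATTENTION_BACKEND=TRITON_ATTN \
vllm serve openai/gpt-oss-120b \
  --quantization=mxfp4 \
  --gpu-memory-utilization=0.90
```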

Checkpoints to avoid

| Checkpoint | Why |
|---|---|
| nvidia/Qwen3-235B-A22B-NVFP4 | vLLM parsing bug #22906; TRT-LLM only |
| Any BF16 HF model with --quantization=mxfp4 | mxfp4 loader only handles gpt-oss uint8 format |
| FP8 model with FlashInfer | FlashInfer crashes on SM121; use TRITON_ATTN |
| Any model requiring --enforce-eager | ~55% throughput loss |

Useful community resources

  • eugr/spark-vllm-docker — Ray GPU resource fix + SM121 patches
  • jleighfields/vllm-dgx-spark — Qwen3-Coder-30B-A3B confirmed on DGX Spark
  • NVIDIA “Stacked Sparks” guide — build.nvidia.com/spark/vllm/stacked-sparks
  • NVIDIA DGX Spark Playbooks (DeepWiki) — Ray cluster, NCCL config, UMA tuning