The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don’t. This is a practical guide based on what we’ve tested in production.
## The key constraint: SM121 kernel compatibility
Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:
- Marlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4
- CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use
- GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed
Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121.
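To make that concrete, a minimal single-node launch for a pure-MoE GPTQ-Int4 checkpoint looks roughly like the sketch below. The repo ID is a placeholder and the context length is an assumption to tune for your workload:

```bash
# Sketch: serve a pure-MoE GPTQ-Int4 checkpoint through the Marlin kernel.
# The repo ID is a placeholder; substitute the GPTQ-Int4 checkpoint you actually use.
vllm serve Qwen/Qwen3-30B-A3B-GPTQ-Int4 \
  --quantization gptq_marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --port 8000
```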
## Quantization format decision tree
```
Is the checkpoint from openai/gpt-oss-*?
├── Yes → --quantization=mxfp4 (gpt-oss pre-quantized uint8 format)
└── No  → Is there a GPTQ-Int4 checkpoint available?
    ├── Yes → --quantization=gptq_marlin  ✅ recommended
    └── No  → FP8 checkpoint available?
        ├── Yes → --quantization=fp8 (test carefully on SM121)
        └── No  → BF16 (no flag; fits only if weights < ~100 GB)
```
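The same tree as a small shell helper, if you want to pick the flag in a launch script. A sketch only: it infers the format from the repo name, which works only when names follow the usual conventions:

```bash
# Sketch of the decision tree as a shell function. Heuristic only:
# it keys off the checkpoint name, not the actual tensor format.
pick_quant_flag() {
  case "$1" in
    openai/gpt-oss-*)    echo "--quantization mxfp4" ;;       # gpt-oss pre-quantized uint8
    *[Gg][Pp][Tt][Qq]*)  echo "--quantization gptq_marlin" ;; # recommended path
    *[Ff][Pp]8*)         echo "--quantization fp8" ;;         # test carefully on SM121
    *)                   echo "" ;;                           # BF16: no flag; needs < ~100 GB weights
  esac
}

pick_quant_flag "openai/gpt-oss-120b"   # prints: --quantization mxfp4
```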
**Do not use** `--quantization=mxfp4` on standard HuggingFace BF16 checkpoints. The NGC 26.04 mxfp4 weight loader only handles gpt-oss’s proprietary 3D uint8 tensor format; BF16 HF checkpoints will crash with an `IndexError` in `fused_moe/layer.py`.
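The one place mxfp4 is correct is the gpt-oss checkpoints themselves, e.g.:

```bash
# mxfp4 is only valid for the pre-quantized gpt-oss checkpoints.
vllm serve openai/gpt-oss-120b \
  --quantization mxfp4 \
  --gpu-memory-utilization 0.90
```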
## Model comparison for SM121 (single node, 128 GB)
| Model | Architecture | Quantization | Fits in 128 GB? | Expected tok/s | Notes |
|---|---|---|---|---|---|
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 | ✅ ~61 GB | 32–60 | 128K context; proprietary quant |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 (TP=2) | ✅ ~60 GB/node | 17–36 | Two nodes required (launch sketch below the table); best quality |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 | ✅ ~16–60 GB | 32–50 | Solid single-node option; no GDN |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 (~28 GB) | ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; fork needed for full speed |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 (~75 GB) | ✅ single node | up to 51 | Requires NVFP4 checkpoint + SM121 patches |
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 (~35 GB) | ✅ easily | 100+ | Pure MoE, no GDN; successor to Qwen3-30B-A3B |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 | ✅ (TP=2) | Unknown | Not yet recommended — SM121 MoE kernel not optimized |
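For the two-node Qwen3-235B-A22B row, the standard vLLM multi-node pattern applies: form a Ray cluster across both Sparks, then launch once with TP=2. A sketch, assuming a placeholder head-node address and checkpoint ID (the eugr/spark-vllm-docker patches listed under resources below address Ray GPU detection on SM121):

```bash
# On the head Spark (10.0.0.1 is a placeholder address):
ray start --head --port=6379

# On the second Spark:
ray start --address=10.0.0.1:6379

# Back on the head node: tensor-parallel across both GPUs via Ray.
# The repo ID is a placeholder for the GPTQ-Int4 checkpoint you use.
vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --quantization gptq_marlin \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.90
```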
## Architectures to prefer vs avoid
**Prefer: pure MoE.** Models using only standard MoE layers (no GDN, no Mamba) run fully through the Marlin kernel and are the most reliable choice on SM121. Examples: Qwen3-235B-A22B, Qwen3-30B-A3B, gpt-oss-120b, Mixtral variants.
**Avoid with stock NGC: GDN hybrids.** Models with GatedDeltaNet (GDN) linear-attention layers hit a kernel gap on SM121; stock NGC tops out at 14–21 tok/s. If you need full speed from Qwen3.6-27B (~136–200 tok/s), the mitkox/vllm-dflash-ddtree experimental fork adds DFlash + DDTree speculative decoding for GDN, but it is not yet production-stable.
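If you're unsure whether a checkpoint hides GDN or Mamba layers, a crude check is to grep its config.json before pulling the full weights. A heuristic sketch only: the marker strings vary by model family, so treat a miss as "inspect manually", not "safe":

```bash
# Heuristic: scan the model config for linear-attention layer markers.
# Absence of a match is NOT proof the model is pure MoE.
curl -s https://huggingface.co/Qwen/Qwen3.6-27B/raw/main/config.json \
  | grep -iE 'gated_delta|gdn|mamba|linear_attention' \
  && echo "likely GDN/Mamba hybrid: expect the SM121 kernel gap" \
  || echo "no hybrid markers found: inspect layer types manually"
```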
## Qwen3.6 model family
Released April 2026. Two architecturally very different open-weight variants:
| | Qwen3.6-27B | Qwen3.6-35B-A3B |
|---|---|---|
| Architecture | Dense hybrid (GDN) | Pure MoE |
| Active params per token | 27B (all) | ~3B |
| FP8 weight size | ~28 GB | ~35 GB |
| tok/s on DGX Spark | 14–21 (stock) / 136–200 (fork) | 100+ |
| GDN kernel gap | Yes | No |
| SM121 stock NGC | Underperforms | ✅ Fully supported |
No Qwen3.6-72B exists. As of May 2026, Qwen3.6 tops out at 27B dense and 35B-A3B MoE. For a 70B+ class model on a single DGX Spark, the current best option is Qwen3.5-122B-A10B NVFP4 (10B active, 51 tok/s confirmed).
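For the 35B-A3B variant, a single-node FP8 launch is the simple case (a sketch: the repo ID is a placeholder and assumes an FP8 checkpoint has been published):

```bash
# Sketch: single-node FP8 serve of the pure-MoE Qwen3.6 variant.
# Repo ID is a placeholder; point it at the FP8 checkpoint you actually have.
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --quantization fp8 \
  --gpu-memory-utilization 0.90
```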
## SM121 hard rules
```bash
# Never use --enforce-eager — disables CUDA graphs, ~55% throughput loss
# Never set --gpu-memory-utilization above 0.90 — OOM on SM121
# MoE backend must be marlin — default produces garbage tokens on SM121
# Stay on driver 580.x — 590.x has a regression on this chip
# Never use CUTLASS FP4 — silent garbage output
```
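These rules are easy to codify as a preflight wrapper around your launch script. A sketch covering the two mechanically checkable rules (driver version and --enforce-eager); the memory-utilization and MoE-backend rules are left to the caller:

```bash
#!/usr/bin/env bash
# Preflight sketch enforcing the SM121 hard rules above.
set -euo pipefail

# Rule: stay on driver 580.x (590.x regresses on this chip).
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
case "$driver" in
  580.*) echo "driver $driver: ok" ;;
  *)     echo "driver $driver: expected 580.x on SM121" >&2; exit 1 ;;
esac

# Rule: never pass --enforce-eager (disables CUDA graphs, ~55% throughput loss).
if printf '%s\n' "$@" | grep -q -- '--enforce-eager'; then
  echo "refusing --enforce-eager on SM121" >&2
  exit 1
fi

exec vllm serve "$@"
```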
## Checkpoints to avoid
| Checkpoint | Why |
|---|---|
| `nvidia/Qwen3-235B-A22B-NVFP4` | vLLM parsing bug #22906; TRT-LLM only |
| Any BF16 HF model with `--quantization=mxfp4` | mxfp4 loader only handles gpt-oss uint8 format |
| FP8 model with FlashInfer | FlashInfer crashes on SM121; use TRITON_ATTN |
| Any model requiring `--enforce-eager` | 55% throughput loss |
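The FlashInfer row has a one-line workaround: pin the attention backend through the environment. The model ID below is a placeholder, and the exact backend name (taken from the table above) may differ across vLLM versions:

```bash
# Workaround for the FlashInfer crash on SM121: pin the Triton attention backend.
VLLM_ATTENTION_BACKEND=TRITON_ATTN \
  vllm serve your-org/your-fp8-model --quantization fp8
```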
## Useful community resources
- eugr/spark-vllm-docker — Ray GPU resource fix + SM121 patches
- jleighfields/vllm-dgx-spark — Qwen3-Coder-30B-A3B confirmed on DGX Spark
- NVIDIA “Stacked Sparks” guide — build.nvidia.com/spark/vllm/stacked-sparks
- NVIDIA DGX Spark Playbooks (DeepWiki) — Ray cluster, NCCL config, UMA tuning