Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). Based on tested configurations and community results as of May 2026.
| Model | Architecture | Quantization (approx. size) | Fits in memory | Expected tok/s | SM121 notes |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 (~35 GB) | ✅ easily | 100+ | Pure MoE, no GDN — fully supported |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 (~28 GB) | ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; experimental fork needed for full speed |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 (~16–60 GB) | ✅ easily | 32–50 | Solid single-node option; no GDN |
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 (~61 GB) | ✅ | 32–60 | 128K context; uses the OCP Microscaling MXFP4 quant format |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 only (~75 GB) | ✅ | up to 51 | BF16 is 234 GB — does not fit; NVFP4 is the only path |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 (~60 GB/node) | ✅ (two nodes) | 17–36 agg | Requires two DGX Sparks; best quality available |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 (TP=2) | ✅ (two nodes) | Unknown | SM121 MoE kernel not yet optimized; not recommended |
Key observations
Throughput vs. quality tradeoff on a single node: Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) with a pure MoE architecture. Qwen3.5-122B-A10B is the most capable model (10B active parameters) that fits on one node, at up to 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation, so ~51 tok/s is usually more than sufficient.
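To make the tool-latency argument concrete, here is a rough back-of-the-envelope sketch. The workload numbers (200 generated tokens per step, a 10 s tool call) are hypothetical and chosen only for illustration. The two throughput figures come from the table, and prefill time is ignored.

```python
def step_seconds(output_tokens: int, tok_per_s: float, tool_latency_s: float) -> float:
    """Wall-clock time for one agent step: generate tokens, then wait on a tool call."""
    return output_tokens / tok_per_s + tool_latency_s


# Hypothetical workload (assumptions, not measurements).
OUTPUT_TOKENS = 200
TOOL_LATENCY_S = 10.0

slow = step_seconds(OUTPUT_TOKENS, 51.0, TOOL_LATENCY_S)   # Qwen3.5-122B-A10B, "up to 51"
fast = step_seconds(OUTPUT_TOKENS, 100.0, TOOL_LATENCY_S)  # Qwen3.6-35B-A3B, "100+"

print(f"51 tok/s:  {slow:.2f} s per step")
print(f"100 tok/s: {fast:.2f} s per step")
print(f"end-to-end speedup: {slow / fast:.2f}x")
```

Under these assumed numbers, nearly doubling generation speed shortens each step by only about 14%. The shorter the tool calls and the longer the generations in your own workload, the more the faster model pays off.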
The GDN trap: Qwen3.6-27B looks attractive on paper — it’s small (28 GB), recent, and dense. But the GDN attention kernel has a gap on SM121 that cuts it to 14–21 tok/s with stock NGC. Qwen3.6-35B-A3B is larger on paper but runs 5–7× faster in practice.
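The 5–7× figure can be checked directly against the table. This is a minimal sketch that treats the "100+" entry as exactly 100 tok/s, which makes the result a lower bound.

```python
# Throughput figures taken from the comparison table (tok/s).
MOE_35B_TOK_S = 100.0                  # Qwen3.6-35B-A3B, listed as "100+" (lower bound)
DENSE_27B_STOCK_TOK_S = (14.0, 21.0)   # Qwen3.6-27B on stock NGC: (low, high)

# Smallest speedup: compare against the dense model's best case.
low_speedup = MOE_35B_TOK_S / DENSE_27B_STOCK_TOK_S[1]
# Largest speedup: compare against the dense model's worst case.
high_speedup = MOE_35B_TOK_S / DENSE_27B_STOCK_TOK_S[0]

print(f"speedup range: {low_speedup:.1f}x to {high_speedup:.1f}x")
```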
NVFP4 is the only path to 122B on one node: Qwen3.5-122B-A10B at BF16 is 234 GB — it doesn’t fit. NVFP4 quantization brings it to ~75 GB. There is no other quantization format that both fits and runs correctly on SM121. See Running Qwen3.5-122B on a Single DGX Spark for setup details.
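The fit arithmetic can be sketched as follows, using only the sizes quoted in this guide. The helper names are made up for illustration. The check looks at weights alone and ignores KV cache, activations, and the OS, all of which share the same unified memory.

```python
NODE_MEMORY_GB = 128.0   # DGX Spark unified memory
BF16_GB = 234.0          # Qwen3.5-122B-A10B at BF16 (figure from this guide)
NVFP4_GB = 75.0          # same model at NVFP4 (approximate, from this guide)


def fits_on_node(weights_gb: float, node_gb: float = NODE_MEMORY_GB) -> bool:
    """True if the weights alone are smaller than node memory."""
    return weights_gb < node_gb


def effective_bits_per_weight(quant_gb: float, bf16_gb: float) -> float:
    """Average bits per weight implied by the size ratio (BF16 is 16 bits)."""
    return 16.0 * quant_gb / bf16_gb


headroom_gb = NODE_MEMORY_GB - NVFP4_GB
bits = effective_bits_per_weight(NVFP4_GB, BF16_GB)

print(f"BF16 fits:  {fits_on_node(BF16_GB)}")
print(f"NVFP4 fits: {fits_on_node(NVFP4_GB)} ({headroom_gb:.0f} GB left for everything else)")
print(f"implied bits per weight at NVFP4: {bits:.2f}")
```

The implied average comes out a little above 5 bits per weight, not the nominal 4. Block scale factors account for part of that. The rest is presumably layers kept at higher precision, which is a guess and not something measured here.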
Two-node ceiling: Qwen3-235B-A22B over a QSFP-DD direct interconnect is the highest-quality configuration available on two Sparks. Our benchmarks show 17 tok/s at batch=1 and 36 tok/s aggregate at batch=4, beating NVIDIA’s own published TRT-LLM number by ~45%.