Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). Based on tested configurations and community results as of May 2026.

| Model | Architecture | Quantization | Fits in 128 GB? | Expected tok/s | SM121 notes |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 (~35 GB) | ✅ easily | 100+ | Pure MoE, no GDN; fully supported |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 (~28 GB) | ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; experimental fork needed for full speed |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 (~16–60 GB) | ✅ easily | 32–50 | Solid single-node option; no GDN |
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 (~61 GB) | ✅ | 32–60 | 128K context; proprietary quant format |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 only (~75 GB) | ✅ | up to 51 | BF16 is 234 GB and does not fit; NVFP4 is the only path |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 (~60 GB/node) | ✅ (two nodes) | 17–36 agg. | Requires two DGX Sparks; best quality available |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 (TP=2) | ✅ (two nodes) | Unknown | SM121 MoE kernel not yet optimized; not recommended |

Key observations

Throughput vs. quality tradeoff on a single node: Qwen3.6-35B-A3B delivers the highest throughput (100+ tok/s) thanks to its pure-MoE architecture. Qwen3.5-122B-A10B is the most capable model that fits on one node (10B active parameters), at up to 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation, so 51 tok/s is more than sufficient.
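To see why 51 tok/s is usually enough, a back-of-envelope latency budget helps. The per-turn token count and tool latencies below are illustrative assumptions, not measurements; only the tok/s figures come from the table above.

```python
# Back-of-envelope check: does decode speed or tool latency dominate an agent turn?
# Token counts and tool latencies are assumed for illustration.

def turn_latency_s(output_tokens: int, tok_per_s: float,
                   tool_calls: int, tool_latency_s: float) -> float:
    """Wall-clock time for one agent turn: generation time plus tool round-trips."""
    return output_tokens / tok_per_s + tool_calls * tool_latency_s

# A hypothetical turn: ~400 output tokens, 3 tool calls at ~2 s each.
slow = turn_latency_s(400, 51.0, 3, 2.0)    # Qwen3.5-122B-A10B at 51 tok/s
fast = turn_latency_s(400, 100.0, 3, 2.0)   # Qwen3.6-35B-A3B at 100+ tok/s

print(f"51 tok/s:  {slow:.1f} s/turn")   # ~13.8 s
print(f"100 tok/s: {fast:.1f} s/turn")   # ~10.0 s
```

Under these assumptions, doubling decode speed shaves under 4 s off a ~14 s turn; the fixed tool-call cost dominates either way.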

The GDN trap: Qwen3.6-27B looks attractive on paper: it's small (28 GB), recent, and dense. But the GDN attention kernel has a gap on SM121 that cuts it to 14–21 tok/s with the stock NGC container. Qwen3.6-35B-A3B is larger on paper but runs 5–7× faster in practice.

NVFP4 is the only path to 122B on one node: Qwen3.5-122B-A10B at BF16 is 234 GB — it doesn’t fit. NVFP4 quantization brings it to ~75 GB. There is no other quantization format that both fits and runs correctly on SM121. See Running Qwen3.5-122B on a Single DGX Spark for setup details.
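The memory math can be sketched as bits-per-parameter arithmetic. The effective bits for NVFP4 (4-bit values plus block-scale overhead) are an assumption here, and a naive parameter-count estimate will not exactly match the 234 GB / ~75 GB figures above, which include real-deployment overheads:

```python
# Rough weight-memory estimate by format: params * effective bits / 8.
# Effective bits-per-parameter values are assumptions for illustration.
PARAMS = 122e9  # Qwen3.5-122B-A10B total parameter count (approximate)

def weight_gb(params: float, bits_per_param: float) -> float:
    """Weight footprint in GB (decimal) for a given effective bit width."""
    return params * bits_per_param / 8 / 1e9

print(f"BF16:  {weight_gb(PARAMS, 16):.0f} GB")   # ~244 GB: far over 128 GB unified memory
print(f"FP8:   {weight_gb(PARAMS, 8):.0f} GB")    # ~122 GB: no headroom left for KV cache
print(f"NVFP4: {weight_gb(PARAMS, 4.5):.0f} GB")  # ~69 GB: consistent with the ~75 GB observed
```

The FP8 row shows why 4-bit is the only workable path: even 8-bit weights alone would consume nearly the entire 128 GB before any KV cache or activations.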

Two-node ceiling: Qwen3-235B-A22B over a QSFP-DD direct interconnect is the highest-quality configuration available on two Sparks. Our benchmarks show 17 tok/s at batch=1 and 36 tok/s aggregate at batch=4, beating NVIDIA's own published TRT-LLM number by ~45%.
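The aggregate number is worth unpacking, since per-stream speed drops as batch size grows. A quick sketch from the two figures above:

```python
# Per-stream throughput implied by the batch=4 aggregate, and batching
# efficiency relative to an ideal 4x scale-up of the batch=1 rate.
single_stream = 17.0   # tok/s at batch=1 (measured above)
aggregate_b4 = 36.0    # tok/s aggregate at batch=4 (measured above)

per_stream_b4 = aggregate_b4 / 4
efficiency = aggregate_b4 / (single_stream * 4)

print(f"per-stream at batch=4: {per_stream_b4:.1f} tok/s")     # 9.0 tok/s
print(f"batching efficiency vs. ideal 4x: {efficiency:.0%}")   # 53%
```

So batch=4 roughly doubles total throughput while nearly halving each stream's speed, which is a reasonable trade when serving several concurrent agents.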