The NVIDIA DGX Spark (GB10 SoC, 128GB unified LPDDR5X memory) can run Qwen3.5-122B-A10B — a 122B parameter MoE model — at usable throughput for production workloads. Here’s what it actually takes.
The key constraint: NVFP4 only
Qwen3.5-122B-A10B at full precision is ~250GB (122B parameters at BF16, two bytes each, is ~244GB before activations). In NVFP4 quantization it's ~75GB: 4-bit weights alone are ~61GB, with the remainder presumably quantization scales plus layers kept at higher precision. That fits comfortably in 128GB of unified memory. There is no other quantization path that both fits and runs correctly on the GB10.
The only verified checkpoint we’ve found: bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 on HuggingFace, which includes 15 patches for the SM121 architecture. Use this; don’t try to quantize the base model yourself unless you’re prepared to debug SM121-specific kernel failures.
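A minimal way to pull that checkpoint, assuming the standard huggingface-cli tooling (the local directory is whatever you'll later mount into the container):

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 \
  --local-dir /path/to/model
```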
SM121 vs SM120
The DGX Spark GB10 is SM121 (compute capability 12.1). Most documentation, NVIDIA's included, targets SM120, the consumer Blackwell architecture (RTX 50-series). They're close but not identical: several vLLM optimizations that work on SM120 either fail silently or crash on SM121.
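You can confirm what your unit reports via nvidia-smi (the compute_cap query field requires a reasonably recent driver); a GB10 should come back as 12.1:

```bash
# Print GPU name and compute capability; expect "... , 12.1" on GB10
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```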
Hard rules for SM121:

- Never add `--enforce-eager`. It disables CUDA graphs and tanks throughput.
- Never set `--gpu-memory-utilization` above 0.90. SM121 OOMs beyond that.
- The MoE backend must be `marlin`. The default backend produces garbage tokens on SM121.
- Stay on NVIDIA driver 580.x. 590.x has a regression on this chip (see the check after the flags below).
In vLLM flag form (excerpted from the full command below):

```bash
--moe-backend marlin \
--gpu-memory-utilization 0.85 \
--max-model-len 8192
```
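A cheap pre-launch guard for the driver rule, sketched with standard nvidia-smi queries:

```bash
# Expect a 580.x driver on GB10; 590.x has a known regression here
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
case "$driver" in
  580.*) echo "Driver $driver: OK" ;;
  *)     echo "Driver $driver: untested on GB10, stay on 580.x" >&2; exit 1 ;;
esac
```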
Docker run command
```bash
docker run --rm -it \
  --gpus all \
  --shm-size=16g \
  -v /path/to/model:/model \
  -p 8000:8000 \
  bjk110/vllm-spark:latest \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --served-model-name qwen3.5-122b \
    --dtype auto \
    --moe-backend marlin \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --port 8000
```
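Once it's up, the server speaks the standard OpenAI-compatible API. A quick smoke test (the model name matches `--served-model-name` above):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5-122b",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```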
Performance
On a single DGX Spark (no clustering):
| Metric | Result |
|---|---|
| Throughput | ~51 tok/s (generation, single stream) |
| TTFT | ~2–4s (8K context) |
| Memory at load | ~92GB / 128GB |
| Stable uptime | Yes — runs indefinitely, no OOM |
51 tok/s is fast enough for agentic workloads where the bottleneck is tool calls, not token generation.
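To reproduce the single-stream number, a crude but honest measurement is one long completion timed end to end. This assumes the server above; `ignore_eos` is a vLLM extension that prevents early stopping, and the timing includes TTFT, so it slightly understates pure generation speed:

```bash
# Crude single-stream throughput: one 512-token generation, timed end to end
start=$(date +%s.%N)
tokens=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-122b","prompt":"Write a long story.","max_tokens":512,"ignore_eos":true}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["usage"]["completion_tokens"])')
end=$(date +%s.%N)
awk -v t="$tokens" -v s="$start" -v e="$end" \
  'BEGIN { printf "%d tokens in %.1f s = %.1f tok/s\n", t, e - s, t / (e - s) }'
```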
Known warnings at startup
These appear in logs and are safe to ignore:
```
UserWarning: flashinfer is not available, falling back to xformers
The model config specified num_hidden_layers=94 but the actual number...
```
The num_hidden_layers warning is a display artifact from the MoE architecture — the model loads and runs correctly.
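Rather than watching for a particular log line, script readiness off vLLM's /health endpoint:

```bash
# Poll until the server answers 200 on /health, then proceed
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 5
done
echo "vLLM is up"
```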
What doesn’t work: Qwen3.6
Qwen3.6 was released in April 2026 with two variants — 27B dense and 35B-A3B MoE. There is no Qwen3.6-72B or Qwen3.6-122B. If you’re looking for a 70B+ model in the Qwen3 family for a single DGX Spark, Qwen3.5-122B-A10B NVFP4 is currently the only option.
Qwen3.6-27B (dense) introduces a GDN hybrid-attention architecture with a kernel gap on the GB10 SoC; as of May 2026, you'll get ~14–21 tok/s on stock NGC images. Qwen3.6-35B-A3B (pure MoE) runs at 100+ tok/s but tops out at 35B parameters.