The NVIDIA DGX Spark (GB10 SoC, 128GB unified LPDDR5X memory) can run Qwen3.5-122B-A10B — a 122B parameter MoE model — at usable throughput for production workloads. Here’s what it actually takes.
The key constraint: NVFP4 only
Qwen3.5-122B-A10B at full precision is ~250GB (122B parameters at BF16, two bytes each, is ~244GB before activations). In NVFP4 quantization it's ~75GB: 4-bit weights alone are ~61GB, with the remainder presumably quantization scales plus layers kept at higher precision. That fits comfortably in 128GB of unified memory. There is no other quantization path that both fits and runs correctly on the GB10.
The only verified checkpoint we’ve found: bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 on HuggingFace, which includes 15 patches for the SM121 architecture. Use this; don’t try to quantize the base model yourself unless you’re prepared to debug SM121-specific kernel failures.
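A minimal way to pull that checkpoint, assuming the standard huggingface-cli tooling (the local directory is whatever you'll later mount into the container):

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 \
  --local-dir /path/to/model
```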
SM121 vs SM120
The DGX Spark GB10 is SM121 (compute capability 12.1). Most documentation, NVIDIA's included, targets SM120, the consumer Blackwell architecture (RTX 50-series). They're close but not identical: several vLLM optimizations that work on SM120 either fail silently or crash on SM121.
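You can confirm what your unit reports via nvidia-smi (the compute_cap query field requires a reasonably recent driver); a GB10 should come back as 12.1:

```bash
# Print GPU name and compute capability; expect "... , 12.1" on GB10
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```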
Hard rules for SM121:

- Never add `--enforce-eager`. It disables CUDA graphs and tanks throughput.
- Never set `--gpu-memory-utilization` above 0.90. SM121 OOMs beyond that.
- The MoE backend must be `marlin`. The default backend produces garbage tokens on SM121.
- Stay on NVIDIA driver 580.x. 590.x has a regression on this chip (see the check after the flags below).
In vLLM flag form (excerpted from the full command below):

```bash
--moe-backend marlin \
--gpu-memory-utilization 0.85 \
--max-model-len 8192
```
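A cheap pre-launch guard for the driver rule, sketched with standard nvidia-smi queries:

```bash
# Expect a 580.x driver on GB10; 590.x has a known regression here
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
case "$driver" in
  580.*) echo "Driver $driver: OK" ;;
  *)     echo "Driver $driver: untested on GB10, stay on 580.x" >&2; exit 1 ;;
esac
```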
Docker run command
```bash
docker run --rm -it \
  --gpus all \
  --shm-size=16g \
  -v /path/to/model:/model \
  -p 8000:8000 \
  bjk110/vllm-spark:latest \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --served-model-name qwen3.5-122b \
    --dtype auto \
    --moe-backend marlin \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --port 8000
```
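Once it's up, the server speaks the standard OpenAI-compatible API. A quick smoke test (the model name matches `--served-model-name` above):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5-122b",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```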
Performance
On a single DGX Spark (no clustering):
| Metric | Result |
|---|---|
| Throughput | ~51 tok/s (generation, single stream) |
| TTFT | ~2–4s (8K context) |
| Memory at load | ~92GB / 128GB |
| Stable uptime | Yes — runs indefinitely, no OOM |
51 tok/s is fast enough for agentic workloads where the bottleneck is tool calls, not token generation.
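To reproduce the single-stream number, a crude but honest measurement is one long completion timed end to end. This assumes the server above; `ignore_eos` is a vLLM extension that prevents early stopping, and the timing includes TTFT, so it slightly understates pure generation speed:

```bash
# Crude single-stream throughput: one 512-token generation, timed end to end
start=$(date +%s.%N)
tokens=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-122b","prompt":"Write a long story.","max_tokens":512,"ignore_eos":true}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["usage"]["completion_tokens"])')
end=$(date +%s.%N)
awk -v t="$tokens" -v s="$start" -v e="$end" \
  'BEGIN { printf "%d tokens in %.1f s = %.1f tok/s\n", t, e - s, t / (e - s) }'
```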
Known warnings at startup
These appear in logs and are safe to ignore:
```
UserWarning: flashinfer is not available, falling back to xformers
The model config specified num_hidden_layers=94 but the actual number...
```
The num_hidden_layers warning is a display artifact from the MoE architecture — the model loads and runs correctly.
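Rather than watching for a particular log line, script readiness off vLLM's /health endpoint:

```bash
# Poll until the server answers 200 on /health, then proceed
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 5
done
echo "vLLM is up"
```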
What doesn’t work: Qwen3.6
Qwen3.6 was released in April 2026 with two variants — 27B dense and 35B-A3B MoE. There is no Qwen3.6-72B or Qwen3.6-122B. If you’re looking for a 70B+ model in the Qwen3 family for a single DGX Spark, Qwen3.5-122B-A10B NVFP4 is currently the only option.
Qwen3.6-27B (dense) introduces a GDN hybrid-attention architecture with a kernel gap on the GB10 SoC; as of May 2026, you'll get ~14–21 tok/s on stock NGC images. Qwen3.6-35B-A3B (pure MoE) runs at 100+ tok/s but tops out at 35B parameters.