The DGX Spark's GB10 (the Grace Blackwell GB10 Superchip) is SM121. It is not the same silicon as datacenter Blackwell (SM100, the B200/GB200 class), and it is not Hopper (H100/H200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121.
This is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks — single-node and two-node cluster configurations.
Attention and MoE backends
Use --attention-backend=TRITON_ATTN. Do not use FlashInfer.
FlashInfer has confirmed accuracy bugs with FP8 models on Blackwell (vLLM issue #35138). On SM121 specifically, the FlashInfer MoE backends are unavailable — vLLM silently falls back to Triton anyway. Forum posts recommending FlashInfer are written for datacenter Blackwell. For SM121, set Triton explicitly and move on.
For MoE models, --moe-backend=marlin is required.
CUTLASS FP4 is broken on SM121. It produces garbage output silently — inference runs, generation looks plausible, but outputs are wrong. The correct MoE kernel for SM121 is Marlin. Set it explicitly:
--attention-backend=TRITON_ATTN
--moe-backend=marlin
Also set this environment variable:
VLLM_USE_FLASHINFER_MOE_FP4=0
Without it, MoE routing can go through the broken CUTLASS FP4 path.
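Put together, a minimal single-node launch looks like the sketch below. The model name is a placeholder; the flags and the env var are the ones above.

```sh
# Minimal launch sketch for SM121. <your-model> is a placeholder.
export VLLM_USE_FLASHINFER_MOE_FP4=0   # keep MoE off the broken CUTLASS FP4 path

vllm serve <your-model> \
  --attention-backend=TRITON_ATTN \
  --moe-backend=marlin
```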
Never use --enforce-eager
CUDAGraph is not optional on SM121. Disabling it with --enforce-eager cuts throughput roughly 55% — from ~59 tok/s to ~26 tok/s in our measurements. There is no scenario where this tradeoff is worth it. If you are adding it to work around a startup issue, fix the underlying issue instead.
Unified memory ceiling
The GB10 uses unified LPDDR5X memory — CPU and GPU share the same physical pool. The OS page cache competes directly with CUDA for this memory. Setting --gpu-memory-utilization too high causes OOM crashes or Xid 43 GPU channel preemption under load.
Hard limit: never exceed 0.90. In practice, 0.85–0.87 is the safe range for most models. At 131K context lengths with large KV caches, we needed to drop to 0.82 to stop Xid 43 errors on the two-node cluster.
If you hit OOM at startup, drop page cache before restarting:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
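As a restart sequence, that looks like the sketch below; the utilization value is from the safe range above, and the model name is a placeholder.

```sh
# Sketch: clear the page cache, then restart with a conservative budget.
# 0.85 is a safe default; drop toward 0.82 for 131K-context workloads.
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
vllm serve <your-model> --gpu-memory-utilization 0.85
```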
Driver version
Stay on 580.x. Do not upgrade to 590.x.
NVIDIA driver 590.x has a confirmed CUDAGraph deadlock on GB10. At the time of writing, 580.142 is the correct version for SM121. Verify before upgrading any NGC container:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
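If container launches are scripted, a guard along these lines (illustrative, not part of any NGC image) fails fast on the wrong driver branch:

```sh
# Illustrative guard: refuse to start on anything other than a 580.x driver.
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
case "$driver" in
  580.*) echo "driver $driver OK" ;;
  *)     echo "driver $driver is not 580.x; aborting" >&2; exit 1 ;;
esac
```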
Flags that break NGC containers
--load-format fastsafetensors causes an ImportError in NGC 26.03 and 26.04. It is available in some community-built vLLM containers but not in stock NGC images. Omit it. The default mmap format is slower to load on startup but identical at runtime.
VLLM_MARLIN_USE_ATOMIC_ADD=1 is required for Marlin on SM121. Without it, there is a race condition in the Marlin kernel that produces incorrect outputs. Set it in your environment:
VLLM_MARLIN_USE_ATOMIC_ADD=1
Quantization: what works with what checkpoint format
The NGC 26.04 mxfp4 weight loader only handles gpt-oss pre-quantized checkpoints — specifically the format where expert weights are stored as 3D uint8 tensors. Standard HuggingFace BF16 checkpoints (Qwen3, Llama, etc.) are 2D BF16 tensors. Loading them with --quantization=mxfp4 produces an IndexError in fused_moe/layer.py, and after patching that, a dtype/shape mismatch.
If you are running gpt-oss-120b: --quantization=mxfp4 works.
If you are running any standard HuggingFace MoE checkpoint: use --quantization=gptq_marlin (for GPTQ-Int4 checkpoints) or --quantization=fp8 (for FP8 checkpoints). There is no online BF16→mxfp4 quantization path in the NGC build.
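In command form (checkpoint names are placeholders):

```sh
# gpt-oss pre-quantized checkpoint: the mxfp4 loader applies.
vllm serve <gpt-oss-120b-checkpoint> --quantization=mxfp4

# Standard HF MoE checkpoints: match the flag to the checkpoint's existing
# quantization. There is no online BF16-to-mxfp4 path in the NGC build.
vllm serve <gptq-int4-checkpoint> --quantization=gptq_marlin
vllm serve <fp8-checkpoint> --quantization=fp8
```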
Reasoning models and content: null
Reasoning models (gpt-oss, Qwen3 in thinking mode) generate chain-of-thought tokens before producing content. These tokens consume the max_tokens budget. If max_tokens is too low, the model exhausts the budget during reasoning and returns content: null.
Set max_tokens to at least 512 for any reasoning model. For tool-calling workflows or complex prompts, 1024 or higher.
For latency-sensitive calls with Qwen3, append /no_think to the prompt to skip reasoning mode entirely.
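A request sketch, assuming the default OpenAI-compatible endpoint on port 8000 and a placeholder model name:

```sh
# Leave headroom for reasoning tokens so content does not come back null.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "<your-model>",
        "messages": [{"role": "user", "content": "Plan a three-step rollout."}],
        "max_tokens": 1024
      }'
```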
Multi-node: NCCL and Ray on SM121
Each Spark has a single GPU and there is no NVSwitch, so tensor parallelism across two Sparks requires Ray + NCCL over a direct interconnect. We use a 400G QSFP-DD DAC between the two nodes, presenting as 2×200G RoCEv2.
Several things will silently fail or deadlock without explicit configuration.
network_mode: host is required. Bridge networking blocks NCCL rendezvous. This is not optional.
NCCL_IB_HCA must be set. Without pinning NCCL to the RoCE interfaces, NCCL on SM121 can deadlock silently — no error, no output, just a hung process.
NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
NCCL_IB_DISABLE=0
NCCL_IB_GID_INDEX=3 # RoCEv2 + IPv4; verify with show_gids on your hardware
NCCL_NET_PLUGIN=none # prevents AWS OFI plugin TCP fallback
NCCL_IB_ROCE_VERSION_NUM=2
NCCL_IB_TIMEOUT=22
RAY_memory_monitor_refresh_ms=0 is required. Ray’s memory monitor can kill inference processes mid-run on unified-memory systems. Set it to 0 to disable it.
Triton needs a ptxas symlink. The NGC container’s Triton backend looks for ptxas at a path that does not exist in the image. Add this to the container entrypoint:
mkdir -p /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin
ln -sf /usr/local/cuda/bin/ptxas \
/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
Without it, Triton JIT compilation fails at the first inference call.
GLOO_SOCKET_IFNAME must point to the interconnect interface. Ray gloo rendezvous needs to use the direct QSFP-DD interface, not a random one:
GLOO_SOCKET_IFNAME=enp1s0f1np1
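Putting the pieces above together, a head-node launch might look like this sketch. The image tag, env values, interface names, and ptxas symlink come from this section; the Ray and vLLM startup commands (port, parallelism flags) are illustrative, and the model name is a placeholder. The worker node mirrors it with its own interface names and joins the Ray cluster instead of starting it.

```sh
docker run --gpus all --network host --ipc host \
  -e NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1 \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_NET_PLUGIN=none \
  -e NCCL_IB_ROCE_VERSION_NUM=2 \
  -e NCCL_IB_TIMEOUT=22 \
  -e GLOO_SOCKET_IFNAME=enp1s0f1np1 \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  nvcr.io/nvidia/vllm:26.04-py3 \
  bash -c '
    # ptxas symlink from above, then Ray head, then vLLM across both nodes
    mkdir -p /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin
    ln -sf /usr/local/cuda/bin/ptxas \
      /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
    ray start --head --port=6379
    vllm serve <your-model> \
      --attention-backend=TRITON_ATTN --moe-backend=marlin \
      --tensor-parallel-size 2 --distributed-executor-backend ray
  '
```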
Prefix cache hit rate metric
The gpu_prefix_cache_hit_rate gauge is not present in vLLM 0.19.0. Calculate it from raw counters:
hits = vllm:prefix_cache_hits_total
queries = vllm:prefix_cache_queries_total
rate = hits / queries
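A one-liner sketch against the Prometheus endpoint, assuming metrics are exposed at localhost:8000/metrics:

```sh
# Compute the hit rate from the raw counters exposed by the server.
curl -s localhost:8000/metrics | awk '
  /^vllm:prefix_cache_hits_total/    { hits = $2 }
  /^vllm:prefix_cache_queries_total/ { queries = $2 }
  END { if (queries > 0) printf "prefix cache hit rate: %.1f%%\n", 100 * hits / queries }'
```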
Summary table
| Rule | Why |
|---|---|
| --attention-backend=TRITON_ATTN | FlashInfer accuracy bugs on Blackwell FP8; MoE backends unavailable on SM121 |
| --moe-backend=marlin | CUTLASS FP4 broken on SM121; silent garbage output |
| VLLM_USE_FLASHINFER_MOE_FP4=0 | Prevents MoE routing through broken CUTLASS path |
| VLLM_MARLIN_USE_ATOMIC_ADD=1 | Marlin race condition on SM121 |
| Never --enforce-eager | Disables CUDAGraph; ~55% throughput drop |
| --gpu-memory-utilization ≤ 0.90 | Unified memory; page cache competes with CUDA |
| Stay on driver 580.x | 590.x has CUDAGraph deadlock on GB10 |
| No --load-format fastsafetensors | ImportError in NGC 26.03/26.04 |
| mxfp4 quantization: gpt-oss checkpoints only | Loader is format-specific; use gptq_marlin or fp8 for standard HF checkpoints |
| max_tokens ≥ 512 for reasoning models | Reasoning tokens consume budget before content; low values return content: null |
| network_mode: host for multi-node | Bridge networking blocks NCCL rendezvous |
| NCCL_IB_HCA set explicitly | Silent deadlock without interface pinning |
| RAY_memory_monitor_refresh_ms=0 | Ray kills processes mid-inference on unified memory |
| ptxas symlink in entrypoint | Triton JIT fails without it on SM121 |