Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here — some actively break things.


SM121 is not datacenter Blackwell

| Feature | DGX Spark (GB10 / SM121) | Datacenter Blackwell (B100/B200) |
|---|---|---|
| TMEM | No | Yes |
| WGMMA | No | Yes |
| DSMEM | No | Yes |
| NVSwitch | No | Yes |
| CUTLASS FP4 | Broken (silent garbage output) | Supported |
| Memory type | Unified LPDDR5X (shared CPU+GPU) | HBM3e (GPU-only) |
| Memory per unit | 128 GB | 192 GB |
| GPUs per unit | 1 logical GPU | 1 GPU |

When you see forum recommendations or vLLM flags that say “for Blackwell” — verify they’re for SM121 specifically before using them.
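
A quick way to check what you have is to query the compute capability directly. GB10 reports 12.1 (hence SM121):

```bash
# GB10 reports compute capability 12.1 (SM121); datacenter
# Blackwell parts report a different value.
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```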


Memory architecture

The 128 GB LPDDR5X pool is shared between CPU and GPU. There is no separate VRAM. This affects everything:

  • The Linux page cache competes directly with CUDA allocations. A warm filesystem cache can push CUDA into OOM.
  • Never set `--gpu-memory-utilization` above 0.90. In practice, 0.85–0.87 is the safe range. At long context lengths with large KV caches, you may need to drop further.
  • If you hit OOM at container startup, drop the page cache first: `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`
  • Ray’s memory monitor misreads page cache pressure as real memory pressure and will kill vLLM mid-inference. In multi-node configurations, always set `RAY_memory_monitor_refresh_ms=0` (see the launch sketch after this list).
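
Putting the memory rules together, a minimal launch sketch; the model name and port are placeholders:

```bash
# Clear the page cache, then start vLLM with conservative memory headroom.
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

RAY_memory_monitor_refresh_ms=0 \
  vllm serve <model> \
    --gpu-memory-utilization 0.85 \
    --port 8000
```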

GPU count and tensor parallelism

Each DGX Spark presents as one logical GPU. Tensor parallelism beyond TP=1 requires a Ray cluster spanning multiple units — there is no NVLink or NVSwitch. Inter-node communication runs over NCCL via a direct RoCE interconnect.
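
A minimal two-node sketch, assuming the RoCE link is already addressed as described in the network section below; the IP and model name are placeholders:

```bash
# On the first Spark (Ray head node):
ray start --head --port=6379

# On the second Spark, join over the direct link:
ray start --address=192.168.100.10:6379

# Back on the head node: one logical GPU per unit, TP=2 across both.
vllm serve <model> \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray
```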


Software stack

| Component | Version |
|---|---|
| OS | Ubuntu 24.04.4 LTS |
| Kernel | 6.17.0-1014-nvidia |
| NVIDIA driver | 580.142 |
| CUDA (in NGC container) | 13.0 |
| Docker CE | 29.2.1 |
| NVIDIA Container Toolkit | 1.19.0-1 |

Driver: Stay on 580.x. Driver 590.x has a confirmed CUDAGraph deadlock on GB10. Pin the package to prevent accidental upgrades:

```bash
sudo apt-mark hold nvidia-driver-580
```
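
Confirm the hold is active before the next apt upgrade run:

```bash
apt-mark showhold | grep nvidia
```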

QSFP-DD direct interconnect

For multi-node setups, a direct QSFP-DD cable between two Sparks gives you a high-bandwidth RoCE link without a switch.

Cable: A 400G DAC passive copper cable (e.g., Amphenol NJAAKK-N911, 1m) presents as two independent 200 Gb/s logical interfaces on each machine. Both channels share the same serial number — this is normal for this cable type.
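
To see how the cable enumerates, map the RDMA devices to their interfaces and read the module EEPROM; both logical interfaces should report the same vendor serial number:

```bash
# Map RDMA devices to their network interfaces:
ibdev2netdev

# Read the cable module EEPROM (vendor name, part number, serial):
sudo ethtool -m <interface-name> | grep -i vendor
```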

Measured bandwidth (ib_write_bw)

| Channel | Result |
|---|---|
| Channel 1 | ~13.35 Gb/s |
| Channel 2 | ~13.26 Gb/s |
| Combined | ~26.6 Gb/s |

The below-theoretical numbers are a known ib_write_bw artifact — it defaults to a single queue pair and 4096 B MTU. Actual NCCL throughput with multiple QPs approaches the full line rate. This is not a hardware or config problem.
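
To get numbers closer to line rate, rerun with multiple queue pairs and gigabit reporting; these are standard perftest flags:

```bash
# Server side:
ib_write_bw -q 8 --report_gbits

# Client side (server address is an example):
ib_write_bw -q 8 --report_gbits 192.168.100.10
```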

Network configuration: nmcli only

The DGX Spark is fully NetworkManager-based. Netplan with the networkd renderer silently does nothing — there is no /run/systemd/network/ directory. All persistent interface configuration must use nmcli:

```bash
sudo nmcli con mod <connection-name> \
  ipv4.addresses 192.168.100.10/24 \
  ipv4.method manual \
  802-3-ethernet.mtu 9000
sudo nmcli con up <connection-name>
```

Set MTU to 9000 (jumbo frames) on all QSFP-DD interfaces.
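
Verify the MTU once the connection is up:

```bash
ip link show <interface-name> | grep mtu
```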

RoCE GID index

GID index 3 = RoCEv2 + IPv4 on both machines. Confirm with:

```bash
show_gids | grep <interface-name>
```

This is required for NCCL: NCCL_IB_GID_INDEX=3.
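
In practice, set it alongside the other NCCL interconnect variables. A typical environment for the direct link; the HCA names are examples, so check yours with ibdev2netdev and show_gids:

```bash
export NCCL_IB_GID_INDEX=3                  # RoCEv2 + IPv4, from show_gids
export NCCL_IB_HCA=mlx5_0,mlx5_1            # both 200 Gb/s channels (example names)
export NCCL_SOCKET_IFNAME=<interface-name>  # bootstrap/control interface
```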


Compute kernel compatibility

What works

| Kernel / Format | Status |
|---|---|
| `--attention-backend=TRITON_ATTN` | ✅ Stable |
| `--moe-backend=marlin` | ✅ Required for MoE models |
| `gptq_marlin` quantization | ✅ Fully supported |
| `mxfp4` (gpt-oss format only) | ✅ Works for gpt-oss pre-quantized checkpoints |
| FP8 KV cache (`--kv-cache-dtype=fp8`) | ✅ Safe and recommended |
| CUDAGraph (`--max-cudagraph-capture-size=2048`) | ✅ Required for full throughput |
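
Combined into one command, a sketch of a known-good SM121 launch; the model is a placeholder, the flags are taken from the table above, and flag spellings can shift between vLLM releases:

```bash
vllm serve <gptq-marlin-model> \
  --attention-backend=TRITON_ATTN \
  --moe-backend=marlin \
  --kv-cache-dtype=fp8 \
  --max-cudagraph-capture-size=2048 \
  --gpu-memory-utilization 0.85
```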

What is broken or absent on SM121

| Kernel / Format | Status |
|---|---|
| CUTLASS FP4 | ❌ Produces garbage outputs silently |
| FlashInfer attention | ❌ Accuracy bugs on Blackwell FP8; MoE backends absent on SM121 |
| `mxfp4` on standard HuggingFace BF16 checkpoints | ❌ IndexError + shape mismatch (gpt-oss format only) |
| `--enforce-eager` | ❌ Disables CUDAGraph (~55% throughput loss) |
| `--load-format=fastsafetensors` | ❌ ImportError in NGC 26.03/26.04 |

For full runtime flag guidance, see the vLLM on DGX Spark SM121 post.


Startup time reference

For a two-node Qwen3-235B-A22B-GPTQ-Int4 cluster:

| Phase | Duration |
|---|---|
| Ray cluster formation | ~2 min |
| Model weight loading (118 GB, TP=2) | ~7 min |
| CUDA graph capture + compile | ~5 min |
| Total: first successful inference | ~15 min |

Single-node (120B model): ~8 min total.
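
Since startup takes minutes, poll the server’s /health endpoint rather than guessing; the port is an example:

```bash
# Block until vLLM answers on /health, then proceed.
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for vLLM..."
  sleep 30
done
echo "vLLM is ready"
```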


Temperature monitoring

```bash
# GPU temperature
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader

# CPU/SoC temperature
sudo tegrastats --interval 1000
```

Normal operating range under sustained inference load: GPU 60–75°C, CPU 50–65°C. The GB10 thermal solution handles continuous full-load operation without throttling.
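
To verify the no-throttling claim under your own workload, query the active throttle reasons alongside the temperature; this query field may be unsupported on some platforms:

```bash
nvidia-smi --query-gpu=temperature.gpu,clocks_throttle_reasons.active --format=csv
```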