Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here — some actively break things.
SM121 is not datacenter Blackwell
| Feature | DGX Spark (GB10 / SM121) | Datacenter Blackwell (B100/B200) |
|---|---|---|
| TMEM | No | Yes |
| WGMMA | No | Yes |
| DSMEM | No | Yes |
| NVSwitch | No | Yes |
| CUTLASS FP4 | Broken — silent garbage output | Supported |
| Memory type | Unified LPDDR5X (shared CPU+GPU) | HBM3e (GPU-only) |
| Memory per unit | 128 GB | 192 GB |
| GPUs per unit | 1 logical GPU | 1 GPU |
When you see forum recommendations or vLLM flags that say "for Blackwell", verify they are for SM121 specifically before using them.
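Before applying any "for Blackwell" advice, it helps to confirm what the hardware actually reports. A minimal check (SM121 corresponds to compute capability 12.1; datacenter Blackwell parts report a different value):

```bash
# Query the CUDA compute capability of the local GPU.
# On a DGX Spark (GB10 / SM121) this should print 12.1.
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```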
Memory architecture
The 128 GB LPDDR5X pool is shared between CPU and GPU. There is no separate VRAM. This affects everything:
- The Linux page cache competes directly with CUDA allocations. A warm filesystem cache can push CUDA into OOM.
- Never set `--gpu-memory-utilization` above 0.90. In practice, 0.85–0.87 is the safe range. At long context lengths with large KV caches, you may need to drop further.
- If you hit OOM at container startup, drop the page cache first:

  ```bash
  sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
  ```

- Ray's memory monitor misreads page cache pressure as real memory pressure and will kill vLLM mid-inference. In multi-node configurations, always set:

  ```bash
  RAY_memory_monitor_refresh_ms=0
  ```
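Putting the memory-related settings together, a typical pre-launch sequence looks like the sketch below. The model name and port are placeholders; the flags are the ones discussed above.

```bash
# 1. Drop the page cache so CUDA allocations don't compete with a warm
#    filesystem cache (important right after pulling large model weights).
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

# 2. For multi-node (Ray) setups: keep Ray's memory monitor from killing
#    vLLM on page-cache pressure.
export RAY_memory_monitor_refresh_ms=0

# 3. Launch vLLM with a conservative memory fraction for unified LPDDR5X.
#    <model> and the port are placeholders.
vllm serve <model> \
  --gpu-memory-utilization 0.85 \
  --port 8000
```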
GPU count and tensor parallelism
Each DGX Spark presents as one logical GPU. Tensor parallelism beyond TP=1 requires a Ray cluster spanning multiple units — there is no NVLink or NVSwitch. Inter-node communication runs over NCCL via a direct RoCE interconnect.
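A two-unit TP=2 setup therefore means forming a Ray cluster over the RoCE link first, then launching vLLM with tensor parallelism from the head node. A rough sketch, with the model name as a placeholder and the head address matching the nmcli example further down:

```bash
# On the head node (first Spark):
RAY_memory_monitor_refresh_ms=0 ray start --head --port=6379

# On the worker node (second Spark), pointing at the head's RoCE address:
RAY_memory_monitor_refresh_ms=0 ray start --address=192.168.100.10:6379

# Back on the head node: one logical GPU per unit, so TP=2 spans both.
vllm serve <model> \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray
```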
Software stack
| Component | Version |
|---|---|
| OS | Ubuntu 24.04.4 LTS |
| Kernel | 6.17.0-1014-nvidia |
| NVIDIA driver | 580.142 |
| CUDA (in NGC container) | 13.0 |
| Docker CE | 29.2.1 |
| NVIDIA Container Toolkit | 1.19.0-1 |
Driver: Stay on 580.x. Driver 590.x has a confirmed CUDAGraph deadlock on GB10. Pin the package to prevent accidental upgrades:
```bash
sudo apt-mark hold nvidia-driver-580
```
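To confirm the pin took and that the running driver is still on the 580 series:

```bash
# Should list nvidia-driver-580 among held packages.
apt-mark showhold

# Should report a 580.x version.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```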
QSFP-DD direct interconnect
For multi-node setups, a direct QSFP-DD cable between two Sparks gives you a high-bandwidth RoCE link without a switch.
Cable: A 400G DAC passive copper cable (e.g., Amphenol NJAAKK-N911, 1m) presents as two independent 200 Gb/s logical interfaces on each machine. Both channels share the same serial number — this is normal for this cable type.
Measured bandwidth (ib_write_bw)
| Channel | Result |
|---|---|
| Channel 1 | ~13.35 Gb/s |
| Channel 2 | ~13.26 Gb/s |
| Combined | ~26.6 Gb/s |
These numbers sit far below the theoretical line rate because of a known `ib_write_bw` artifact: by default it uses a single queue pair and a 4096 B MTU. Actual NCCL throughput with multiple QPs approaches the full line rate, so this is not a hardware or configuration problem.
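To see numbers closer to what NCCL actually achieves, rerun the benchmark with multiple queue pairs. A sketch, with the RDMA device name as a placeholder (check `ibv_devices` for yours):

```bash
# Server side (one Spark):
ib_write_bw -d <ib-device> -q 8 --report_gbits

# Client side (the other Spark), pointing at the server's RoCE address:
ib_write_bw -d <ib-device> -q 8 --report_gbits 192.168.100.10
```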
Network configuration: nmcli only
The DGX Spark is fully NetworkManager-based. Netplan with the networkd renderer silently does nothing — there is no /run/systemd/network/ directory. All persistent interface configuration must use nmcli:
```bash
sudo nmcli con mod <connection-name> \
  ipv4.addresses 192.168.100.10/24 \
  ipv4.method manual \
  802-3-ethernet.mtu 9000
sudo nmcli con up <connection-name>
```
Set MTU to 9000 (jumbo frames) on all QSFP-DD interfaces.
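To verify that the address and jumbo-frame MTU are actually active on the interface:

```bash
# MTU should read 9000 and the interface should be UP.
ip link show <interface-name>

# The static address from the nmcli profile should be listed here.
ip addr show <interface-name>
```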
RoCE GID index
GID index 3 = RoCEv2 + IPv4 on both machines. Confirm with:
```bash
show_gids | grep <interface-name>
```
This is required for NCCL: NCCL_IB_GID_INDEX=3.
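For the multi-node Ray/vLLM setup this translates into NCCL environment variables on both machines. The interface and device names below are placeholders; only `NCCL_IB_GID_INDEX=3` is the hard requirement stated above:

```bash
# Force NCCL onto the RoCEv2 + IPv4 GID found via show_gids.
export NCCL_IB_GID_INDEX=3

# Keep NCCL bootstrap traffic on the QSFP-DD link rather than the LAN port
# (interface name is a placeholder).
export NCCL_SOCKET_IFNAME=<qsfp-interface>

# Optional: pin NCCL to the RoCE HCA explicitly (device name is a placeholder).
export NCCL_IB_HCA=<ib-device>
```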
Compute kernel compatibility
What works
| Kernel / Format | Status |
|---|---|
| `--attention-backend=TRITON_ATTN` | ✅ Stable |
| `--moe-backend=marlin` | ✅ Required for MoE models |
| `gptq_marlin` quantization | ✅ Fully supported |
| `mxfp4` (gpt-oss format only) | ✅ Works for gpt-oss pre-quantized checkpoints |
| FP8 KV cache (`--kv-cache-dtype=fp8`) | ✅ Safe and recommended |
| CUDAGraph (`--max-cudagraph-capture-size=2048`) | ✅ Required for full throughput |
What is broken or absent on SM121
| Kernel / Format | Status |
|---|---|
| CUTLASS FP4 | ❌ Produces garbage outputs silently |
| FlashInfer attention | ❌ Accuracy bugs on Blackwell FP8; MoE backends absent on SM121 |
| `mxfp4` on standard HuggingFace BF16 checkpoints | ❌ IndexError + shape mismatch — gpt-oss format only |
| `--enforce-eager` | ❌ Disables CUDAGraph — ~55% throughput loss |
| `--load-format=fastsafetensors` | ❌ ImportError in NGC 26.03/26.04 |
For full runtime flag guidance, see the vLLM on DGX Spark SM121 post.
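As a quick orientation, the "works" column above combines into a launch along these lines; treat it as a sketch and see the linked post for the full flag set:

```bash
# Sketch only: <model> is a placeholder; flag spellings follow the tables above.
# For MoE models, also add --moe-backend=marlin.
vllm serve <model> \
  --attention-backend=TRITON_ATTN \
  --kv-cache-dtype=fp8 \
  --max-cudagraph-capture-size=2048 \
  --gpu-memory-utilization=0.85
```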
Startup time reference
For a two-node Qwen3-235B-A22B-GPTQ-Int4 cluster:
| Phase | Duration |
|---|---|
| Ray cluster formation | ~2 min |
| Model weight loading (118 GB, TP=2) | ~7 min |
| CUDA graph capture + compile | ~5 min |
| Total: first successful inference | ~15 min |
Single-node (120B model): ~8 min total.
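Given the multi-minute startup, a simple readiness loop beats probing by hand. This assumes the standard vLLM OpenAI-compatible server on port 8000; adjust host and port to your deployment:

```bash
# Poll the OpenAI-compatible endpoint until the server answers,
# checking every 30 seconds (startup can take ~15 min on two nodes).
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "vLLM not ready yet..."
  sleep 30
done
echo "vLLM is serving."
```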
Temperature monitoring
```bash
# GPU temperature
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader

# CPU/SoC temperature
sudo tegrastats --interval 1000
```
Normal operating range under sustained inference load: GPU 60–75°C, CPU 50–65°C. The GB10 thermal solution handles continuous full-load operation without throttling.
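To verify that on your own workload, a minimal GPU-temperature log during a long inference run (interval and filename are arbitrary):

```bash
# Append a timestamped GPU temperature sample every 10 seconds.
while true; do
  echo "$(date -Iseconds),$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)" \
    >> gpu_temp_log.csv
  sleep 10
done
```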