Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here — some actively break things.
SM121 is not datacenter Blackwell
| Feature | DGX Spark (GB10 / SM121) | Datacenter Blackwell (B100/B200) |
|---|---|---|
| TMEM | No | Yes |
| WGMMA | No | Yes |
| DSMEM | No | Yes |
| NVSwitch | No | Yes |
| CUTLASS FP4 | Broken — silent garbage output | Supported |
| Memory type | Unified LPDDR5X (shared CPU+GPU) | HBM3e (GPU-only) |
| Memory per unit | 128 GB | 192 GB |
| GPUs per unit | 1 logical GPU | 1 GPU |
When you see forum recommendations or vLLM flags that say "for Blackwell", verify they are for SM121 specifically before using them.
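Before applying any "for Blackwell" advice, it helps to confirm what the hardware actually reports. A minimal check (SM121 corresponds to compute capability 12.1; datacenter Blackwell parts report a different value):

```bash
# Query the CUDA compute capability of the local GPU.
# On a DGX Spark (GB10 / SM121) this should print 12.1.
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```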
Memory architecture
The 128 GB LPDDR5X pool is shared between CPU and GPU. There is no separate VRAM. This affects everything:
- The Linux page cache competes directly with CUDA allocations. A warm filesystem cache can push CUDA into OOM.
- Never set `--gpu-memory-utilization` above 0.90. In practice, 0.85–0.87 is the safe range. At long context lengths with large KV caches, you may need to drop further.
- If you hit OOM at container startup, drop the page cache first:

  ```bash
  sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
  ```

- Ray's memory monitor misreads page cache pressure as real memory pressure and will kill vLLM mid-inference. In multi-node configurations, always set:

  ```bash
  RAY_memory_monitor_refresh_ms=0
  ```
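Putting the memory-related settings together, a typical pre-launch sequence looks like the sketch below. The model name and port are placeholders; the flags are the ones discussed above.

```bash
# 1. Drop the page cache so CUDA allocations don't compete with a warm
#    filesystem cache (important right after pulling large model weights).
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

# 2. For multi-node (Ray) setups: keep Ray's memory monitor from killing
#    vLLM on page-cache pressure.
export RAY_memory_monitor_refresh_ms=0

# 3. Launch vLLM with a conservative memory fraction for unified LPDDR5X.
#    <model> and the port are placeholders.
vllm serve <model> \
  --gpu-memory-utilization 0.85 \
  --port 8000
```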
GPU count and tensor parallelism
Each DGX Spark presents as one logical GPU. Tensor parallelism beyond TP=1 requires a Ray cluster spanning multiple units — there is no NVLink or NVSwitch. Inter-node communication runs over NCCL via a direct RoCE interconnect.
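A two-unit TP=2 setup therefore means forming a Ray cluster over the RoCE link first, then launching vLLM with tensor parallelism from the head node. A rough sketch, with the model name as a placeholder and the head address matching the nmcli example further down:

```bash
# On the head node (first Spark):
RAY_memory_monitor_refresh_ms=0 ray start --head --port=6379

# On the worker node (second Spark), pointing at the head's RoCE address:
RAY_memory_monitor_refresh_ms=0 ray start --address=192.168.100.10:6379

# Back on the head node: one logical GPU per unit, so TP=2 spans both.
vllm serve <model> \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray
```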
Software stack
| Component | Version |
|---|---|
| OS | Ubuntu 24.04.4 LTS |
| Kernel | 6.17.0-1014-nvidia |
| NVIDIA driver | 580.142 |
| CUDA (in NGC container) | 13.0 |
| Docker CE | 29.2.1 |
| NVIDIA Container Toolkit | 1.19.0-1 |
Driver: Stay on 580.x. Driver 590.x has a confirmed CUDAGraph deadlock on GB10. Pin the package to prevent accidental upgrades:
```bash
sudo apt-mark hold nvidia-driver-580
```
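To confirm the pin took and that the running driver is still on the 580 series:

```bash
# Should list nvidia-driver-580 among held packages.
apt-mark showhold

# Should report a 580.x version.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```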
QSFP-DD direct interconnect
For multi-node setups, a direct QSFP-DD cable between two Sparks gives you a high-bandwidth RoCE link without a switch.
Cable: A 400G DAC passive copper cable (e.g., Amphenol NJAAKK-N911, 1m) presents as two independent 200 Gb/s logical interfaces on each machine. Both channels share the same serial number — this is normal for this cable type.
Measured bandwidth (ib_write_bw)
| Channel | Result |
|---|---|
| Channel 1 | ~13.35 Gb/s |
| Channel 2 | ~13.26 Gb/s |
| Combined | ~26.6 Gb/s |
These numbers sit far below the theoretical line rate because of a known `ib_write_bw` artifact: by default it uses a single queue pair and a 4096 B MTU. Actual NCCL throughput with multiple QPs approaches the full line rate, so this is not a hardware or configuration problem.
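To see numbers closer to what NCCL actually achieves, rerun the benchmark with multiple queue pairs. A sketch, with the RDMA device name as a placeholder (check `ibv_devices` for yours):

```bash
# Server side (one Spark):
ib_write_bw -d <ib-device> -q 8 --report_gbits

# Client side (the other Spark), pointing at the server's RoCE address:
ib_write_bw -d <ib-device> -q 8 --report_gbits 192.168.100.10
```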
Network configuration: nmcli only
The DGX Spark is fully NetworkManager-based. Netplan with the networkd renderer silently does nothing — there is no /run/systemd/network/ directory. All persistent interface configuration must use nmcli:
```bash
sudo nmcli con mod <connection-name> \
  ipv4.addresses 192.168.100.10/24 \
  ipv4.method manual \
  802-3-ethernet.mtu 9000
sudo nmcli con up <connection-name>
```
Set MTU to 9000 (jumbo frames) on all QSFP-DD interfaces.
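To verify that the address and jumbo-frame MTU are actually active on the interface:

```bash
# MTU should read 9000 and the interface should be UP.
ip link show <interface-name>

# The static address from the nmcli profile should be listed here.
ip addr show <interface-name>
```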
RoCE GID index
GID index 3 = RoCEv2 + IPv4 on both machines. Confirm with:
```bash
show_gids | grep <interface-name>
```
This is required for NCCL: NCCL_IB_GID_INDEX=3.
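For the multi-node Ray/vLLM setup this translates into NCCL environment variables on both machines. The interface and device names below are placeholders; only `NCCL_IB_GID_INDEX=3` is the hard requirement stated above:

```bash
# Force NCCL onto the RoCEv2 + IPv4 GID found via show_gids.
export NCCL_IB_GID_INDEX=3

# Keep NCCL bootstrap traffic on the QSFP-DD link rather than the LAN port
# (interface name is a placeholder).
export NCCL_SOCKET_IFNAME=<qsfp-interface>

# Optional: pin NCCL to the RoCE HCA explicitly (device name is a placeholder).
export NCCL_IB_HCA=<ib-device>
```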
Compute kernel compatibility
What works
| Kernel / Format | Status |
|---|---|
| `--attention-backend=TRITON_ATTN` | ✅ Stable |
| `--moe-backend=marlin` | ✅ Required for MoE models |
| `gptq_marlin` quantization | ✅ Fully supported |
| `mxfp4` (gpt-oss format only) | ✅ Works for gpt-oss pre-quantized checkpoints |
| FP8 KV cache (`--kv-cache-dtype=fp8`) | ✅ Safe and recommended |
| CUDAGraph (`--max-cudagraph-capture-size=2048`) | ✅ Required for full throughput |
What is broken or absent on SM121
| Kernel / Format | Status |
|---|---|
| CUTLASS FP4 | ❌ Produces garbage outputs silently |
| FlashInfer attention | ❌ Accuracy bugs on Blackwell FP8; MoE backends absent on SM121 |
| `mxfp4` on standard HuggingFace BF16 checkpoints | ❌ IndexError + shape mismatch — gpt-oss format only |
| `--enforce-eager` | ❌ Disables CUDAGraph — ~55% throughput loss |
| `--load-format=fastsafetensors` | ❌ ImportError in NGC 26.03/26.04 |
For full runtime flag guidance, see the vLLM on DGX Spark SM121 post.
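As a quick orientation, the "works" column above combines into a launch along these lines; treat it as a sketch and see the linked post for the full flag set:

```bash
# Sketch only: <model> is a placeholder; flag spellings follow the tables above.
# For MoE models, also add --moe-backend=marlin.
vllm serve <model> \
  --attention-backend=TRITON_ATTN \
  --kv-cache-dtype=fp8 \
  --max-cudagraph-capture-size=2048 \
  --gpu-memory-utilization=0.85
```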
Startup time reference
For a two-node Qwen3-235B-A22B-GPTQ-Int4 cluster:
| Phase | Duration |
|---|---|
| Ray cluster formation | ~2 min |
| Model weight loading (118 GB, TP=2) | ~7 min |
| CUDA graph capture + compile | ~5 min |
| Total: first successful inference | ~15 min |
Single-node (120B model): ~8 min total.
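Given the multi-minute startup, a simple readiness loop beats probing by hand. This assumes the standard vLLM OpenAI-compatible server on port 8000; adjust host and port to your deployment:

```bash
# Poll the OpenAI-compatible endpoint until the server answers,
# checking every 30 seconds (startup can take ~15 min on two nodes).
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "vLLM not ready yet..."
  sleep 30
done
echo "vLLM is serving."
```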
Temperature monitoring
```bash
# GPU temperature
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader

# CPU/SoC temperature
sudo tegrastats --interval 1000
```
Normal operating range under sustained inference load: GPU 60–75°C, CPU 50–65°C. The GB10 thermal solution handles continuous full-load operation without throttling.
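To verify that on your own workload, a minimal GPU-temperature log during a long inference run (interval and filename are arbitrary):

```bash
# Append a timestamped GPU temperature sample every 10 seconds.
while true; do
  echo "$(date -Iseconds),$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)" \
    >> gpu_temp_log.csv
  sleep 10
done
```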