DGX Spark GB10 Hardware Reference: SM121 Architecture, Memory, and Networking

Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here — some actively break things. SM121 is not datacenter Blackwell Feature DGX Spark (GB10 / SM121) Datacenter Blackwell (B100/B200) TMEM No Yes WGMMA No Yes DSMEM No Yes NVSwitch No Yes CUTLASS FP4 Broken — silent garbage output Supported Memory type Unified LPDDR5X (shared CPU+GPU) HBM3e (GPU-only) Memory per unit 128 GB 192 GB GPUs per unit 1 logical GPU 1 GPU When you see forum recommendations or vLLM flags that say “for Blackwell” — verify they’re for SM121 specifically before using them. ...

May 14, 2026 · 4 min · Conselara Labs

vLLM on DGX Spark: What the SM121 Architecture Actually Requires

The DGX Spark GB10 runs SM121 — the Grace Blackwell Superchip. It is not the same silicon as datacenter Blackwell (SM100, H100/H200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121. This is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks — single-node and two-node cluster configurations. ...

May 13, 2026 · 5 min · Conselara Labs

DGX Spark Benchmark Results: vLLM on SM121

Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted. Qwen3-235B-A22B-GPTQ-Int4 — Two-node cluster Date: 2026-05-03 Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87 Batch Avg completion tokens tok/s per request Aggregate tok/s 1 (serial) 256 17.0 17.0 2 (concurrent) 256 12.1 24.1 4 (concurrent) 256 9.1 36.4 Prefix cache: 97% delta hit rate on repeated system prompt. Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile). Weight resident per node: 57.64 GiB. ...

May 9, 2026 · 2 min · Conselara Labs

DGX Spark Model Comparison: What Fits and What Runs (SM121, 128 GB)

Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). Based on tested configurations and community results as of May 2026. Model Architecture Quantization Memory Expected tok/s SM121 notes Qwen3.6-35B-A3B Pure MoE (3B active) FP8 (~35 GB) ✅ easily 100+ Pure MoE, no GDN — fully supported Qwen3.6-27B Dense hybrid (GDN) FP8 (~28 GB) ✅ easily 14–21 (stock) / 136–200 (fork) GDN kernel gap; experimental fork needed for full speed Qwen3-30B-A3B Pure MoE (3.3B active) NVFP4 / FP8 / BF16 (~16–60 GB) ✅ easily 32–50 Solid single-node option; no GDN gpt-oss-120b Sparse MoE (5.1B active) mxfp4 (~61 GB) ✅ 32–60 128K context; proprietary quant format Qwen3.5-122B-A10B Pure MoE (10B active) NVFP4 only (~75 GB) ✅ up to 51 BF16 is 234 GB — does not fit; NVFP4 is the only path Qwen3-235B-A22B Pure MoE (22B active) GPTQ-Int4 (~60 GB/node) ✅ (two nodes) 17–36 agg Requires two DGX Sparks; best quality available Qwen3.5-397B-A17B Pure MoE (17B active) NVFP4 (TP=2) ✅ (two nodes) Unknown SM121 MoE kernel not yet optimized; not recommended Key observations Throughput vs quality tradeoff at single-node: Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) with pure MoE architecture. Qwen3.5-122B-A10B gives the most capable model (10B active parameters) that fits on one node, at 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation — so 51 tok/s is more than sufficient. ...

May 9, 2026 · 2 min · Conselara Labs

vLLM Model Selection for DGX Spark (SM121)

The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don’t. This is a practical guide based on what we’ve tested in production. The key constraint: SM121 kernel compatibility Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel: Marlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4 CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121. ...

May 9, 2026 · 4 min · Conselara Labs