Building a Two-Node Ray Cluster for Distributed LLM Inference on DGX Spark

Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory — enough in theory, but once CUDA overhead and KV cache are factored in, it’s tight. Running it across two Sparks with TP=2 gives headroom for real workloads. Each DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly. ...
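The two-node bootstrap can be sketched roughly as follows. This is a minimal sketch, not the post's exact commands: the interconnect IP and port are placeholders, and the final comment assumes vLLM's standard Ray-backed launch.

```shell
# On the head node (hypothetical direct-link IP 192.168.100.1):
ray start --head --node-ip-address=192.168.100.1 --port=6379

# On the second Spark, join over the same direct link:
ray start --address=192.168.100.1:6379

# Verify both nodes (and both logical GPUs) are visible before serving:
ray status

# Then launch `vllm serve <model> --tensor-parallel-size 2
# --distributed-executor-backend ray` from the head node only;
# Ray places one TP worker per Spark.
```

If `ray status` shows only one node, NCCL traffic is likely routing over the wrong interface, which is one of the silent failure modes the post covers.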

May 14, 2026 · 5 min · Conselara Labs

vLLM on DGX Spark: What the SM121 Architecture Actually Requires

The DGX Spark GB10 runs SM121 — the Grace Blackwell Superchip. It is not the same silicon as datacenter Blackwell (SM100, the B200/GB200 class) or Hopper (H100/H200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121. This is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks — single-node and two-node cluster configurations. ...
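A quick way to confirm which silicon a box actually reports, before trusting any forum advice written for other parts (the 12.1 value is inferred from the SM121 designation and worth verifying on your own unit):

```shell
# Query the reported CUDA compute capability. SM121 should report 12.1;
# datacenter Blackwell (SM100) reports 10.0 and Hopper (SM90) reports 9.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```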

May 13, 2026 · 5 min · Conselara Labs

DGX Spark Benchmark Results: vLLM on SM121

Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted.

Qwen3-235B-A22B-GPTQ-Int4 — Two-node cluster

Date: 2026-05-03
Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87

| Batch          | Avg completion tokens | tok/s per request | Aggregate tok/s |
|----------------|-----------------------|-------------------|-----------------|
| 1 (serial)     | 256                   | 17.0              | 17.0            |
| 2 (concurrent) | 256                   | 12.1              | 24.1            |
| 4 (concurrent) | 256                   | 9.1               | 36.4            |

Prefix cache: 97% delta hit rate on repeated system prompt. Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile). Weight resident per node: 57.64 GiB. ...
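Assembled from the config line above, the serve invocation would look roughly like this. A sketch only: the model path and the expert-parallel flag spelling are assumptions; the remaining flags are quoted from the config.

```shell
# Hypothetical repo id; substitute your local GPTQ-Int4 checkpoint path.
vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --distributed-executor-backend ray \
  --attention-backend TRITON_ATTN \
  --quantization gptq_marlin \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.87
```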

May 9, 2026 · 2 min · Conselara Labs

DGX Spark Model Comparison: What Fits and What Runs (SM121, 128 GB)

Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). Based on tested configurations and community results as of May 2026.

| Model | Architecture | Quantization (memory) | Fits | Expected tok/s | SM121 notes |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 (~35 GB) | ✅ easily | 100+ | Pure MoE, no GDN — fully supported |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 (~28 GB) | ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; experimental fork needed for full speed |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 (~16–60 GB) | ✅ easily | 32–50 | Solid single-node option; no GDN |
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 (~61 GB) | ✅ | 32–60 | 128K context; proprietary quant format |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 only (~75 GB) | ✅ | up to 51 | BF16 is 234 GB — does not fit; NVFP4 is the only path |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 (~60 GB/node) | ✅ (two nodes) | 17–36 agg | Requires two DGX Sparks; best quality available |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 (TP=2) | ✅ (two nodes) | Unknown | SM121 MoE kernel not yet optimized; not recommended |

Key observations

Throughput vs quality tradeoff at single node: Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) with a pure MoE architecture. Qwen3.5-122B-A10B gives the most capable model (10B active parameters) that fits on one node, at 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation — so 51 tok/s is more than sufficient. ...
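The fits/doesn't-fit calls in the comparison come down to simple arithmetic; here is a minimal sketch (the `fits` helper is hypothetical, and it counts raw weights only — real checkpoints add quantization scales, embeddings, activations, and KV cache on top, which is why 4-bit Qwen3.5-122B lands near 75 GB rather than 61 GB):

```shell
# Back-of-envelope weight footprint: params (in billions) x bits / 8 = GB,
# compared against the Spark's 128 GB of unified memory.
fits() {
  awk -v p="$1" -v b="$2" \
    'BEGIN { gb = p * b / 8; printf "%s GB, %s\n", gb, (gb < 128 ? "fits" : "does not fit") }'
}

fits 122 16   # BF16 Qwen3.5-122B: raw weights alone exceed 128 GB
fits 122 4    # 4-bit: ~61 GB of weights, before NVFP4 overhead
```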

May 9, 2026 · 2 min · Conselara Labs

vLLM Model Selection for DGX Spark (SM121)

The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don’t. This is a practical guide based on what we’ve tested in production.

The key constraint: SM121 kernel compatibility

Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:

- Marlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4
- CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use
- GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed

Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121. ...
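Because the CUTLASS FP4 failure mode is silent garbage rather than a crash, it is worth smoke-testing an actual completion after every kernel or quantization change. A sketch against vLLM's OpenAI-compatible endpoint (port and model name are placeholders):

```shell
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "your-model", "prompt": "The capital of France is", "max_tokens": 8}'
# Garbled or repetitive tokens here usually indicate a bad kernel path,
# not a bad model: the server starts and responds normally either way.
```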

May 9, 2026 · 4 min · Conselara Labs

Running Qwen3.5-122B on a Single DGX Spark

The NVIDIA DGX Spark (GB10 SoC, 128 GB unified LPDDR5X memory) can run Qwen3.5-122B-A10B — a 122B-parameter MoE model — at usable throughput for production workloads. Here’s what it actually takes.

The key constraint: NVFP4 only

Qwen3.5-122B-A10B at full precision is ~250 GB. In NVFP4 quantization it’s ~75 GB, which fits comfortably in 128 GB unified memory. There is no other quantization path that both fits and runs correctly on the GB10. The only verified checkpoint we’ve found: bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 on HuggingFace, which includes 15 patches for the SM121 architecture. Use this; don’t try to quantize the base model yourself unless you’re prepared to debug SM121-specific kernel failures. ...
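Pulling the verified checkpoint would look roughly like this. A sketch under assumptions: the repo id is the one named in the post, but the local directory and the expectation that vLLM auto-detects the NVFP4 quantization from the checkpoint config are mine.

```shell
# Download the pre-patched NVFP4 checkpoint (repo id from the post):
huggingface-cli download bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 \
  --local-dir ./SPARK_Qwen3.5-122B-A10B-NVFP4

# Serve from the local copy on a single Spark; quantization settings
# are assumed to be picked up from the checkpoint's own config.
vllm serve ./SPARK_Qwen3.5-122B-A10B-NVFP4
```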

May 5, 2026 · 3 min · Conselara Labs