Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory — enough in theory, but once CUDA overhead and KV cache are factored in, it’s tight. Running it across two Sparks with TP=2 gives headroom for real workloads.
Each DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly.
For hardware background and kernel compatibility, see the DGX Spark GB10 hardware reference.
Prerequisites
- Two DGX Sparks on the same LAN, reachable by hostname
- NVIDIA driver 580.x on both — do not use 590.x (CUDAGraph deadlock on GB10)
- Docker CE + NVIDIA Container Toolkit installed and working on both
- NGC container image pulled on both:
nvcr.io/nvidia/vllm:26.04-py3
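A small preflight check can catch a wrong driver branch before anything else is debugged. This is a sketch, not part of the official tooling: the `check_driver` helper and the `node-a`/`node-b` hostnames are illustrative; `nvidia-smi --query-gpu=driver_version` is a standard query.

```shell
# Preflight sketch: fail fast if a node is not on the required 580.x branch.
# check_driver takes a version string so the policy is testable in isolation.
check_driver() {
  case "$1" in
    580.*) return 0 ;;   # required branch
    *)     return 1 ;;   # anything else (incl. 590.x) is rejected
  esac
}

# Run against each node (hostnames are illustrative):
#   for h in node-a node-b; do
#     v=$(ssh "$h" nvidia-smi --query-gpu=driver_version --format=csv,noheader)
#     check_driver "$v" || echo "WARN: $h is on driver $v, expected 580.x"
#   done
```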
Step 1: Direct link — static IPs and MTU
Connect the two Sparks with a QSFP-DD DAC cable. A 400G passive copper cable presents as two independent 200 Gb/s interfaces on each machine. Assign static IPs on a dedicated subnet with jumbo MTU.
On node-a:
sudo nmcli con mod <qsfp-ch1> ipv4.addresses 192.168.100.10/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con mod <qsfp-ch2> ipv4.addresses 192.168.100.12/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con up <qsfp-ch1>
sudo nmcli con up <qsfp-ch2>
On node-b:
sudo nmcli con mod <qsfp-ch1> ipv4.addresses 192.168.100.11/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con mod <qsfp-ch2> ipv4.addresses 192.168.100.13/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con up <qsfp-ch1>
sudo nmcli con up <qsfp-ch2>
Use nmcli only — not Netplan. Netplan with the networkd renderer silently does nothing on DGX Spark (no /run/systemd/network/).
Verify (from node-a):
ping -c 4 192.168.100.11 # ch1
ping -c 4 192.168.100.13 # ch2
Expect 0% loss, <1.5 ms RTT on both channels.
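A plain ping succeeds even if the MTU silently fell back to 1500, so it does not prove jumbo frames work. A stricter check sends a maximum-size packet with fragmentation disabled (`ping -M do` on Linux iputils); the payload arithmetic below accounts for the 20-byte IP header and 8-byte ICMP header.

```shell
# Jumbo-frame check: largest unfragmentable ICMP payload for MTU 9000.
MTU=9000
PAYLOAD=$((MTU - 28))   # 28 = 20-byte IP header + 8-byte ICMP header

# Run this on node-a; if it fails with "message too long", jumbo MTU is
# not active end-to-end on the link.
echo "ping -M do -s ${PAYLOAD} -c 4 192.168.100.11"
```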
Step 2: RoCE verification
Confirm RDMA/RoCE is active before proceeding. If NCCL falls back to TCP, you will get roughly 1/10th the bandwidth with no error message.
Install perftest if needed:
sudo apt install -y perftest
On node-b (server):
ib_write_bw -d <roce-interface> --report_gbits
On node-a (client):
ib_write_bw -d <roce-interface> 192.168.100.11 --report_gbits
Expected: 13+ Gb/s. The output header should show Link type: Ethernet and a GID-based address — confirming RoCE, not TCP.
Also confirm GID index 3 is RoCEv2+IPv4 on both machines:
show_gids | grep <interface-name>
# index 3 should show an IPv4 address and RoCEv2
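For scripted setups, the GID check above can be automated. The helper below is a sketch of mine (not part of perftest): it assumes `show_gids`' usual column layout (device, port, index, GID, IPv4, version, netdev) and asserts that index 3 carries an IPv4 address and RoCE v2.

```shell
# Assert GID index 3 is RoCEv2 with an IPv4 address.
# Column layout assumed: dev port index gid ipv4 ver netdev
gid3_is_rocev2_ipv4() {
  awk '$3 == 3 && $6 == "v2" && $5 ~ /^[0-9]+\./ { found = 1 }
       END { exit !found }'
}

# Usage on a live node:
#   show_gids | grep <interface-name> | gid3_is_rocev2_ipv4 \
#     || echo "GID 3 is not RoCEv2/IPv4 — fix before setting NCCL_IB_GID_INDEX=3"
```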
Step 3: Download and sync the model
Download on one node (node-b in this example):
docker run --rm \
-v /home/$USER/vllm-cache:/root/.cache/huggingface \
-e HF_HOME=/root/.cache/huggingface \
nvcr.io/nvidia/vllm:26.04-py3 \
huggingface-cli download Qwen/Qwen3-235B-A22B-GPTQ-Int4
~118 GB across 32 shards. At ~75 MB/s on a typical connection, expect 25–30 minutes.
The container writes model files as root. Fix ownership before syncing:
MYUID=$(id -u); MYGID=$(id -g)
docker run --rm \
-v /home/$USER/vllm-cache/hub/models--Qwen--Qwen3-235B-A22B-GPTQ-Int4:/target \
alpine sh -c "chown -R ${MYUID}:${MYGID} /target"
Then sync to the other node over the QSFP-DD link (~500 MB/s, ~4 minutes). Run these on the receiving node; SRC points at the downloading node's direct-link IP (192.168.100.11, i.e. node-b):
MODEL=models--Qwen--Qwen3-235B-A22B-GPTQ-Int4
SRC=$USER@192.168.100.11:/home/$USER/vllm-cache/hub/${MODEL}
DST=/home/$USER/vllm-cache/hub/${MODEL}
mkdir -p ${DST}/blobs ${DST}/snapshots ${DST}/refs
rsync -av --progress ${SRC}/blobs/ ${DST}/blobs/
rsync -av ${SRC}/snapshots/ ${DST}/snapshots/
rsync -av ${SRC}/refs/ ${DST}/refs/
The QSFP-DD link makes model sync fast enough to be practical. Over a standard gigabit LAN, the same transfer takes ~20 minutes.
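After syncing, it is worth confirming both nodes hold byte-identical weights, since a truncated shard fails only at load time. The `tree_sum` helper below is my sketch: it produces one deterministic digest per directory tree, so you can run it on each node and compare a single line.

```shell
# Deterministic digest of all file contents + relative names under a dir.
# Identical trees produce identical digests; any differing shard changes it.
tree_sum() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) \
    | sha256sum | cut -d' ' -f1
}

# Run on both nodes and compare the output:
#   tree_sum /home/$USER/vllm-cache/hub/${MODEL}/blobs
```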
Step 4: Docker Compose structure
Two compose files — one per node. The key differences:
| Setting | node-a (head) | node-b (worker) |
|---|---|---|
| VLLM_HOST_IP | 192.168.100.10 | 192.168.100.11 |
| Entrypoint | ray start --head then vllm serve | retry loop: ray start --address=192.168.100.10:6379 --block |
| Healthcheck | curl /health | ray status |
Both files share:
- `network_mode: host` — required; bridge networking blocks NCCL rendezvous
- All NCCL environment variables (see Step 5)
- `RAY_memory_monitor_refresh_ms=0`
- ptxas symlink in entrypoint (see note below)
- `pip install ray[default]` in entrypoint
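The worker's retry loop can be sketched as a small shell function. This is one way to write it, not the only one; `retry` is my name, and the `ray start --address=… --block` invocation is the one from the table above.

```shell
# Worker entrypoint sketch: keep retrying until the head's Ray service
# accepts the connection; --block then keeps the container in the foreground.
retry() {
  # retry <max-tries> <delay-seconds> <command...>
  local max=$1 delay=$2
  shift 2
  local attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$delay"
  done
}

# In the worker container (address is node-a's direct-link IP):
#   retry 60 5 ray start --address=192.168.100.10:6379 --block
```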
ptxas symlink: Triton’s JIT backend looks for ptxas in a path that doesn’t exist in the NGC container on SM121. Add this to the entrypoint of both containers:
mkdir -p /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin
ln -sf /usr/local/cuda/bin/ptxas \
/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
Without it, Triton JIT compilation fails at the first inference call.
Step 5: Critical NCCL environment variables
Missing any of these causes silent hangs, TCP fallback, or deadlocks — no useful error message.
NCCL_IB_HCA=<roce-ch1>,<roce-ch2> # pin NCCL to RoCE interfaces — required, or NCCL deadlocks silently
NCCL_IB_DISABLE=0 # keep IB/RoCE path enabled
NCCL_IB_GID_INDEX=3 # RoCEv2+IPv4 (confirmed via show_gids)
NCCL_NET_PLUGIN=none # prevents AWS OFI plugin from falling back to TCP
NCCL_IB_ROCE_VERSION_NUM=2
NCCL_IB_TIMEOUT=22
NCCL_DEBUG=WARN
RAY_memory_monitor_refresh_ms=0 # prevents Ray killing vLLM on unified-memory systems
GLOO_SOCKET_IFNAME=<ifname> # Ray gloo rendezvous — must point to the direct link interface
VLLM_HOST_IP=<per-node IP> # node-a: 192.168.100.10 / node-b: 192.168.100.11
NCCL_IB_HCA is the most common source of silent failure. Without it, NCCL on SM121 can deadlock with no output.
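To confirm the fallback did not happen, temporarily raise NCCL_DEBUG to INFO and inspect the transport-selection lines in the logs: NCCL reports `NET/IB` when the RoCE path is in use and `NET/Socket` when it has fallen back to TCP (at WARN, these lines are not printed). The `nccl_transport` helper is my sketch for classifying that output.

```shell
# Classify which network transport NCCL selected from INFO-level logs.
nccl_transport() {
  awk '/NET\/IB/ { ib = 1 } END { print (ib ? "IB" : "TCP-or-unknown") }'
}

# Usage (container name as in Step 6):
#   docker logs vllm 2>&1 | nccl_transport
```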
Step 6: Start the cluster
Start the worker node first — its entrypoint retries until the head node’s Ray service is ready:
ssh user@node-b "cd ~/vllm-compose && docker compose up -d"
ssh user@node-a "cd ~/vllm-compose && docker compose up -d"
Watch weight loading (both nodes load in parallel, ~7 min):
ssh user@node-a 'docker logs -f vllm 2>&1 | grep -E "safetensors|Uvicorn|ERROR"'
After ~2 minutes, verify Ray cluster formed — should show 2 nodes, 2 GPUs:
ssh user@node-a 'docker exec vllm ray status'
Poll until the API is ready (~15 min total):
until curl -sf http://node-a:8000/health; do sleep 20; done && echo "serving"
Step 7: Smoke test
curl http://node-a:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-235b",
"messages": [{"role": "user", "content": "reply with the word: working /no_think"}],
"max_tokens": 50
}'
Expected: "content": "working", "finish_reason": "stop".
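For scripted verification without jq, a crude grep over the saved response is enough to assert both expectations at once. `smoke_ok` is an illustrative helper of mine; the patterns tolerate optional whitespace after the JSON colons.

```shell
# Assert the completion finished cleanly and produced non-null content.
smoke_ok() {
  grep -q '"finish_reason": *"stop"' "$1" && ! grep -q '"content": *null' "$1"
}

# Usage:
#   curl -s http://node-a:8000/v1/chat/completions ... > /tmp/resp.json
#   smoke_ok /tmp/resp.json && echo PASS || echo FAIL
```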
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Ray status shows 1 node | Worker not connected yet | Wait 2–3 min; check docker logs on worker |
| vLLM hangs at weight loading | NCCL rendezvous failure | Verify NCCL_IB_HCA is set and interfaces are UP |
| IndexError in fused_moe | Wrong quantization format | Use --quantization=gptq_marlin, not mxfp4 |
| content: null in smoke test | max_tokens too low | Use max_tokens ≥ 50 with /no_think |
| OOM at startup | Page cache pressure | sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' |
| rsync permission denied | Cache files written as root by container | Run the Alpine chown fix (see Step 3) |
| NCCL falls back to TCP | OFI plugin or missing HCA env var | Set NCCL_NET_PLUGIN=none and NCCL_IB_HCA |
| Triton JIT fails at first inference | ptxas not found | Add ptxas symlink to entrypoint (see Step 4) |