Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory — enough in theory, but once CUDA overhead and KV cache are factored in, it’s tight. Running it across two Sparks with TP=2 gives headroom for real workloads.

Each DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. Here is what the setup looks like, and what silently fails if it is not configured correctly.

For hardware background and kernel compatibility, see the DGX Spark GB10 hardware reference.


Prerequisites

  • Two DGX Sparks on the same LAN, reachable by hostname
  • NVIDIA driver 580.x on both — do not use 590.x (CUDAGraph deadlock on GB10)
  • Docker CE + NVIDIA Container Toolkit installed and working on both
  • NGC container image pulled on both: nvcr.io/nvidia/vllm:26.04-py3
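
A quick sanity check for the driver and container toolkit prerequisites on each node (a minimal sketch; the only assumption is that the image tag matches the one pulled above):

# driver major version should report 580.x on both nodes
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# confirm the Container Toolkit exposes the GPU inside the NGC image
docker run --rm --gpus all nvcr.io/nvidia/vllm:26.04-py3 nvidia-smi -L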


Step 1: Direct link and IP configuration

Connect the two Sparks with a QSFP-DD DAC cable. A 400G passive copper cable presents as two independent 200 Gb/s interfaces on each machine. Assign static IPs on a dedicated subnet with jumbo MTU.

On node-a:

sudo nmcli con mod <qsfp-ch1> ipv4.addresses 192.168.100.10/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con mod <qsfp-ch2> ipv4.addresses 192.168.100.12/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con up <qsfp-ch1>
sudo nmcli con up <qsfp-ch2>

On node-b:

sudo nmcli con mod <qsfp-ch1> ipv4.addresses 192.168.100.11/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con mod <qsfp-ch2> ipv4.addresses 192.168.100.13/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con up <qsfp-ch1>
sudo nmcli con up <qsfp-ch2>

Use nmcli only — not Netplan. Netplan with the networkd renderer silently does nothing on DGX Spark (no /run/systemd/network/).

Verify (from node-a):

ping -c 4 192.168.100.11   # ch1
ping -c 4 192.168.100.13   # ch2

Expect 0% loss, <1.5 ms RTT on both channels.
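
To confirm jumbo frames actually pass end to end (not just that the MTU is set locally), send don't-fragment pings at the largest payload that fits a 9000-byte MTU; a quick check from node-a:

# 8972 = 9000 minus 20-byte IP header and 8-byte ICMP header
ping -c 4 -M do -s 8972 192.168.100.11   # ch1
ping -c 4 -M do -s 8972 192.168.100.13   # ch2
# "message too long" means jumbo MTU is not active on one side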


Step 2: RoCE verification

Confirm RDMA/RoCE is active before proceeding. If NCCL falls back to TCP, you will get roughly 1/10th the bandwidth with no error message.

Install perftest if needed:

sudo apt install -y perftest

On node-b (server):

ib_write_bw -d <rdma-device> --report_gbits   # RDMA device name from ibv_devices, not the netdev name

On node-a (client):

ib_write_bw -d <rdma-device> 192.168.100.11 --report_gbits

Expected: 13+ Gb/s. The output header should show Link type: Ethernet and a GID-based address — confirming RoCE, not TCP.

Also confirm GID index 3 is RoCEv2+IPv4 on both machines:

show_gids | grep <interface-name>
# index 3 should show an IPv4 address and RoCEv2
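
If show_gids is not available, the same information can be read from sysfs. A sketch, assuming the RDMA device name reported by ibv_devices:

cat /sys/class/infiniband/<rdma-device>/ports/1/gids/3            # IPv4-mapped GID
cat /sys/class/infiniband/<rdma-device>/ports/1/gid_attrs/types/3 # expect "RoCE v2"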

Step 3: Download and sync the model

Download on one node:

docker run --rm \
  -v /home/$USER/vllm-cache:/root/.cache/huggingface \
  -e HF_HOME=/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:26.04-py3 \
  huggingface-cli download Qwen/Qwen3-235B-A22B-GPTQ-Int4

~118 GB across 32 shards. At ~75 MB/s on a typical connection, expect 25–30 minutes.
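
Before fixing ownership and syncing, it is worth confirming every shard landed. A quick check, assuming the default Hugging Face cache layout used above:

# the snapshot should contain 32 .safetensors shards
find /home/$USER/vllm-cache/hub/models--Qwen--Qwen3-235B-A22B-GPTQ-Int4 \
  -name '*.safetensors' | wc -l
du -sh /home/$USER/vllm-cache/hub/models--Qwen--Qwen3-235B-A22B-GPTQ-Int4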

The container writes model files as root. Fix ownership before syncing:

MYUID=$(id -u); MYGID=$(id -g)
docker run --rm \
  -v /home/$USER/vllm-cache/hub/models--Qwen--Qwen3-235B-A22B-GPTQ-Int4:/target \
  alpine sh -c "chown -R ${MYUID}:${MYGID} /target"

Then sync to the second node over the QSFP-DD link (~500 MB/s, ~4 minutes). Run this on the node that does not yet have the model, pulling from the node that does (here 192.168.100.11, node-b's ch1 address):

MODEL=models--Qwen--Qwen3-235B-A22B-GPTQ-Int4
SRC=$USER@192.168.100.11:/home/$USER/vllm-cache/hub/${MODEL}
DST=/home/$USER/vllm-cache/hub/${MODEL}

mkdir -p ${DST}/blobs ${DST}/snapshots ${DST}/refs
rsync -av --progress ${SRC}/blobs/     ${DST}/blobs/
rsync -av            ${SRC}/snapshots/ ${DST}/snapshots/
rsync -av            ${SRC}/refs/      ${DST}/refs/

The QSFP-DD link makes model sync fast enough to be practical. Over a standard gigabit LAN, the same transfer takes ~20 minutes.
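
As a final check that the two caches match, compare blob counts and sizes on both nodes (reusing the MODEL variable from above); the numbers should be identical:

find /home/$USER/vllm-cache/hub/${MODEL}/blobs -type f | wc -l
du -sh /home/$USER/vllm-cache/hub/${MODEL}/blobs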


Step 4: Docker Compose structure

Two compose files — one per node. The key differences:

| Setting      | node-a (head)                     | node-b (worker)                                             |
|--------------|-----------------------------------|-------------------------------------------------------------|
| VLLM_HOST_IP | 192.168.100.10                    | 192.168.100.11                                              |
| Entrypoint   | ray start --head, then vllm serve | retry loop: ray start --address=192.168.100.10:6379 --block |
| Healthcheck  | curl /health                      | ray status                                                  |

Both files share:

  • network_mode: host — required; bridge networking blocks NCCL rendezvous
  • All NCCL environment variables (see Step 5)
  • RAY_memory_monitor_refresh_ms=0
  • ptxas symlink in entrypoint (see note below)
  • pip install ray[default] in entrypoint

ptxas symlink: Triton’s JIT backend looks for ptxas in a path that doesn’t exist in the NGC container on SM121. Add this to the entrypoint of both containers:

mkdir -p /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin
ln -sf /usr/local/cuda/bin/ptxas \
  /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas

Without it, Triton JIT compilation fails at the first inference call.
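
For concreteness, here is a rough skeleton of the node-a (head) compose file, written as a heredoc so it can be pasted on the node. This is a hypothetical sketch, not the exact file: the cache path, model flags (--tensor-parallel-size, --quantization, --served-model-name), port, and healthcheck timings are illustrative assumptions, and the full NCCL block from Step 5 still has to be added under environment. node-b swaps the entrypoint command and healthcheck per the table above.

cat > docker-compose.yml <<'EOF'
# Hypothetical node-a (head) sketch; adjust paths, flags, and env to your setup.
services:
  vllm:
    image: nvcr.io/nvidia/vllm:26.04-py3
    container_name: vllm
    network_mode: host            # required; bridge networking blocks NCCL rendezvous
    ipc: host
    runtime: nvidia               # or however your Container Toolkit exposes the GPU
    volumes:
      - /home/youruser/vllm-cache:/root/.cache/huggingface   # cache path from Step 3
    environment:
      HF_HOME: /root/.cache/huggingface
      VLLM_HOST_IP: "192.168.100.10"
      RAY_memory_monitor_refresh_ms: "0"
      # ...plus the full NCCL variable block from Step 5
    entrypoint: /bin/bash
    command:
      - -lc
      - |
        pip install 'ray[default]'
        mkdir -p /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin
        ln -sf /usr/local/cuda/bin/ptxas \
          /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
        ray start --head --port=6379
        exec vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
          --tensor-parallel-size 2 \
          --quantization gptq_marlin \
          --served-model-name qwen3-235b \
          --host 0.0.0.0 --port 8000
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8000/health"]
      interval: 30s
      start_period: 20m
      retries: 10
EOF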


Step 5: Critical NCCL environment variables

Missing any of these causes silent hangs, TCP fallback, or deadlocks — no useful error message.

NCCL_IB_HCA=<roce-ch1>,<roce-ch2>  # pin NCCL to RoCE interfaces — required, or NCCL deadlocks silently
NCCL_IB_DISABLE=0                   # keep IB/RoCE path enabled
NCCL_IB_GID_INDEX=3                 # RoCEv2+IPv4 (confirmed via show_gids)
NCCL_NET_PLUGIN=none                 # prevents AWS OFI plugin from falling back to TCP
NCCL_IB_ROCE_VERSION_NUM=2
NCCL_IB_TIMEOUT=22
NCCL_DEBUG=WARN
RAY_memory_monitor_refresh_ms=0     # prevents Ray killing vLLM on unified-memory systems
GLOO_SOCKET_IFNAME=<ifname>         # Ray gloo rendezvous — must point to the direct link interface
VLLM_HOST_IP=<per-node IP>          # node-a: 192.168.100.10 / node-b: 192.168.100.11

NCCL_IB_HCA is the most common source of silent failure. Without it, NCCL on SM121 can deadlock with no output.
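
Because failures here are silent, it is worth confirming the variables actually made it into the running containers (once they are up in Step 6) before debugging anything deeper:

docker exec vllm env | grep -E 'NCCL_|GLOO_SOCKET_IFNAME|VLLM_HOST_IP|RAY_memory'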


Step 6: Start the cluster

Start the worker node first — its entrypoint retries until the head node’s Ray service is ready:

ssh user@node-b "cd ~/vllm-compose && docker compose up -d"
ssh user@node-a "cd ~/vllm-compose && docker compose up -d"

Watch weight loading (both nodes load in parallel, ~7 min):

ssh user@node-a 'docker logs -f vllm 2>&1 | grep -E "safetensors|Uvicorn|ERROR"'

After ~2 minutes, verify the Ray cluster has formed — it should show 2 nodes and 2 GPUs:

ssh user@node-a 'docker exec vllm ray status'

Poll until the API is ready (~15 min total):

until curl -sf http://node-a:8000/health; do sleep 20; done && echo "serving"

Step 7: Smoke test

curl http://node-a:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-235b",
    "messages": [{"role": "user", "content": "reply with the word: working /no_think"}],
    "max_tokens": 50
  }'

Expected: "content": "working", "finish_reason": "stop".


Troubleshooting

| Symptom                             | Likely cause                              | Fix                                                 |
|-------------------------------------|-------------------------------------------|-----------------------------------------------------|
| Ray status shows 1 node             | Worker not connected yet                  | Wait 2–3 min; check docker logs on worker           |
| vLLM hangs at weight loading        | NCCL rendezvous failure                   | Verify NCCL_IB_HCA is set and interfaces are UP     |
| IndexError in fused_moe             | Wrong quantization format                 | Use --quantization=gptq_marlin, not mxfp4           |
| content: null in smoke test         | max_tokens too low                        | Use max_tokens ≥ 50 with /no_think                  |
| OOM at startup                      | Page cache pressure                       | sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' |
| rsync permission denied             | Cache files written as root by container  | Run the Alpine chown fix (see Step 3)               |
| NCCL falls back to TCP              | OFI plugin or missing HCA env var         | Set NCCL_NET_PLUGIN=none and NCCL_IB_HCA            |
| Triton JIT fails at first inference | ptxas not found                           | Add ptxas symlink to entrypoint (see Step 4)        |