Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory — enough in theory, but once CUDA overhead and KV cache are factored in, it’s tight. Running it across two Sparks with TP=2 gives headroom for real workloads.
Each DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly.
For hardware background and kernel compatibility, see the DGX Spark GB10 hardware reference.
Prerequisites
- Two DGX Sparks on the same LAN, reachable by hostname
- NVIDIA driver 580.x on both — do not use 590.x (CUDAGraph deadlock on GB10)
- Docker CE + NVIDIA Container Toolkit installed and working on both
- NGC container image pulled on both:
nvcr.io/nvidia/vllm:26.04-py3
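A small preflight check can catch a wrong driver branch before anything else is debugged. This is a sketch, not part of the official tooling: the `check_driver` helper and the `node-a`/`node-b` hostnames are illustrative; `nvidia-smi --query-gpu=driver_version` is a standard query.

```shell
# Preflight sketch: fail fast if a node is not on the required 580.x branch.
# check_driver takes a version string so the policy is testable in isolation.
check_driver() {
  case "$1" in
    580.*) return 0 ;;   # required branch
    *)     return 1 ;;   # anything else (incl. 590.x) is rejected
  esac
}

# Run against each node (hostnames are illustrative):
#   for h in node-a node-b; do
#     v=$(ssh "$h" nvidia-smi --query-gpu=driver_version --format=csv,noheader)
#     check_driver "$v" || echo "WARN: $h is on driver $v, expected 580.x"
#   done
```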
Step 1: Direct link — static IPs and MTU
Connect the two Sparks with a QSFP-DD DAC cable. A 400G passive copper cable presents as two independent 200 Gb/s interfaces on each machine. Assign static IPs on a dedicated subnet with jumbo MTU.
On node-a:
sudo nmcli con mod <qsfp-ch1> ipv4.addresses 192.168.100.10/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con mod <qsfp-ch2> ipv4.addresses 192.168.100.12/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con up <qsfp-ch1>
sudo nmcli con up <qsfp-ch2>
On node-b:
sudo nmcli con mod <qsfp-ch1> ipv4.addresses 192.168.100.11/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con mod <qsfp-ch2> ipv4.addresses 192.168.100.13/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con up <qsfp-ch1>
sudo nmcli con up <qsfp-ch2>
Use nmcli only — not Netplan. Netplan with the networkd renderer silently does nothing on DGX Spark (no /run/systemd/network/).
Verify (from node-a):
ping -c 4 192.168.100.11 # ch1
ping -c 4 192.168.100.13 # ch2
Expect 0% loss, <1.5 ms RTT on both channels.
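A plain ping succeeds even if the MTU silently fell back to 1500, so it does not prove jumbo frames work. A stricter check sends a maximum-size packet with fragmentation disabled (`ping -M do` on Linux iputils); the payload arithmetic below accounts for the 20-byte IP header and 8-byte ICMP header.

```shell
# Jumbo-frame check: largest unfragmentable ICMP payload for MTU 9000.
MTU=9000
PAYLOAD=$((MTU - 28))   # 28 = 20-byte IP header + 8-byte ICMP header

# Run this on node-a; if it fails with "message too long", jumbo MTU is
# not active end-to-end on the link.
echo "ping -M do -s ${PAYLOAD} -c 4 192.168.100.11"
```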
Step 2: RoCE verification
Confirm RDMA/RoCE is active before proceeding. If NCCL falls back to TCP, you will get roughly 1/10th the bandwidth with no error message.
Install perftest if needed:
sudo apt install -y perftest
On node-b (server):
ib_write_bw -d <roce-interface> --report_gbits
On node-a (client):
ib_write_bw -d <roce-interface> 192.168.100.11 --report_gbits
Expected: 13+ Gb/s. The output header should show Link type: Ethernet and a GID-based address — confirming RoCE, not TCP.
Also confirm GID index 3 is RoCEv2+IPv4 on both machines:
show_gids | grep <interface-name>
# index 3 should show an IPv4 address and RoCEv2
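For scripted setups, the GID check above can be automated. The helper below is a sketch of mine (not part of perftest): it assumes `show_gids`' usual column layout (device, port, index, GID, IPv4, version, netdev) and asserts that index 3 carries an IPv4 address and RoCE v2.

```shell
# Assert GID index 3 is RoCEv2 with an IPv4 address.
# Column layout assumed: dev port index gid ipv4 ver netdev
gid3_is_rocev2_ipv4() {
  awk '$3 == 3 && $6 == "v2" && $5 ~ /^[0-9]+\./ { found = 1 }
       END { exit !found }'
}

# Usage on a live node:
#   show_gids | grep <interface-name> | gid3_is_rocev2_ipv4 \
#     || echo "GID 3 is not RoCEv2/IPv4 — fix before setting NCCL_IB_GID_INDEX=3"
```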
Step 3: Download and sync the model
Download on one node (node-b in this example):
docker run --rm \
-v /home/$USER/vllm-cache:/root/.cache/huggingface \
-e HF_HOME=/root/.cache/huggingface \
nvcr.io/nvidia/vllm:26.04-py3 \
huggingface-cli download Qwen/Qwen3-235B-A22B-GPTQ-Int4
~118 GB across 32 shards. At ~75 MB/s on a typical connection, expect 25–30 minutes.
The container writes model files as root. Fix ownership before syncing:
MYUID=$(id -u); MYGID=$(id -g)
docker run --rm \
-v /home/$USER/vllm-cache/hub/models--Qwen--Qwen3-235B-A22B-GPTQ-Int4:/target \
alpine sh -c "chown -R ${MYUID}:${MYGID} /target"
Then sync to the other node over the QSFP-DD link (~500 MB/s, ~4 minutes). Run these on the receiving node; SRC points at the downloading node's direct-link IP (192.168.100.11, i.e. node-b):
MODEL=models--Qwen--Qwen3-235B-A22B-GPTQ-Int4
SRC=$USER@192.168.100.11:/home/$USER/vllm-cache/hub/${MODEL}
DST=/home/$USER/vllm-cache/hub/${MODEL}
mkdir -p ${DST}/blobs ${DST}/snapshots ${DST}/refs
rsync -av --progress ${SRC}/blobs/ ${DST}/blobs/
rsync -av ${SRC}/snapshots/ ${DST}/snapshots/
rsync -av ${SRC}/refs/ ${DST}/refs/
The QSFP-DD link makes model sync fast enough to be practical. Over a standard gigabit LAN, the same transfer takes ~20 minutes.
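After syncing, it is worth confirming both nodes hold byte-identical weights, since a truncated shard fails only at load time. The `tree_sum` helper below is my sketch: it produces one deterministic digest per directory tree, so you can run it on each node and compare a single line.

```shell
# Deterministic digest of all file contents + relative names under a dir.
# Identical trees produce identical digests; any differing shard changes it.
tree_sum() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) \
    | sha256sum | cut -d' ' -f1
}

# Run on both nodes and compare the output:
#   tree_sum /home/$USER/vllm-cache/hub/${MODEL}/blobs
```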
Step 4: Docker Compose structure
Two compose files — one per node. The key differences:
| Setting | node-a (head) | node-b (worker) |
|---|---|---|
| VLLM_HOST_IP | 192.168.100.10 | 192.168.100.11 |
| Entrypoint | ray start --head then vllm serve | retry loop: ray start --address=192.168.100.10:6379 --block |
| Healthcheck | curl /health | ray status |
Both files share:
- `network_mode: host` — required; bridge networking blocks NCCL rendezvous
- All NCCL environment variables (see Step 5)
- `RAY_memory_monitor_refresh_ms=0`
- ptxas symlink in entrypoint (see note below)
- `pip install ray[default]` in entrypoint
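The worker's retry loop can be sketched as a small shell function. This is one way to write it, not the only one; `retry` is my name, and the `ray start --address=… --block` invocation is the one from the table above.

```shell
# Worker entrypoint sketch: keep retrying until the head's Ray service
# accepts the connection; --block then keeps the container in the foreground.
retry() {
  # retry <max-tries> <delay-seconds> <command...>
  local max=$1 delay=$2
  shift 2
  local attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$delay"
  done
}

# In the worker container (address is node-a's direct-link IP):
#   retry 60 5 ray start --address=192.168.100.10:6379 --block
```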
ptxas symlink: Triton’s JIT backend looks for ptxas in a path that doesn’t exist in the NGC container on SM121. Add this to the entrypoint of both containers:
mkdir -p /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin
ln -sf /usr/local/cuda/bin/ptxas \
/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
Without it, Triton JIT compilation fails at the first inference call.
Step 5: Critical NCCL environment variables
Missing any of these causes silent hangs, TCP fallback, or deadlocks — no useful error message.
NCCL_IB_HCA=<roce-ch1>,<roce-ch2> # pin NCCL to RoCE interfaces — required, or NCCL deadlocks silently
NCCL_IB_DISABLE=0 # keep IB/RoCE path enabled
NCCL_IB_GID_INDEX=3 # RoCEv2+IPv4 (confirmed via show_gids)
NCCL_NET_PLUGIN=none # prevents AWS OFI plugin from falling back to TCP
NCCL_IB_ROCE_VERSION_NUM=2
NCCL_IB_TIMEOUT=22
NCCL_DEBUG=WARN
RAY_memory_monitor_refresh_ms=0 # prevents Ray killing vLLM on unified-memory systems
GLOO_SOCKET_IFNAME=<ifname> # Ray gloo rendezvous — must point to the direct link interface
VLLM_HOST_IP=<per-node IP> # node-a: 192.168.100.10 / node-b: 192.168.100.11
NCCL_IB_HCA is the most common source of silent failure. Without it, NCCL on SM121 can deadlock with no output.
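To confirm the fallback did not happen, temporarily raise NCCL_DEBUG to INFO and inspect the transport-selection lines in the logs: NCCL reports `NET/IB` when the RoCE path is in use and `NET/Socket` when it has fallen back to TCP (at WARN, these lines are not printed). The `nccl_transport` helper is my sketch for classifying that output.

```shell
# Classify which network transport NCCL selected from INFO-level logs.
nccl_transport() {
  awk '/NET\/IB/ { ib = 1 } END { print (ib ? "IB" : "TCP-or-unknown") }'
}

# Usage (container name as in Step 6):
#   docker logs vllm 2>&1 | nccl_transport
```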
Step 6: Start the cluster
Start the worker node first — its entrypoint retries until the head node’s Ray service is ready:
ssh user@node-b "cd ~/vllm-compose && docker compose up -d"
ssh user@node-a "cd ~/vllm-compose && docker compose up -d"
Watch weight loading (both nodes load in parallel, ~7 min):
ssh user@node-a 'docker logs -f vllm 2>&1 | grep -E "safetensors|Uvicorn|ERROR"'
After ~2 minutes, verify Ray cluster formed — should show 2 nodes, 2 GPUs:
ssh user@node-a 'docker exec vllm ray status'
Poll until the API is ready (~15 min total):
until curl -sf http://node-a:8000/health; do sleep 20; done && echo "serving"
Step 7: Smoke test
curl http://node-a:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-235b",
"messages": [{"role": "user", "content": "reply with the word: working /no_think"}],
"max_tokens": 50
}'
Expected: "content": "working", "finish_reason": "stop".
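For scripted verification without jq, a crude grep over the saved response is enough to assert both expectations at once. `smoke_ok` is an illustrative helper of mine; the patterns tolerate optional whitespace after the JSON colons.

```shell
# Assert the completion finished cleanly and produced non-null content.
smoke_ok() {
  grep -q '"finish_reason": *"stop"' "$1" && ! grep -q '"content": *null' "$1"
}

# Usage:
#   curl -s http://node-a:8000/v1/chat/completions ... > /tmp/resp.json
#   smoke_ok /tmp/resp.json && echo PASS || echo FAIL
```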
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Ray status shows 1 node | Worker not connected yet | Wait 2–3 min; check docker logs on worker |
| vLLM hangs at weight loading | NCCL rendezvous failure | Verify NCCL_IB_HCA is set and interfaces are UP |
| IndexError in fused_moe | Wrong quantization format | Use --quantization=gptq_marlin, not mxfp4 |
| content: null in smoke test | max_tokens too low | Use max_tokens ≥ 50 with /no_think |
| OOM at startup | Page cache pressure | sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' |
| rsync permission denied | Cache files written as root by container | Run the Alpine chown fix (see Step 3) |
| NCCL falls back to TCP | OFI plugin or missing HCA env var | Set NCCL_NET_PLUGIN=none and NCCL_IB_HCA |
| Triton JIT fails at first inference | ptxas not found | Add ptxas symlink to entrypoint (see Step 4) |