Building a Two-Node Ray Cluster for Distributed LLM Inference on DGX Spark

Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory — enough in theory, but once CUDA overhead and KV cache are factored in, it’s tight. Running it across two Sparks with TP=2 gives headroom for real workloads. Each DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly. ...
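
Below is a minimal sketch of what the two-node launch reduces to, using vLLM's Python API with the Ray executor backend. The interface name, checkpoint id, and memory fraction are placeholders for illustration, not our exact values.

```python
# Sketch: both Sparks have already joined one Ray cluster
# (e.g. `ray start --head` on node A, `ray start --address=<A>:6379` on node B).
import os

# NCCL and Gloo must use the direct interconnect, not the management NIC.
# Pointing these at the wrong interface is the classic "hangs at startup" failure.
os.environ["NCCL_SOCKET_IFNAME"] = "enp1s0f0"   # placeholder: your direct link
os.environ["GLOO_SOCKET_IFNAME"] = "enp1s0f0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-GPTQ-Int4",  # placeholder checkpoint id
    tensor_parallel_size=2,                   # one logical GPU per Spark
    distributed_executor_backend="ray",       # shard the model across both nodes
    gpu_memory_utilization=0.90,              # leave headroom in 128 GB unified memory
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```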

May 14, 2026 · 5 min · Conselara Labs

DGX Spark GB10 Hardware Reference: SM121 Architecture, Memory, and Networking

Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here — some actively break things.

SM121 is not datacenter Blackwell

| Feature | DGX Spark (GB10 / SM121) | Datacenter Blackwell (B100/B200) |
|---|---|---|
| TMEM | No | Yes |
| WGMMA | No | Yes |
| DSMEM | No | Yes |
| NVSwitch | No | Yes |
| CUTLASS FP4 | Broken — silent garbage output | Supported |
| Memory type | Unified LPDDR5X (shared CPU+GPU) | HBM3e (GPU-only) |
| Memory per unit | 128 GB | 192 GB |
| GPUs per unit | 1 logical GPU | 1 GPU |

When you see forum recommendations or vLLM flags that say “for Blackwell” — verify they’re for SM121 specifically before using them. ...
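
One practical guard: check the reported compute capability before trusting any “for Blackwell” advice. This sketch assumes PyTorch reports the GB10 as compute capability (12, 1), i.e. SM121; datacenter Blackwell reports differently.

```python
# Minimal guard before applying any "for Blackwell" flag or kernel path.
# Assumption: PyTorch reports the GB10 as compute capability (12, 1) for SM121.
import torch

major, minor = torch.cuda.get_device_capability(0)
props = torch.cuda.get_device_properties(0)

print(f"{props.name}: SM{major}{minor}, {props.total_memory / 1e9:.0f} GB visible")

if (major, minor) != (12, 1):
    raise RuntimeError("Not SM121 — datacenter-Blackwell advice may apply instead")
```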

May 14, 2026 · 4 min · Conselara Labs

vLLM on DGX Spark: What the SM121 Architecture Actually Requires

The DGX Spark GB10 runs SM121 — the Grace Blackwell Superchip. It is not the same silicon as datacenter Blackwell (SM100, B100/B200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121. This is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks — single-node and two-node cluster configurations. ...

May 13, 2026 · 5 min · Conselara Labs

We Replaced an MCP Server with FastAPI and It Worked Everywhere

We built an internal knowledge base server to give our AI agents access to Conselara’s company data — capabilities, past performance, GSA rates, certifications. The idea was straightforward: expose it as an MCP server so any AI client could query it semantically. It worked in Claude Code. It worked nowhere else.

What MCP promises

The Model Context Protocol is Anthropic’s open standard for connecting AI models to external tools and data sources. The pitch is compelling: define your server once, and any MCP-compatible client can call it. Claude Code has native MCP support. The ecosystem is growing. ...
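
For contrast, this is roughly the shape we ended up with — one plain FastAPI endpoint any HTTP client can hit. Route, schema names, and the search stub are illustrative, not the actual internal API.

```python
# Illustrative sketch of the FastAPI replacement: a plain-HTTP endpoint
# that any client (agent, curl, browser) can call. Run with: uvicorn server:app
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    top_k: int = 5

def search_embeddings(question: str, limit: int) -> list[dict]:
    """Hypothetical stand-in for the real semantic search over company data."""
    return [{"text": f"stub result for: {question}", "score": 1.0}][:limit]

@app.post("/query")
def query_knowledge_base(req: QueryRequest) -> dict:
    # Semantic search over capabilities, past performance, GSA rates, certifications.
    return {"matches": search_embeddings(req.question, limit=req.top_k)}
```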

May 12, 2026 · 4 min · Conselara Labs

AI Across a Health Research Information Platform

We are integrating AI across several workstreams on a federal health research information platform. Publication discovery — using LLMs to surface relevant PubMed research, reducing manual literature review time and improving coverage across a high-volume publication landscape. LLM comparative evaluations — running structured benchmarks across models to assess quality, consistency, and cost for specific content tasks on the platform. Evaluations are task-specific rather than general — we score against real outputs the platform needs to produce. ...

May 9, 2026 · 1 min · Conselara Labs

DGX Spark Model Comparison: What Fits and What Runs (SM121, 128 GB)

Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). Based on tested configurations and community results as of May 2026.

| Model | Architecture | Quantization | Memory | Fits? | Expected tok/s | SM121 notes |
|---|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 | ~35 GB | ✅ easily | 100+ | Pure MoE, no GDN — fully supported |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 | ~28 GB | ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; experimental fork needed for full speed |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 | ~16–60 GB | ✅ easily | 32–50 | Solid single-node option; no GDN |
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 | ~61 GB | ✅ | 32–60 | 128K context; proprietary quant format |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 only | ~75 GB | ✅ | up to 51 | BF16 is 234 GB — does not fit; NVFP4 is the only path |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 | ~60 GB/node | ✅ (two nodes) | 17–36 agg | Requires two DGX Sparks; best quality available |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 (TP=2) | n/a | ✅ (two nodes) | Unknown | SM121 MoE kernel not yet optimized; not recommended |

Key observations

Throughput vs quality tradeoff at single-node: Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) with pure MoE architecture. Qwen3.5-122B-A10B gives the most capable model (10B active parameters) that fits on one node, at 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation — so 51 tok/s is more than sufficient. ...
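
The fit column reduces to simple arithmetic: quantized weights plus KV cache plus runtime overhead against the 128 GB ceiling. A sketch, with assumed (not measured) overhead numbers:

```python
# Back-of-envelope fit check behind the table above.
# The KV-cache and overhead budgets are rough assumptions for illustration.
def fits_on_spark(weights_gb: float,
                  kv_cache_gb: float = 8.0,    # assumed KV-cache budget
                  overhead_gb: float = 10.0,   # assumed CUDA/runtime overhead
                  capacity_gb: float = 128.0) -> bool:
    return weights_gb + kv_cache_gb + overhead_gb <= capacity_gb

print(fits_on_spark(75.0))   # Qwen3.5-122B-A10B in NVFP4 -> True
print(fits_on_spark(234.0))  # same model in BF16 -> False
```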

May 9, 2026 · 2 min · Conselara Labs

vLLM Model Selection for DGX Spark (SM121)

The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don’t. This is a practical guide based on what we’ve tested in production.

The key constraint: SM121 kernel compatibility

Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:

Marlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4
CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use
GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed

Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121. ...
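
A pre-flight check along these lines can catch both failure modes before a long model download. The architecture-name substrings and the Marlin format list below are assumptions for illustration, not an authoritative mapping.

```python
# Sketch: inspect a HF config for GDN/hybrid layers and flag non-Marlin
# quant formats before pulling the model onto SM121.
from transformers import AutoConfig

MARLIN_OK = {"gptq", "gptq_marlin", "mxfp4"}       # assumed Marlin-served formats
GDN_MARKERS = ("gateddeltanet", "gdn", "mamba")    # assumed identifier substrings

def check_model_for_sm121(model_id: str, quantization: str) -> None:
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    archs = " ".join(getattr(config, "architectures", []) or []).lower()
    if any(marker in archs for marker in GDN_MARKERS):
        print(f"WARNING: {model_id} looks like a GDN/hybrid model — "
              "expect 14–21 tok/s on stock NGC vLLM")
    if quantization not in MARLIN_OK:
        print(f"WARNING: {quantization!r} may not route through Marlin; "
              "CUTLASS FP4 silently produces garbage on SM121")
```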

May 9, 2026 · 4 min · Conselara Labs

Running Qwen3.5-122B on a Single DGX Spark

The NVIDIA DGX Spark (GB10 SoC, 128 GB unified LPDDR5X memory) can run Qwen3.5-122B-A10B — a 122B-parameter MoE model — at usable throughput for production workloads. Here’s what it actually takes.

The key constraint: NVFP4 only

Qwen3.5-122B-A10B at full precision is ~250 GB. In NVFP4 quantization it’s ~75 GB, which fits comfortably in 128 GB unified memory. There is no other quantization path that both fits and runs correctly on the GB10. The only verified checkpoint we’ve found: bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 on HuggingFace, which includes 15 patches for the SM121 architecture. Use this; don’t try to quantize the base model yourself unless you’re prepared to debug SM121-specific kernel failures. ...
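
A minimal single-node launch sketch for that checkpoint. Everything except the model id is an illustrative default, not a tuned value; we assume vLLM picks up the NVFP4 quantization from the checkpoint config.

```python
# Single-node launch sketch for the verified NVFP4 checkpoint named above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4",  # the patched SM121 checkpoint
    gpu_memory_utilization=0.85,  # ~75 GB weights + KV cache within 128 GB unified memory
    max_model_len=32768,          # illustrative context cap, not a tuned value
)

out = llm.generate(["Summarize NVFP4 in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```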

May 5, 2026 · 3 min · Conselara Labs