Building a Two-Node Ray Cluster for Distributed LLM Inference on DGX Spark

Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory — enough in theory, but once CUDA overhead and KV cache are factored in, it’s tight. Running it across two Sparks with TP=2 gives headroom for real workloads. Each DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly. ...
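The headroom argument above can be sketched as back-of-envelope arithmetic. The weight and memory figures come from the post; the runtime overhead figure is an assumption for illustration, not a measurement:

```python
# Memory budget for Qwen3-235B-A22B-GPTQ-Int4 on DGX Spark (figures from the post).
WEIGHTS_GB = 118.0            # quantized model size
NODE_MEMORY_GB = 128.0        # unified memory per DGX Spark
ASSUMED_OVERHEAD_GB = 6.0     # CUDA context + framework buffers (assumed, illustrative)

def kv_cache_headroom(nodes: int) -> float:
    """Memory left per node for KV cache after weights (split evenly under TP) and overhead."""
    weights_per_node = WEIGHTS_GB / nodes
    return NODE_MEMORY_GB - weights_per_node - ASSUMED_OVERHEAD_GB

single = kv_cache_headroom(1)  # ~4 GB: almost nothing left for KV cache
dual = kv_cache_headroom(2)    # ~63 GB per node: real headroom for batching
```

Under TP=2 each node holds roughly half the weights (~59 GB), which is consistent with the per-node resident figure reported in the benchmark post.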

May 14, 2026 · 5 min · Conselara Labs

Deploying a Hugo Site to S3 + CloudFront: What Actually Bit Us

We migrated a Hugo static site from a self-hosted nginx container on a local server to S3 + CloudFront. The motivation was simple: a static site has no business running on a server we have to patch. The migration took a few hours and involved four gotchas that aren’t obvious from the AWS documentation. This is a record of what we did and what tripped us up.

The setup

- Hugo static site (PaperMod theme)
- S3 bucket with all public access blocked — Origin Access Control (OAC) only
- CloudFront distribution with ACM SSL cert
- Cloudflare DNS, gray cloud (DNS-only)
- Gitea self-hosted repo with a webhook-triggered deploy container on-prem

The deploy flow on push: Gitea fires a webhook → container on saturn pulls the repo, runs hugo --minify, syncs to S3, invalidates CloudFront. ...
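The deploy flow can be sketched as a small script. This is a minimal illustration, not the actual deploy container; the bucket name and distribution ID are hypothetical placeholders, and `dry_run=True` just prints the commands:

```python
# Sketch of the webhook-triggered deploy: build, sync, invalidate.
import subprocess

def deploy_steps(bucket: str, distribution_id: str) -> list[list[str]]:
    """Return the three deploy commands as argv lists."""
    return [
        ["hugo", "--minify"],
        ["aws", "s3", "sync", "public/", f"s3://{bucket}", "--delete"],
        ["aws", "cloudfront", "create-invalidation",
         "--distribution-id", distribution_id, "--paths", "/*"],
    ]

def deploy(bucket: str, distribution_id: str, dry_run: bool = True) -> None:
    for argv in deploy_steps(bucket, distribution_id):
        if dry_run:
            print(" ".join(argv))       # preview only
        else:
            subprocess.run(argv, check=True)  # fail fast on any step
```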

May 14, 2026 · 4 min · Conselara Labs

DGX Spark GB10 Hardware Reference: SM121 Architecture, Memory, and Networking

Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here — some actively break things.

SM121 is not datacenter Blackwell

| Feature | DGX Spark (GB10 / SM121) | Datacenter Blackwell (B100/B200) |
|---|---|---|
| TMEM | No | Yes |
| WGMMA | No | Yes |
| DSMEM | No | Yes |
| NVSwitch | No | Yes |
| CUTLASS FP4 | Broken — silent garbage output | Supported |
| Memory type | Unified LPDDR5X (shared CPU+GPU) | HBM3e (GPU-only) |
| Memory per unit | 128 GB | 192 GB |
| GPUs per unit | 1 logical GPU | 1 GPU |

When you see forum recommendations or vLLM flags that say “for Blackwell” — verify they’re for SM121 specifically before using them. ...
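The split above can be gated on CUDA compute capability at runtime. On real hardware you would query it with `torch.cuda.get_device_capability(0)`; the mapping below (SM121 → (12, 1), SM100-class datacenter Blackwell → major version 10) is our reading of NVIDIA's numbering, so verify on your own device:

```python
# Guard code paths by compute capability rather than by the "Blackwell" name.
def is_sm121(capability: tuple[int, int]) -> bool:
    """DGX Spark GB10 reports compute capability 12.1."""
    return capability == (12, 1)

def has_datacenter_blackwell_features(capability: tuple[int, int]) -> bool:
    """TMEM/WGMMA/DSMEM exist on SM100-class parts (B100/B200), not on SM121."""
    return capability[0] == 10
```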

May 14, 2026 · 4 min · Conselara Labs

vLLM on DGX Spark: What the SM121 Architecture Actually Requires

The DGX Spark GB10 runs SM121 — the Grace Blackwell Superchip. It is not the same silicon as datacenter Blackwell (SM100: B100/B200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121. This is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks — single-node and two-node cluster configurations. ...

May 13, 2026 · 5 min · Conselara Labs

We Replaced an MCP Server with FastAPI and It Worked Everywhere

We built an internal knowledge base server to give our AI agents access to Conselara’s company data — capabilities, past performance, GSA rates, certifications. The idea was straightforward: expose it as an MCP server so any AI client could query it semantically. It worked in Claude Code. It worked nowhere else.

What MCP promises

The Model Context Protocol is Anthropic’s open standard for connecting AI models to external tools and data sources. The pitch is compelling: define your server once, and any MCP-compatible client can call it. Claude Code has native MCP support. The ecosystem is growing. ...

May 12, 2026 · 4 min · Conselara Labs

AI Across a Health Research Information Platform

We are integrating AI across several workstreams on a federal health research information platform.

- Publication discovery — using LLMs to surface relevant PubMed research, reducing manual literature review time and improving coverage across a high-volume publication landscape.
- LLM comparative evaluations — running structured benchmarks across models to assess quality, consistency, and cost for specific content tasks on the platform. Evaluations are task-specific rather than general — we score against real outputs the platform needs to produce. ...

May 9, 2026 · 1 min · Conselara Labs

DGX Spark Benchmark Results: vLLM on SM121

Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted.

Qwen3-235B-A22B-GPTQ-Int4 — Two-node cluster

Date: 2026-05-03
Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87

| Batch | Avg completion tokens | tok/s per request | Aggregate tok/s |
|---|---|---|---|
| 1 (serial) | 256 | 17.0 | 17.0 |
| 2 (concurrent) | 256 | 12.1 | 24.1 |
| 4 (concurrent) | 256 | 9.1 | 36.4 |

Prefix cache: 97% delta hit rate on repeated system prompt. Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile). Weight resident per node: 57.64 GiB. ...
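The batching tradeoff in those numbers can be quantified as scaling efficiency — measured aggregate throughput divided by ideal linear scaling from the serial baseline. A small sketch using the post's figures:

```python
# Scaling efficiency: how close concurrent batching gets to ideal linear scaling.
def scaling_efficiency(batch: int, aggregate_toks: float, serial_toks: float = 17.0) -> float:
    """Fraction of ideal (serial_toks * batch) throughput actually achieved."""
    return round(aggregate_toks / (serial_toks * batch), 2)

# From the benchmark table above:
eff_2 = scaling_efficiency(2, 24.1)  # ~0.71 of ideal at batch 2
eff_4 = scaling_efficiency(4, 36.4)  # ~0.54 of ideal at batch 4
```

Aggregate throughput keeps rising with batch size even as per-request speed drops, which is the usual pattern when the interconnect and memory bandwidth are shared across concurrent requests.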

May 9, 2026 · 2 min · Conselara Labs

DGX Spark Model Comparison: What Fits and What Runs (SM121, 128 GB)

Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). Based on tested configurations and community results as of May 2026.

| Model | Architecture | Quantization | Memory | Expected tok/s | SM121 notes |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 | ~35 GB ✅ easily | 100+ | Pure MoE, no GDN — fully supported |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 | ~28 GB ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; experimental fork needed for full speed |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 | ~16–60 GB ✅ easily | 32–50 | Solid single-node option; no GDN |
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 | ~61 GB ✅ | 32–60 | 128K context; proprietary quant format |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 only | ~75 GB ✅ | up to 51 | BF16 is 234 GB — does not fit; NVFP4 is the only path |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 | ~60 GB/node ✅ (two nodes) | 17–36 agg | Requires two DGX Sparks; best quality available |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 (TP=2) | ✅ (two nodes) | Unknown | SM121 MoE kernel not yet optimized; not recommended |

Key observations

- Throughput vs quality tradeoff on a single node: Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) with a pure MoE architecture; Qwen3.5-122B-A10B is the most capable model (10B active parameters) that fits on one node, at 51 tok/s.
- For most agentic workloads the bottleneck is tool latency, not token generation — so 51 tok/s is more than sufficient. ...
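The fit column in the table reduces to a simple rule: the quantized footprint plus runtime overhead must stay under the 128 GB unified memory budget. A sketch, where the overhead figure is an assumption (CUDA context plus a KV cache floor), not a measurement:

```python
# Single-node fit rule behind the comparison table.
NODE_GB = 128.0
ASSUMED_OVERHEAD_GB = 8.0  # CUDA context + minimal KV cache (assumed, illustrative)

def fits_single_node(model_gb: float) -> bool:
    """True if the quantized model plus overhead fits in one Spark's unified memory."""
    return model_gb + ASSUMED_OVERHEAD_GB <= NODE_GB

# From the table's memory column:
# fits_single_node(75)  -> True   (Qwen3.5-122B-A10B at NVFP4)
# fits_single_node(234) -> False  (same model at BF16 needs two+ nodes, or won't run)
```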

May 9, 2026 · 2 min · Conselara Labs

Piloting AWS DevOps Guru and Amazon Q for AIOps

We are running a pilot of AWS DevOps Guru paired with Amazon Q across a federal AWS estate. DevOps Guru provides ML-driven anomaly detection and automated root cause analysis. Rather than relying on manually defined alert thresholds, it builds a baseline from operational data and flags deviations — reducing noise and surfacing issues that threshold-based alerting misses. Amazon Q brings generative AI into engineer troubleshooting workflows. When an anomaly is flagged, engineers can query Amazon Q directly for accelerated diagnosis — pulling in relevant runbooks, log context, and suggested remediation paths without switching tools. ...

May 9, 2026 · 1 min · Conselara Labs

vLLM Model Selection for DGX Spark (SM121)

The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don’t. This is a practical guide based on what we’ve tested in production.

The key constraint: SM121 kernel compatibility

Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:

- Marlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4
- CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use
- GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed

Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121. ...
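The selection rules above can be encoded as a small decision helper. This is our own illustrative encoding of the post's constraints — the names and categories are ours, not vLLM's API:

```python
# Illustrative SM121 model-selection check based on the kernel constraints above.
MARLIN_QUANTS = {"gptq_int4", "mxfp4"}   # stable via the Marlin kernel on SM121
BROKEN_QUANTS = {"cutlass_fp4"}          # silent garbage output on SM121

def sm121_verdict(quant: str, has_gdn: bool) -> str:
    """Rough go/no-go for a model on SM121 with the stock NGC container."""
    if quant in BROKEN_QUANTS:
        return "never use: silent garbage output"
    if has_gdn:
        return "slow on stock NGC: GDN kernel gap, needs experimental fork"
    if quant in MARLIN_QUANTS:
        return "ok: runs through the Marlin kernel"
    return "verify kernel support before deploying"
```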

May 9, 2026 · 4 min · Conselara Labs