[{"content":"Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory — enough in theory, but once CUDA overhead and KV cache are factored in, it\u0026rsquo;s tight. Running it across two Sparks with TP=2 gives headroom for real workloads.\nEach DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly.\nFor hardware background and kernel compatibility, see the DGX Spark GB10 hardware reference.\nPrerequisites Two DGX Sparks on the same LAN, reachable by hostname NVIDIA driver 580.x on both — do not use 590.x (CUDAGraph deadlock on GB10) Docker CE + NVIDIA Container Toolkit installed and working on both NGC container image pulled on both: nvcr.io/nvidia/vllm:26.04-py3 Step 1: Direct link — static IPs and MTU Connect the two Sparks with a QSFP-DD DAC cable. A 400G passive copper cable presents as two independent 200 Gb/s interfaces on each machine. Assign static IPs on a dedicated subnet with jumbo MTU.\nOn node-a:\nsudo nmcli con mod \u0026lt;qsfp-ch1\u0026gt; ipv4.addresses 192.168.100.10/24 ipv4.method manual 802-3-ethernet.mtu 9000 sudo nmcli con mod \u0026lt;qsfp-ch2\u0026gt; ipv4.addresses 192.168.100.12/24 ipv4.method manual 802-3-ethernet.mtu 9000 sudo nmcli con up \u0026lt;qsfp-ch1\u0026gt; sudo nmcli con up \u0026lt;qsfp-ch2\u0026gt; On node-b:\nsudo nmcli con mod \u0026lt;qsfp-ch1\u0026gt; ipv4.addresses 192.168.100.11/24 ipv4.method manual 802-3-ethernet.mtu 9000 sudo nmcli con mod \u0026lt;qsfp-ch2\u0026gt; ipv4.addresses 192.168.100.13/24 ipv4.method manual 802-3-ethernet.mtu 9000 sudo nmcli con up \u0026lt;qsfp-ch1\u0026gt; sudo nmcli con up \u0026lt;qsfp-ch2\u0026gt; Use nmcli only — not Netplan. Netplan with the networkd renderer silently does nothing on DGX Spark (no /run/systemd/network/).\nVerify (from node-a):\nping -c 4 192.168.100.11 # ch1 ping -c 4 192.168.100.13 # ch2 Expect 0% loss, \u0026lt;1.5 ms RTT on both channels.\nStep 2: RoCE verification Confirm RDMA/RoCE is active before proceeding. If NCCL falls back to TCP, you will get roughly 1/10th the bandwidth with no error message.\nInstall perftest if needed:\nsudo apt install -y perftest On node-b (server):\nib_write_bw -d \u0026lt;roce-interface\u0026gt; --report_gbits On node-a (client):\nib_write_bw -d \u0026lt;roce-interface\u0026gt; 192.168.100.11 --report_gbits Expected: 13+ Gb/s. The output header should show Link type: Ethernet and a GID-based address — confirming RoCE, not TCP.\nAlso confirm GID index 3 is RoCEv2+IPv4 on both machines:\nshow_gids | grep \u0026lt;interface-name\u0026gt; # index 3 should show an IPv4 address and RoCEv2 Step 3: Download and sync the model Download on one node:\ndocker run --rm \\ -v /home/$USER/vllm-cache:/root/.cache/huggingface \\ -e HF_HOME=/root/.cache/huggingface \\ nvcr.io/nvidia/vllm:26.04-py3 \\ huggingface-cli download Qwen/Qwen3-235B-A22B-GPTQ-Int4 ~118 GB across 32 shards. At ~75 MB/s on a typical connection, expect 25–30 minutes.\nThe container writes model files as root. 
Fix ownership before syncing:\nMYUID=$(id -u); MYGID=$(id -g) docker run --rm \\ -v /home/$USER/vllm-cache/hub/models--Qwen--Qwen3-235B-A22B-GPTQ-Int4:/target \\ alpine sh -c \u0026#34;chown -R ${MYUID}:${MYGID} /target\u0026#34; Then sync to the second node over the QSFP-DD link (~500 MB/s, ~4 minutes):\nMODEL=models--Qwen--Qwen3-235B-A22B-GPTQ-Int4 SRC=$USER@192.168.100.11:/home/$USER/vllm-cache/hub/${MODEL} DST=/home/$USER/vllm-cache/hub/${MODEL} mkdir -p ${DST}/blobs ${DST}/snapshots ${DST}/refs rsync -av --progress ${SRC}/blobs/ ${DST}/blobs/ rsync -av ${SRC}/snapshots/ ${DST}/snapshots/ rsync -av ${SRC}/refs/ ${DST}/refs/ The QSFP-DD link makes model sync fast enough to be practical. Over a standard gigabit LAN, the same transfer takes ~20 minutes.\nStep 4: Docker Compose structure Two compose files — one per node. The key differences:\nSetting node-a (head) node-b (worker) VLLM_HOST_IP 192.168.100.10 192.168.100.11 Entrypoint ray start --head then vllm serve retry loop: ray start --address=192.168.100.10:6379 --block Healthcheck curl /health ray status Both files share:\nnetwork_mode: host — required; bridge networking blocks NCCL rendezvous All NCCL environment variables (see Step 5) RAY_memory_monitor_refresh_ms=0 ptxas symlink in entrypoint (see note below) pip install ray[default] in entrypoint ptxas symlink: Triton\u0026rsquo;s JIT backend looks for ptxas in a path that doesn\u0026rsquo;t exist in the NGC container on SM121. Add this to the entrypoint of both containers:\nmkdir -p /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin ln -sf /usr/local/cuda/bin/ptxas \\ /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas Without it, Triton JIT compilation fails at the first inference call.\nStep 5: Critical NCCL environment variables Missing any of these causes silent hangs, TCP fallback, or deadlocks — no useful error message.\nNCCL_IB_HCA=\u0026lt;roce-ch1\u0026gt;,\u0026lt;roce-ch2\u0026gt; # pin NCCL to RoCE interfaces — required, or NCCL deadlocks silently NCCL_IB_DISABLE=0 # keep IB/RoCE path enabled NCCL_IB_GID_INDEX=3 # RoCEv2+IPv4 (confirmed via show_gids) NCCL_NET_PLUGIN=none # prevents AWS OFI plugin from falling back to TCP NCCL_IB_ROCE_VERSION_NUM=2 NCCL_IB_TIMEOUT=22 NCCL_DEBUG=WARN RAY_memory_monitor_refresh_ms=0 # prevents Ray killing vLLM on unified-memory systems GLOO_SOCKET_IFNAME=\u0026lt;ifname\u0026gt; # Ray gloo rendezvous — must point to the direct link interface VLLM_HOST_IP=\u0026lt;per-node IP\u0026gt; # node-a: 192.168.100.10 / node-b: 192.168.100.11 NCCL_IB_HCA is the most common source of silent failure. 
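The values for NCCL_IB_HCA are RDMA device names, not Linux interface names. List them before filling in the variable (ibv_devices ships with the ibverbs utilities; the sysfs listing works everywhere):\nibv_devices ls /sys/class/infiniband/ Use the two RoCE device names exactly as printed, comma-separated with no spaces, and get NCCL_IB_HCA right before touching anything else.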
Without it, NCCL on SM121 can deadlock with no output.\nStep 6: Start the cluster Start the worker node first — its entrypoint retries until the head node\u0026rsquo;s Ray service is ready:\nssh user@node-b \u0026#34;cd ~/vllm-compose \u0026amp;\u0026amp; docker compose up -d\u0026#34; ssh user@node-a \u0026#34;cd ~/vllm-compose \u0026amp;\u0026amp; docker compose up -d\u0026#34; Watch weight loading (both nodes load in parallel, ~7 min):\nssh user@node-a \u0026#39;docker logs -f vllm 2\u0026gt;\u0026amp;1 | grep -E \u0026#34;safetensors|Uvicorn|ERROR\u0026#34;\u0026#39; After ~2 minutes, verify Ray cluster formed — should show 2 nodes, 2 GPUs:\nssh user@node-a \u0026#39;docker exec vllm ray status\u0026#39; Poll until the API is ready (~15 min total):\nuntil curl -sf http://node-a:8000/health; do sleep 20; done \u0026amp;\u0026amp; echo \u0026#34;serving\u0026#34; Step 7: Smoke test curl http://node-a:8000/v1/chat/completions \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -d \u0026#39;{ \u0026#34;model\u0026#34;: \u0026#34;qwen3-235b\u0026#34;, \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;reply with the word: working /no_think\u0026#34;}], \u0026#34;max_tokens\u0026#34;: 50 }\u0026#39; Expected: \u0026quot;content\u0026quot;: \u0026quot;working\u0026quot;, \u0026quot;finish_reason\u0026quot;: \u0026quot;stop\u0026quot;.\nTroubleshooting Symptom Likely cause Fix Ray status shows 1 node Worker not connected yet Wait 2–3 min; check docker logs on worker vLLM hangs at weight loading NCCL rendezvous failure Verify NCCL_IB_HCA is set and interfaces are UP IndexError in fused_moe Wrong quantization format Use --quantization=gptq_marlin, not mxfp4 content: null in smoke test max_tokens too low Use max_tokens ≥ 50 with /no_think OOM at startup Page cache pressure sudo sh -c 'sync; echo 3 \u0026gt; /proc/sys/vm/drop_caches' rsync permission denied Cache files written as root by container Run the Alpine chown fix (see Step 3) NCCL falls back to TCP OFI plugin or missing HCA env var Set NCCL_NET_PLUGIN=none and NCCL_IB_HCA Triton JIT fails at first inference ptxas not found Add ptxas symlink to entrypoint (see Step 4) ","permalink":"https://conselara.dev/notes/two-node-ray-cluster-dgx-spark/","summary":"\u003cp\u003eQwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory — enough in theory, but once CUDA overhead and KV cache are factored in, it\u0026rsquo;s tight. Running it across two Sparks with TP=2 gives headroom for real workloads.\u003c/p\u003e\n\u003cp\u003eEach DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly.\u003c/p\u003e","title":"Building a Two-Node Ray Cluster for Distributed LLM Inference on DGX Spark"},{"content":"We migrated a Hugo static site from a self-hosted nginx container on a local server to S3 + CloudFront. The motivation was simple: a static site has no business running on a server we have to patch. 
The migration took a few hours and involved four gotchas that aren\u0026rsquo;t obvious from the AWS documentation.\nThis is a record of what we did and what tripped us up.\nThe setup Hugo static site (PaperMod theme) S3 bucket with all public access blocked — Origin Access Control (OAC) only CloudFront distribution with ACM SSL cert Cloudflare DNS, gray cloud (DNS-only) Gitea self-hosted repo with a webhook-triggered deploy container on-prem The deploy flow on push: Gitea fires a webhook → container on saturn pulls the repo, runs hugo --minify, syncs to S3, invalidates CloudFront.\nGotcha 1: S3 bucket names cannot have underscores The bucket name needs to be DNS-compliant. Underscores are not allowed. If the bucket is ever used for static website hosting via a CNAME, the bucket name has to match the domain exactly — and domain names can\u0026rsquo;t have underscores.\nWe wanted a name with underscores. We used hyphens instead.\nGotcha 2: ACM certificate must be in us-east-1 CloudFront only reads ACM certificates from us-east-1, regardless of where your bucket or other resources are. If you request the cert in another region, it won\u0026rsquo;t appear in the CloudFront certificate dropdown and the CLI will reject it.\naws acm request-certificate \\ --domain-name example.com \\ --subject-alternative-names www.example.com \\ --validation-method DNS \\ --region us-east-1 DNS validation records need to be added to Cloudflare before the cert issues. ACM polls for them — once they\u0026rsquo;re in, issuance takes a few minutes.\nGotcha 3: Cloudflare must be DNS-only (gray cloud) If Cloudflare is proxying traffic (orange cloud) and CloudFront is also doing SSL via ACM, the two SSL layers conflict. Cloudflare terminates the connection and tries to re-initiate to CloudFront — but CloudFront expects the original Host header, and Cloudflare\u0026rsquo;s proxy rewrites it.\nSet every DNS record pointing to CloudFront to DNS-only (gray cloud). Let CloudFront handle SSL end-to-end via ACM. Cloudflare\u0026rsquo;s proxy adds no value here.\nApex CNAMEs work fine in Cloudflare — it does CNAME flattening at the zone root automatically.\nOne other thing: we had a wildcard * A record in Cloudflare pointing to our local server, which handles staging and other internal subdomains. We left it in place. Explicit records (apex and www) take precedence over the wildcard — subdomains continue routing to the local server, apex and www go to CloudFront.\nGotcha 4: CloudFront default root object only covers / Hugo generates clean URLs — /notes/some-article/ resolves to /notes/some-article/index.html in S3. CloudFront\u0026rsquo;s \u0026ldquo;default root object\u0026rdquo; setting only rewrites requests for the bare /. Every other directory path (/notes/, /notes/some-article/) hits S3 as-is and gets a 404.\nThe fix is a CloudFront Function attached to the viewer-request event:\nfunction handler(event) { var request = event.request; var uri = request.uri; if (uri.endsWith(\u0026#39;/\u0026#39;)) { request.uri += \u0026#39;index.html\u0026#39;; } else if (!uri.includes(\u0026#39;.\u0026#39;)) { request.uri += \u0026#39;/index.html\u0026#39;; } return request; } This rewrites any request ending in / or with no file extension to append index.html before CloudFront fetches from S3. Create it, publish it to LIVE, and associate it with the default cache behavior as a viewer-request function.\nWe discovered this after go-live when every article returned 404. The homepage worked because it hits the root default object. 
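A pair of status checks against the distribution domain makes the failure mode obvious (the hostname below is a placeholder):\ncurl -sI https://dXXXX.cloudfront.net/ | head -1 curl -sI https://dXXXX.cloudfront.net/notes/ | head -1 Until the function was associated and published, only the bare / returned 200.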
Everything else did not.\nThe deploy webhook Rather than a managed CI/CD service, we run a small webhook container on our local server. On push, Gitea sends a signed POST to it. The container verifies the HMAC-SHA256 signature, then runs the deploy script:\n#!/bin/sh set -e git fetch origin main git reset --hard origin/main hugo --source /site --destination /public --minify aws s3 sync /public/ s3://your-bucket-name --delete aws cloudfront create-invalidation --distribution-id \u0026lt;ID\u0026gt; --paths \u0026#34;/*\u0026#34; AWS credentials are passed as environment variables from a .env file on the host, not baked into the container image.\nThe CloudFront invalidation adds a second or two to the deploy time and ensures cached stale files don\u0026rsquo;t linger. For a low-traffic site the cost is negligible.\nSummary Issue Fix Bucket name with underscores rejected Use hyphens — S3 names are DNS-compliant ACM cert not appearing in CloudFront Request cert in us-east-1 SSL errors with Cloudflare Set DNS records to DNS-only (gray cloud) All subdirectory paths returning 404 Add a CloudFront Function to rewrite directory URLs to index.html The end state: push to Gitea, site updates in under 30 seconds, no servers to maintain.\n","permalink":"https://conselara.dev/notes/hugo-s3-cloudfront-static-deploy/","summary":"\u003cp\u003eWe migrated a Hugo static site from a self-hosted nginx container on a local server to S3 + CloudFront. The motivation was simple: a static site has no business running on a server we have to patch. The migration took a few hours and involved four gotchas that aren\u0026rsquo;t obvious from the AWS documentation.\u003c/p\u003e\n\u003cp\u003eThis is a record of what we did and what tripped us up.\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"the-setup\"\u003eThe setup\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eHugo\u003c/strong\u003e static site (PaperMod theme)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eS3\u003c/strong\u003e bucket with all public access blocked — Origin Access Control (OAC) only\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCloudFront\u003c/strong\u003e distribution with ACM SSL cert\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCloudflare\u003c/strong\u003e DNS, gray cloud (DNS-only)\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGitea\u003c/strong\u003e self-hosted repo with a webhook-triggered deploy container on-prem\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe deploy flow on push: Gitea fires a webhook → container on saturn pulls the repo, runs \u003ccode\u003ehugo --minify\u003c/code\u003e, syncs to S3, invalidates CloudFront.\u003c/p\u003e","title":"Deploying a Hugo Site to S3 + CloudFront: What Actually Bit Us"},{"content":"Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. 
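Before applying any Blackwell advice, check which silicon you are actually on:\nnvidia-smi --query-gpu=name,compute_cap --format=csv,noheader # GB10 reports compute capability 12.1 (SM121); B100/B200 report 10.0 (SM100)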
A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here — some actively break things.\nSM121 is not datacenter Blackwell Feature DGX Spark (GB10 / SM121) Datacenter Blackwell (B100/B200) TMEM No Yes WGMMA No Yes DSMEM No Yes NVSwitch No Yes CUTLASS FP4 Broken — silent garbage output Supported Memory type Unified LPDDR5X (shared CPU+GPU) HBM3e (GPU-only) Memory per unit 128 GB 192 GB GPUs per unit 1 logical GPU 1 GPU When you see forum recommendations or vLLM flags that say \u0026ldquo;for Blackwell\u0026rdquo; — verify they\u0026rsquo;re for SM121 specifically before using them.\nMemory architecture The 128 GB LPDDR5X pool is shared between CPU and GPU. There is no separate VRAM. This affects everything:\nThe Linux page cache competes directly with CUDA allocations. A warm filesystem cache can push CUDA into OOM. Never set --gpu-memory-utilization above 0.90. In practice, 0.85–0.87 is the safe range. At long context lengths with large KV caches, you may need to drop further. If you hit OOM at container startup, drop the page cache first: sudo sh -c \u0026#39;sync; echo 3 \u0026gt; /proc/sys/vm/drop_caches\u0026#39; Ray\u0026rsquo;s memory monitor misreads page cache pressure as real memory pressure and will kill vLLM mid-inference. In multi-node configurations, always set: RAY_memory_monitor_refresh_ms=0 GPU count and tensor parallelism Each DGX Spark presents as one logical GPU. Tensor parallelism beyond TP=1 requires a Ray cluster spanning multiple units — there is no NVLink or NVSwitch. Inter-node communication runs over NCCL via a direct RoCE interconnect.\nSoftware stack Component Version OS Ubuntu 24.04.4 LTS Kernel 6.17.0-1014-nvidia NVIDIA driver 580.142 CUDA (in NGC container) 13.0 Docker CE 29.2.1 NVIDIA Container Toolkit 1.19.0-1 Driver: Stay on 580.x. Driver 590.x has a confirmed CUDAGraph deadlock on GB10. Pin the package to prevent accidental upgrades:\nsudo apt-mark hold nvidia-driver-580 QSFP-DD direct interconnect For multi-node setups, a direct QSFP-DD cable between two Sparks gives you a high-bandwidth RoCE link without a switch.\nCable: A 400G DAC passive copper cable (e.g., Amphenol NJAAKK-N911, 1m) presents as two independent 200 Gb/s logical interfaces on each machine. Both channels share the same serial number — this is normal for this cable type.\nMeasured bandwidth (ib_write_bw) Channel Result Channel 1 ~13.35 Gb/s Channel 2 ~13.26 Gb/s Combined ~26.6 Gb/s The below-theoretical numbers are a known ib_write_bw artifact — it defaults to a single queue pair and 4096 B MTU. Actual NCCL throughput with multiple QPs approaches the full line rate. This is not a hardware or config problem.\nNetwork configuration: nmcli only The DGX Spark is fully NetworkManager-based. Netplan with the networkd renderer silently does nothing — there is no /run/systemd/network/ directory. All persistent interface configuration must use nmcli:\nsudo nmcli con mod \u0026lt;connection-name\u0026gt; \\ ipv4.addresses 192.168.100.10/24 \\ ipv4.method manual \\ 802-3-ethernet.mtu 9000 sudo nmcli con up \u0026lt;connection-name\u0026gt; Set MTU to 9000 (jumbo frames) on all QSFP-DD interfaces.\nRoCE GID index GID index 3 = RoCEv2 + IPv4 on both machines. 
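The index is a property of the GID table, not a constant, so treat 3 as a default to verify. If show_gids is missing, the table can be read straight from sysfs; a sketch, with a device name from our NCCL notes (substitute yours):\nfor i in 0 1 2 3; do echo $i $(cat /sys/class/infiniband/rocep1s0f1/ports/1/gids/$i) $(cat /sys/class/infiniband/rocep1s0f1/ports/1/gid_attrs/types/$i 2\u0026gt;/dev/null); done The standard tool is simpler.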
Confirm with:\nshow_gids | grep \u0026lt;interface-name\u0026gt; This is required for NCCL: NCCL_IB_GID_INDEX=3.\nCompute kernel compatibility What works Kernel / Format Status --attention-backend=TRITON_ATTN ✅ Stable --moe-backend=marlin ✅ Required for MoE models gptq_marlin quantization ✅ Fully supported mxfp4 (gpt-oss format only) ✅ Works for gpt-oss pre-quantized checkpoints FP8 KV cache (--kv-cache-dtype=fp8) ✅ Safe and recommended CUDAGraph (--max-cudagraph-capture-size=2048) ✅ Required for full throughput What is broken or absent on SM121 Kernel / Format Status CUTLASS FP4 ❌ Produces garbage outputs silently FlashInfer attention ❌ Accuracy bugs on Blackwell FP8; MoE backends absent on SM121 mxfp4 on standard HuggingFace BF16 checkpoints ❌ IndexError + shape mismatch — gpt-oss format only --enforce-eager ❌ Disables CUDAGraph — ~55% throughput loss --load-format=fastsafetensors ❌ ImportError in NGC 26.03/26.04 For full runtime flag guidance, see the vLLM on DGX Spark SM121 post.\nStartup time reference For a two-node Qwen3-235B-A22B-GPTQ-Int4 cluster:\nPhase Duration Ray cluster formation ~2 min Model weight loading (118 GB, TP=2) ~7 min CUDA graph capture + compile ~5 min Total: first successful inference ~15 min Single-node (120B model): ~8 min total.\nTemperature monitoring # GPU temperature nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader # CPU/SoC temperature sudo tegrastats --interval 1000 Normal operating range under sustained inference load: GPU 60–75°C, CPU 50–65°C. The GB10 thermal solution handles continuous full-load operation without throttling.\n","permalink":"https://conselara.dev/notes/dgx-spark-gb10-hardware-reference/","summary":"\u003cp\u003eReference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. 
A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here — some actively break things.\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"sm121-is-not-datacenter-blackwell\"\u003eSM121 is not datacenter Blackwell\u003c/h2\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003eFeature\u003c/th\u003e\n          \u003cth\u003eDGX Spark (GB10 / SM121)\u003c/th\u003e\n          \u003cth\u003eDatacenter Blackwell (B100/B200)\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eTMEM\u003c/td\u003e\n          \u003ctd\u003eNo\u003c/td\u003e\n          \u003ctd\u003eYes\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eWGMMA\u003c/td\u003e\n          \u003ctd\u003eNo\u003c/td\u003e\n          \u003ctd\u003eYes\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eDSMEM\u003c/td\u003e\n          \u003ctd\u003eNo\u003c/td\u003e\n          \u003ctd\u003eYes\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eNVSwitch\u003c/td\u003e\n          \u003ctd\u003eNo\u003c/td\u003e\n          \u003ctd\u003eYes\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eCUTLASS FP4\u003c/td\u003e\n          \u003ctd\u003eBroken — silent garbage output\u003c/td\u003e\n          \u003ctd\u003eSupported\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eMemory type\u003c/td\u003e\n          \u003ctd\u003eUnified LPDDR5X (shared CPU+GPU)\u003c/td\u003e\n          \u003ctd\u003eHBM3e (GPU-only)\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eMemory per unit\u003c/td\u003e\n          \u003ctd\u003e128 GB\u003c/td\u003e\n          \u003ctd\u003e192 GB\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eGPUs per unit\u003c/td\u003e\n          \u003ctd\u003e1 logical GPU\u003c/td\u003e\n          \u003ctd\u003e1 GPU\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eWhen you see forum recommendations or vLLM flags that say \u0026ldquo;for Blackwell\u0026rdquo; — verify they\u0026rsquo;re for SM121 specifically before using them.\u003c/p\u003e","title":"DGX Spark GB10 Hardware Reference: SM121 Architecture, Memory, and Networking"},{"content":"The DGX Spark GB10 runs SM121 — the Grace Blackwell Superchip. It is not the same silicon as datacenter Blackwell (SM100, B100/B200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121.\nThis is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks — single-node and two-node cluster configurations.\nAttention and MoE backends Use --attention-backend=TRITON_ATTN. Do not use FlashInfer.\nFlashInfer has confirmed accuracy bugs with FP8 models on Blackwell (vLLM issue #35138). On SM121 specifically, the FlashInfer MoE backends are unavailable — vLLM silently falls back to Triton anyway. Forum posts recommending FlashInfer are written for datacenter Blackwell. For SM121, set Triton explicitly and move on.\nFor MoE models, --moe-backend=marlin is required.\nCUTLASS FP4 is broken on SM121. 
It produces garbage output silently — inference runs, generation looks plausible, but outputs are wrong. The correct MoE kernel for SM121 is Marlin. Set it explicitly:\n--attention-backend=TRITON_ATTN --moe-backend=marlin Also set this environment variable:\nVLLM_USE_FLASHINFER_MOE_FP4=0 Without it, MoE routing can go through the broken CUTLASS FP4 path.\nNever use --enforce-eager CUDAGraph is not optional on SM121. Disabling it with --enforce-eager cuts throughput roughly 55% — from ~59 tok/s to ~26 tok/s in our measurements. There is no scenario where this tradeoff is worth it. If you are adding it to work around a startup issue, fix the underlying issue instead.\nUnified memory ceiling The GB10 uses unified LPDDR5X memory — CPU and GPU share the same physical pool. The OS page cache competes directly with CUDA for this memory. Setting --gpu-memory-utilization too high causes OOM crashes or Xid 43 GPU channel preemption under load.\nHard limit: never exceed 0.90. In practice, 0.85–0.87 is the safe range for most models. At 131K context lengths with large KV caches, we needed to drop to 0.82 to stop Xid 43 errors on the two-node cluster.\nIf you hit OOM at startup, drop page cache before restarting:\nsudo sh -c \u0026#39;sync; echo 3 \u0026gt; /proc/sys/vm/drop_caches\u0026#39; Driver version Stay on 580.x. Do not upgrade to 590.x.\nNVIDIA driver 590.x has a confirmed CUDAGraph deadlock on GB10. At the time of writing, 580.142 is the correct version for SM121. Verify before upgrading any NGC container:\nnvidia-smi --query-gpu=driver_version --format=csv,noheader Flags that break NGC containers --load-format fastsafetensors causes an ImportError in NGC 26.03 and 26.04. It is available in some community-built vLLM containers but not in stock NGC images. Omit it. The default mmap format is slower to load on startup but identical at runtime.\nVLLM_MARLIN_USE_ATOMIC_ADD=1 is required for Marlin on SM121. Without it, there is a race condition in the Marlin kernel that produces incorrect outputs. Set it in your environment:\nVLLM_MARLIN_USE_ATOMIC_ADD=1 Quantization: what works with what checkpoint format The NGC 26.04 mxfp4 weight loader only handles gpt-oss pre-quantized checkpoints — specifically the format where expert weights are stored as 3D uint8 tensors. Standard HuggingFace BF16 checkpoints (Qwen3, Llama, etc.) are 2D BF16 tensors. Loading them with --quantization=mxfp4 produces an IndexError in fused_moe/layer.py, and after patching that, a dtype/shape mismatch.\nIf you are running gpt-oss-120b: --quantization=mxfp4 works.\nIf you are running any standard HuggingFace MoE checkpoint: use --quantization=gptq_marlin (for GPTQ-Int4 checkpoints) or --quantization=fp8 (for FP8 checkpoints). There is no online BF16→mxfp4 quantization path in the NGC build.\nReasoning models and content: null Reasoning models (gpt-oss, Qwen3 in thinking mode) generate chain-of-thought tokens before producing content. These tokens consume the max_tokens budget. If max_tokens is too low, the model exhausts the budget during reasoning and returns content: null.\nSet max_tokens to at least 512 for any reasoning model. For tool-calling workflows or complex prompts, 1024 or higher.\nFor latency-sensitive calls with Qwen3, append /no_think to the prompt to skip reasoning mode entirely.\nMulti-node: NCCL and Ray on SM121 Single-GPU per machine (no NVSwitch), so tensor parallelism across two Sparks requires Ray + NCCL over a direct interconnect. 
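Once the Ray cluster is formed, the head-node serve command is the single-node one plus the parallelism flags; a sketch using our benchmark settings (the two parallelism flags are upstream vLLM names, so verify them against the NGC build):\nvllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \\ --tensor-parallel-size 2 \\ --distributed-executor-backend ray \\ --attention-backend=TRITON_ATTN \\ --moe-backend=marlin \\ --quantization=gptq_marlin \\ --kv-cache-dtype=fp8 \\ --gpu-memory-utilization=0.87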
We use a 400G QSFP-DD DAC between the two nodes, presenting as 2×200G RoCEv2.\nSeveral things will silently fail or deadlock without explicit configuration.\nnetwork_mode: host is required. Bridge networking blocks NCCL rendezvous. This is not optional.\nNCCL_IB_HCA must be set. Without pinning NCCL to the RoCE interfaces, NCCL on SM121 can deadlock silently — no error, no output, just a hung process.\nNCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1 NCCL_IB_DISABLE=0 NCCL_IB_GID_INDEX=3 # RoCEv2 + IPv4; verify with show_gids on your hardware NCCL_NET_PLUGIN=none # prevents AWS OFI plugin TCP fallback NCCL_IB_ROCE_VERSION_NUM=2 NCCL_IB_TIMEOUT=22 RAY_memory_monitor_refresh_ms=0 is required. Ray\u0026rsquo;s memory monitor can kill inference processes mid-run on unified-memory systems. Set it to 0 to disable it.\nTriton needs a ptxas symlink. The NGC container\u0026rsquo;s Triton backend looks for ptxas in a path it does not have on SM121. Add this to the container entrypoint:\nmkdir -p /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin ln -sf /usr/local/cuda/bin/ptxas \\ /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas Without it, Triton JIT compilation fails at the first inference call.\nGLOO_SOCKET_IFNAME must point to the interconnect interface. Ray gloo rendezvous needs to use the direct QSFP-DD interface, not a random one:\nGLOO_SOCKET_IFNAME=enp1s0f1np1 Prefix cache hit rate metric The gpu_prefix_cache_hit_rate gauge is not present in vLLM 0.19.0. Calculate it from raw counters:\nhits = vllm:prefix_cache_hits_total queries = vllm:prefix_cache_queries_total rate = hits / queries Summary table Rule Why --attention-backend=TRITON_ATTN FlashInfer accuracy bugs on Blackwell FP8; MoE backends unavailable on SM121 --moe-backend=marlin CUTLASS FP4 broken on SM121 — silent garbage output VLLM_USE_FLASHINFER_MOE_FP4=0 Prevents MoE routing through broken CUTLASS path VLLM_MARLIN_USE_ATOMIC_ADD=1 Marlin race condition on SM121 Never --enforce-eager Disables CUDAGraph; ~55% throughput drop --gpu-memory-utilization ≤ 0.90 Unified memory — page cache competes with CUDA Stay on driver 580.x 590.x has CUDAGraph deadlock on GB10 No --load-format fastsafetensors ImportError in NGC 26.03/26.04 mxfp4 quantization: gpt-oss checkpoints only Loader is format-specific; use gptq_marlin or fp8 for standard HF checkpoints max_tokens ≥ 512 for reasoning models Reasoning tokens consume budget before content; low values return content: null network_mode: host for multi-node Bridge networking blocks NCCL rendezvous NCCL_IB_HCA set explicitly Silent deadlock without interface pinning RAY_memory_monitor_refresh_ms=0 Ray kills processes mid-inference on unified memory ptxas symlink in entrypoint Triton JIT fails without it on SM121 ","permalink":"https://conselara.dev/notes/vllm-dgx-spark-sm121-gotchas/","summary":"\u003cp\u003eThe DGX Spark GB10 runs SM121 — the Grace Blackwell Superchip. It is not the same silicon as datacenter Blackwell (SM100, B100/B200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. 
Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121.\u003c/p\u003e\n\u003cp\u003eThis is a reference for what we learned running vLLM 0.19.0 (NGC container \u003ccode\u003envcr.io/nvidia/vllm:26.04-py3\u003c/code\u003e) on two DGX Sparks — single-node and two-node cluster configurations.\u003c/p\u003e","title":"vLLM on DGX Spark: What the SM121 Architecture Actually Requires"},{"content":"We built an internal knowledge base server to give our AI agents access to Conselara\u0026rsquo;s company data — capabilities, past performance, GSA rates, certifications. The idea was straightforward: expose it as an MCP server so any AI client could query it semantically.\nIt worked in Claude Code. It worked nowhere else.\nWhat MCP promises The Model Context Protocol is Anthropic\u0026rsquo;s open standard for connecting AI models to external tools and data sources. The pitch is compelling: define your server once, and any MCP-compatible client can call it. Claude Code has native MCP support. The ecosystem is growing.\nWe built the server using FastMCP, a Python framework that makes standing up an MCP SSE (Server-Sent Events) endpoint straightforward. The server embedded documents using sentence-transformers/all-MiniLM-L6-v2, stored vectors in Qdrant, and exposed a search tool over SSE.\nThe client support problem When we tried to connect it to the other tools in our stack, we hit a wall:\nClient MCP SSE REST/OpenAPI Claude Code ✅ Native ✅ via Bash Pi agent Needs custom SDK extension ✅ Simple fetch OpenWebUI ❌ Not supported ✅ Python tool ChatGPT ❌ Not supported ✅ Custom GPT action claude.ai ✅ Remote MCP ✅ OpenAPI connector MCP SSE works natively in Claude Code. For everything else, you are either writing custom protocol wiring or running mcpo — a proxy container that translates MCP to OpenAPI. We had already built the server. We did not want to add another container on top of it just to make it callable from OpenWebUI.\nThe fundamental issue: MCP is young and client support is inconsistent. REST and OpenAPI have 20 years of tooling behind them. Every HTTP client in existence knows how to make a POST request.\nThe rebuild We rewrote the server in FastAPI in an afternoon. The core logic — embedding, Qdrant queries, chunking — stayed identical. We added:\nPOST /search — semantic search, returns ranked chunks POST /ingest — wipes and re-ingests all KB files GET /stats — point count and file list GET /openapi.json — auto-generated by FastAPI, works directly as a ChatGPT action or claude.ai connector The one endpoint worth calling out: GET /owu-tool. OpenWebUI lets you import Python tool definitions from a URL. Rather than asking users to paste code and deal with syntax errors, we serve the tool definition directly from the server:\n@app.get(\u0026#34;/owu-tool\u0026#34;, response_class=PlainTextResponse) def owu_tool(): return OWU_TOOL # Python string served as plain text Point OpenWebUI at http://\u0026lt;host\u0026gt;/owu-tool and it imports cleanly. No copy-paste, no syntax errors.\nConnecting each client Claude Code — no configuration needed, just Bash:\ncurl -s -X POST http://\u0026lt;your-server\u0026gt;/search \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -d \u0026#39;{\u0026#34;query\u0026#34;: \u0026#34;GSA labor categories\u0026#34;}\u0026#39; Pi agent — a TypeScript extension in ~/.pi/agent/extensions/ that calls the REST endpoint via fetch. 
Pi auto-discovers extensions on startup, no flags needed.\nOpenWebUI — Workspace → Tools → Import from URL → http://\u0026lt;host\u0026gt;/owu-tool. Done.\nChatGPT / claude.ai — expose via a public URL, then point each at /openapi.json. FastAPI generates the spec automatically.\nWhat we would do differently One thing we have not resolved yet: the same KB source files now feed two separate Qdrant collections. Our Hermes agent uses mcp-server-qdrant with a namespaced collection. The FastAPI server manages its own separate collection. Edits to KB files need to be synced and ingested twice or the collections drift.\nThe fix is to point the FastAPI server at the existing Hermes collection directly, eliminating the duplicate. We documented it as a known issue and have not gotten to it yet.\nThe takeaway MCP is the right long-term bet for AI tool interoperability. The protocol is well-designed and Anthropic is investing in it seriously. But today, client support is fragmented. If you need your tool server to work across Claude Code, OpenWebUI, ChatGPT, and custom agents simultaneously, REST with an OpenAPI spec is the pragmatic choice — one server, no proxy layer, every client covered.\nWe will likely add MCP back on top of the FastAPI server as an adapter when the client ecosystem matures. For now, HTTP is doing the job.\n","permalink":"https://conselara.dev/notes/mcp-to-fastapi-lessons-learned/","summary":"\u003cp\u003eWe built an internal knowledge base server to give our AI agents access to Conselara\u0026rsquo;s company data — capabilities, past performance, GSA rates, certifications. The idea was straightforward: expose it as an MCP server so any AI client could query it semantically.\u003c/p\u003e\n\u003cp\u003eIt worked in Claude Code. It worked nowhere else.\u003c/p\u003e\n\u003ch2 id=\"what-mcp-promises\"\u003eWhat MCP promises\u003c/h2\u003e\n\u003cp\u003eThe Model Context Protocol is Anthropic\u0026rsquo;s open standard for connecting AI models to external tools and data sources. The pitch is compelling: define your server once, and any MCP-compatible client can call it. Claude Code has native MCP support. The ecosystem is growing.\u003c/p\u003e","title":"We Replaced an MCP Server with FastAPI and It Worked Everywhere"},{"content":"We are integrating AI across several workstreams on a federal health research information platform.\nPublication discovery — using LLMs to surface relevant PubMed research, reducing manual literature review time and improving coverage across a high-volume publication landscape.\nLLM comparative evaluations — running structured benchmarks across models to assess quality, consistency, and cost for specific content tasks on the platform. Evaluations are task-specific rather than general — we score against real outputs the platform needs to produce.\nAI-assisted development workflows — incorporating AI tooling into the engineering workflow for code review, documentation, and implementation acceleration. All outputs are reviewed by the project team before use.\nAI-powered CMS module vetting — evaluating third-party AI-powered Drupal modules before integration. 
Assessment criteria include data handling, output reliability, and compatibility with federal security requirements.\nCommunications and dissemination — using AI tools to support drafting and refinement of materials, with project team review and approval on all final outputs.\n","permalink":"https://conselara.dev/notes/ai-health-research-platform/","summary":"\u003cp\u003eWe are integrating AI across several workstreams on a federal health research information platform.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePublication discovery\u003c/strong\u003e — using LLMs to surface relevant PubMed research, reducing manual literature review time and improving coverage across a high-volume publication landscape.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLLM comparative evaluations\u003c/strong\u003e — running structured benchmarks across models to assess quality, consistency, and cost for specific content tasks on the platform. Evaluations are task-specific rather than general — we score against real outputs the platform needs to produce.\u003c/p\u003e","title":"AI Across a Health Research Information Platform"},{"content":"Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted.\nQwen3-235B-A22B-GPTQ-Int4 — Two-node cluster Date: 2026-05-03\nConfig: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87\nBatch Avg completion tokens tok/s per request Aggregate tok/s 1 (serial) 256 17.0 17.0 2 (concurrent) 256 12.1 24.1 4 (concurrent) 256 9.1 36.4 Prefix cache: 97% delta hit rate on repeated system prompt.\nStartup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile).\nWeight resident per node: 57.64 GiB.\nvs NVIDIA\u0026rsquo;s official published number Source Config tok/s (batch=1) This lab GPTQ-Int4, vLLM gptq_marlin 17.0 NVIDIA official NVFP4, TRT-LLM 11.73 Our GPTQ-Int4 + vLLM result beats NVIDIA\u0026rsquo;s own published NVFP4 + TRT-LLM number by ~45% at batch=1. The NVFP4 SM121 MoE kernel is still maturing — GPTQ-Marlin is a more optimized path on SM121 today.\ngpt-oss-120b — Single node Date: 2026-05-02\nConfig: TP=1, --quantization=mxfp4, --kv-cache-dtype=fp8, --attention-backend=TRITON_ATTN, --moe-backend=marlin, --gpu-memory-utilization=0.87, --max-cudagraph-capture-size=2048\nMetric Value Generation throughput (with reasoning overhead) ~32–35 tok/s Pure decode (no reasoning) 57–60 tok/s Prefix cache hit rate ~76% Context window 128,000 tokens Effect of --enforce-eager Config tok/s CUDAGraph enabled (default) ~59 tok/s --enforce-eager (CUDAGraph disabled) ~26 tok/s --enforce-eager cuts throughput by ~55% on SM121. Never use it.\nEffect of --max-cudagraph-capture-size The NVIDIA default Blackwell recipe sets this to 32, which limits graph coverage to batch sizes up to 32. Setting it to 2048 provides full coverage from batch=1 through batch=2048 with no meaningful overhead.\nInter-node network (QSFP-DD RoCE direct connect) Measured with ib_write_bw (single QP, 4096 B MTU), RDMA/RoCE confirmed:\nChannel Bandwidth Channel 1 13.35 Gb/s Channel 2 13.26 Gb/s Combined ~26.6 Gb/s Note: single QP with 4096 B MTU caps results well below the 200 Gb/s theoretical maximum — this is a benchmark tool artifact. A multi-QP test with 9000 B MTU should approach the full ceiling. 
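To reproduce that comparison, perftest accepts a queue-pair count; a sketch with the device name from our NCCL config, server on node-b first, then client on node-a:\nib_write_bw -d rocep1s0f1 -q 4 --report_gbits ib_write_bw -d rocep1s0f1 -q 4 --report_gbits 192.168.100.11 # -q 4 opens four queue pairs; the default is 1.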
Model weight cache sync over QSFP-DD averages ~500–580 MB/s (118 GB transferred in ~3–4 minutes).\nCommunity benchmarks for comparison Model Source Config tok/s Qwen3-235B This lab GPTQ-Int4, gptq_marlin, TP=2 17 (b=1) / 36 agg (b=4) Qwen3-235B NVIDIA official NVFP4, TRT-LLM 11.73 (b=1) gpt-oss-120b This lab mxfp4, Marlin, TP=1 57–60 (pure decode) Qwen3-30B-A3B Community (jleighfields) NVFP4, vLLM 32–45 Qwen3.6-27B Community (NVIDIA forums) FP8, stock NGC 14–21 Qwen3.6-27B Community (mitkox fork) FP8, DFlash+DDTree 136–200 ","permalink":"https://conselara.dev/notes/dgx-spark-benchmarks/","summary":"\u003cp\u003eMeasured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container \u003ccode\u003envcr.io/nvidia/vllm:26.04-py3\u003c/code\u003e) unless noted.\u003c/p\u003e\n\u003ch2 id=\"qwen3-235b-a22b-gptq-int4--two-node-cluster\"\u003eQwen3-235B-A22B-GPTQ-Int4 — Two-node cluster\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eDate:\u003c/strong\u003e 2026-05-03\u003cbr\u003e\n\u003cstrong\u003eConfig:\u003c/strong\u003e TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, \u003ccode\u003e--attention-backend=TRITON_ATTN\u003c/code\u003e, \u003ccode\u003e--quantization=gptq_marlin\u003c/code\u003e, \u003ccode\u003e--kv-cache-dtype=fp8\u003c/code\u003e, \u003ccode\u003e--gpu-memory-utilization=0.87\u003c/code\u003e\u003c/p\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003eBatch\u003c/th\u003e\n          \u003cth\u003eAvg completion tokens\u003c/th\u003e\n          \u003cth\u003etok/s per request\u003c/th\u003e\n          \u003cth\u003eAggregate tok/s\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e1 (serial)\u003c/td\u003e\n          \u003ctd\u003e256\u003c/td\u003e\n          \u003ctd\u003e\u003cstrong\u003e17.0\u003c/strong\u003e\u003c/td\u003e\n          \u003ctd\u003e17.0\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e2 (concurrent)\u003c/td\u003e\n          \u003ctd\u003e256\u003c/td\u003e\n          \u003ctd\u003e12.1\u003c/td\u003e\n          \u003ctd\u003e\u003cstrong\u003e24.1\u003c/strong\u003e\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e4 (concurrent)\u003c/td\u003e\n          \u003ctd\u003e256\u003c/td\u003e\n          \u003ctd\u003e9.1\u003c/td\u003e\n          \u003ctd\u003e\u003cstrong\u003e36.4\u003c/strong\u003e\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003ePrefix cache:\u003c/strong\u003e 97% delta hit rate on repeated system prompt.\u003cbr\u003e\n\u003cstrong\u003eStartup to first inference:\u003c/strong\u003e ~15 minutes (Ray init + weight load across two nodes + compile).\u003cbr\u003e\n\u003cstrong\u003eWeight resident per node:\u003c/strong\u003e 57.64 GiB.\u003c/p\u003e","title":"DGX Spark Benchmark Results: vLLM on SM121"},{"content":"Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). 
Based on tested configurations and community results as of May 2026.\nModel Architecture Quantization Memory Expected tok/s SM121 notes Qwen3.6-35B-A3B Pure MoE (3B active) FP8 (~35 GB) ✅ easily 100+ Pure MoE, no GDN — fully supported Qwen3.6-27B Dense hybrid (GDN) FP8 (~28 GB) ✅ easily 14–21 (stock) / 136–200 (fork) GDN kernel gap; experimental fork needed for full speed Qwen3-30B-A3B Pure MoE (3.3B active) NVFP4 / FP8 / BF16 (~16–60 GB) ✅ easily 32–50 Solid single-node option; no GDN gpt-oss-120b Sparse MoE (5.1B active) mxfp4 (~61 GB) ✅ 32–60 128K context; proprietary quant format Qwen3.5-122B-A10B Pure MoE (10B active) NVFP4 only (~75 GB) ✅ up to 51 BF16 is 234 GB — does not fit; NVFP4 is the only path Qwen3-235B-A22B Pure MoE (22B active) GPTQ-Int4 (~60 GB/node) ✅ (two nodes) 17–36 agg Requires two DGX Sparks; best quality available Qwen3.5-397B-A17B Pure MoE (17B active) NVFP4 (TP=2) ✅ (two nodes) Unknown SM121 MoE kernel not yet optimized; not recommended Key observations Throughput vs quality tradeoff at single-node: Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) with pure MoE architecture. Qwen3.5-122B-A10B gives the most capable model (10B active parameters) that fits on one node, at 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation — so 51 tok/s is more than sufficient.\nThe GDN trap: Qwen3.6-27B looks attractive on paper — it\u0026rsquo;s small (28 GB), recent, and dense. But the GDN attention kernel has a gap on SM121 that cuts it to 14–21 tok/s with stock NGC. Qwen3.6-35B-A3B is larger on paper but runs 5–7× faster in practice.\nNVFP4 is the only path to 122B on one node: Qwen3.5-122B-A10B at BF16 is 234 GB — it doesn\u0026rsquo;t fit. NVFP4 quantization brings it to ~75 GB. There is no other quantization format that both fits and runs correctly on SM121. See Running Qwen3.5-122B on a Single DGX Spark for setup details.\nTwo-node ceiling: Qwen3-235B-A22B over a QSFP-DD direct interconnect is the highest quality configuration available on two Sparks. Our benchmarks show 17 tok/s at batch=1 and 36 tok/s aggregate at batch=4 — beating NVIDIA\u0026rsquo;s own published TRT-LLM number by ~45%.\n","permalink":"https://conselara.dev/notes/dgx-spark-model-comparison/","summary":"\u003cp\u003eQuick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). 
Based on tested configurations and community results as of May 2026.\u003c/p\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003eModel\u003c/th\u003e\n          \u003cth\u003eArchitecture\u003c/th\u003e\n          \u003cth\u003eQuantization\u003c/th\u003e\n          \u003cth\u003eMemory\u003c/th\u003e\n          \u003cth\u003eExpected tok/s\u003c/th\u003e\n          \u003cth\u003eSM121 notes\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eQwen3.6-35B-A3B\u003c/td\u003e\n          \u003ctd\u003ePure MoE (3B active)\u003c/td\u003e\n          \u003ctd\u003eFP8 (~35 GB)\u003c/td\u003e\n          \u003ctd\u003e✅ easily\u003c/td\u003e\n          \u003ctd\u003e100+\u003c/td\u003e\n          \u003ctd\u003ePure MoE, no GDN — fully supported\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eQwen3.6-27B\u003c/td\u003e\n          \u003ctd\u003eDense hybrid (GDN)\u003c/td\u003e\n          \u003ctd\u003eFP8 (~28 GB)\u003c/td\u003e\n          \u003ctd\u003e✅ easily\u003c/td\u003e\n          \u003ctd\u003e14–21 (stock) / 136–200 (fork)\u003c/td\u003e\n          \u003ctd\u003eGDN kernel gap; experimental fork needed for full speed\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eQwen3-30B-A3B\u003c/td\u003e\n          \u003ctd\u003ePure MoE (3.3B active)\u003c/td\u003e\n          \u003ctd\u003eNVFP4 / FP8 / BF16 (~16–60 GB)\u003c/td\u003e\n          \u003ctd\u003e✅ easily\u003c/td\u003e\n          \u003ctd\u003e32–50\u003c/td\u003e\n          \u003ctd\u003eSolid single-node option; no GDN\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003egpt-oss-120b\u003c/td\u003e\n          \u003ctd\u003eSparse MoE (5.1B active)\u003c/td\u003e\n          \u003ctd\u003emxfp4 (~61 GB)\u003c/td\u003e\n          \u003ctd\u003e✅\u003c/td\u003e\n          \u003ctd\u003e32–60\u003c/td\u003e\n          \u003ctd\u003e128K context; proprietary quant format\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eQwen3.5-122B-A10B\u003c/td\u003e\n          \u003ctd\u003ePure MoE (10B active)\u003c/td\u003e\n          \u003ctd\u003eNVFP4 only (~75 GB)\u003c/td\u003e\n          \u003ctd\u003e✅\u003c/td\u003e\n          \u003ctd\u003eup to 51\u003c/td\u003e\n          \u003ctd\u003eBF16 is 234 GB — does not fit; NVFP4 is the only path\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eQwen3-235B-A22B\u003c/td\u003e\n          \u003ctd\u003ePure MoE (22B active)\u003c/td\u003e\n          \u003ctd\u003eGPTQ-Int4 (~60 GB/node)\u003c/td\u003e\n          \u003ctd\u003e✅ (two nodes)\u003c/td\u003e\n          \u003ctd\u003e17–36 agg\u003c/td\u003e\n          \u003ctd\u003eRequires two DGX Sparks; best quality available\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eQwen3.5-397B-A17B\u003c/td\u003e\n          \u003ctd\u003ePure MoE (17B active)\u003c/td\u003e\n          \u003ctd\u003eNVFP4 (TP=2)\u003c/td\u003e\n          \u003ctd\u003e✅ (two nodes)\u003c/td\u003e\n          \u003ctd\u003eUnknown\u003c/td\u003e\n          \u003ctd\u003eSM121 MoE kernel not yet optimized; not recommended\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003ch2 id=\"key-observations\"\u003eKey observations\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eThroughput vs quality tradeoff at 
single-node:\u003c/strong\u003e Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) with pure MoE architecture. Qwen3.5-122B-A10B gives the most capable model (10B active parameters) that fits on one node, at 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation — so 51 tok/s is more than sufficient.\u003c/p\u003e","title":"DGX Spark Model Comparison: What Fits and What Runs (SM121, 128 GB)"},{"content":"We are running a pilot of AWS DevOps Guru paired with Amazon Q across a federal AWS estate.\nDevOps Guru provides ML-driven anomaly detection and automated root cause analysis. Rather than relying on manually defined alert thresholds, it builds a baseline from operational data and flags deviations — reducing noise and surfacing issues that threshold-based alerting misses.\nAmazon Q brings generative AI into engineer troubleshooting workflows. When an anomaly is flagged, engineers can query Amazon Q directly for accelerated diagnosis — pulling in relevant runbooks, log context, and suggested remediation paths without switching tools.\nBenchmark against Datadog Watchdog — the pilot includes a structured head-to-head comparison against Datadog Watchdog across quantified cost and security scenarios. Evaluation criteria include detection accuracy, time-to-diagnosis, alert fatigue, and total cost of ownership.\nResults from the benchmark will inform a longer-term AIOps tooling decision for the environment.\n","permalink":"https://conselara.dev/notes/aws-devops-guru-amazon-q-pilot/","summary":"\u003cp\u003eWe are running a pilot of AWS DevOps Guru paired with Amazon Q across a federal AWS estate.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDevOps Guru\u003c/strong\u003e provides ML-driven anomaly detection and automated root cause analysis. Rather than relying on manually defined alert thresholds, it builds a baseline from operational data and flags deviations — reducing noise and surfacing issues that threshold-based alerting misses.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAmazon Q\u003c/strong\u003e brings generative AI into engineer troubleshooting workflows. When an anomaly is flagged, engineers can query Amazon Q directly for accelerated diagnosis — pulling in relevant runbooks, log context, and suggested remediation paths without switching tools.\u003c/p\u003e","title":"Piloting AWS DevOps Guru and Amazon Q for AIOps"},{"content":"The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don\u0026rsquo;t. This is a practical guide based on what we\u0026rsquo;ve tested in production.\nThe key constraint: SM121 kernel compatibility Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:\nMarlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4 CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121.\nQuantization format decision tree Is the checkpoint from openai/gpt-oss-*? ├── Yes → --quantization=mxfp4 (gpt-oss pre-quantized uint8 format) └── No → Is there a GPTQ-Int4 checkpoint available? ├── Yes → --quantization=gptq_marlin ✅ recommended └── No → FP8 checkpoint available? 
├── Yes → --quantization=fp8 (test carefully on SM121) └── No → BF16 (no flag; fits only if weights \u0026lt; ~100 GB) Do not use --quantization=mxfp4 on standard HuggingFace BF16 checkpoints. The NGC 26.04 mxfp4 weight loader only handles gpt-oss\u0026rsquo;s proprietary 3D uint8 tensor format. BF16 HF checkpoints will crash with IndexError in fused_moe/layer.py.\nModel comparison for SM121 (single node, 128 GB) Model Architecture Quantization Fits 128 GB Expected tok/s Notes gpt-oss-120b Sparse MoE (5.1B active) mxfp4 ✅ ~61 GB 32–60 128K context; proprietary quant Qwen3-235B-A22B Pure MoE (22B active) GPTQ-Int4 (TP=2) ✅ ~60 GB/node 17–36 Two nodes required; best quality Qwen3-30B-A3B Pure MoE (3.3B active) NVFP4 / FP8 / BF16 ✅ ~16–60 GB 32–50 Solid single-node option; no GDN Qwen3.6-27B Dense hybrid (GDN) FP8 (~28 GB) ✅ easily 14–21 (stock) / 136–200 (fork) GDN kernel gap; fork needed for full speed Qwen3.5-122B-A10B Pure MoE (10B active) NVFP4 (~75 GB) ✅ single node up to 51 Requires NVFP4 checkpoint + SM121 patches Qwen3.6-35B-A3B Pure MoE (3B active) FP8 (~35 GB) ✅ easily 100+ Pure MoE, no GDN; successor to Qwen3-30B-A3B Qwen3.5-397B-A17B Pure MoE (17B active) NVFP4 ✅ (TP=2) Unknown Not yet recommended — SM121 MoE kernel not optimized Architectures to prefer vs avoid Prefer: pure MoE — models using only standard MoE layers (no GDN, no Mamba) run fully through the Marlin kernel and are the most reliable choice on SM121. Examples: Qwen3-235B-A22B, Qwen3-30B-A3B, gpt-oss-120b, Mixtral variants.\nAvoid with stock NGC: GDN hybrid architectures — models with GatedDeltaNet (GDN) linear attention layers hit a kernel gap on SM121. Stock NGC produces 14–21 tok/s. If you need full speed from Qwen3.6-27B (~136–200 tok/s), the mitkox/vllm-dflash-ddtree experimental fork adds DFlash + DDTree speculative decoding for GDN, but it\u0026rsquo;s not yet production-stable.\nQwen3.6 model family Released April 2026. Two architecturally very different open-weight variants:\nQwen3.6-27B Qwen3.6-35B-A3B Architecture Dense hybrid (GDN) Pure MoE Active params per token 27B (all) ~3B FP8 weight size ~28 GB ~35 GB tok/s on DGX Spark 14–21 (stock) / 136–200 (fork) 100+ GDN kernel gap Yes No SM121 stock NGC Underperforms ✅ Fully supported No Qwen3.6-72B exists. As of May 2026, Qwen3.6 tops out at 27B dense and 35B-A3B MoE. 
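When in doubt about what exists, ask the Hub rather than a forum thread; a hedged one-liner against the public HuggingFace API (assumes jq):\ncurl -s \u0026#39;https://huggingface.co/api/models?search=Qwen3.6\u0026amp;limit=50\u0026#39; | jq -r \u0026#39;.[].id\u0026#39;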
For a 70B+ class model on a single DGX Spark, the current best option is Qwen3.5-122B-A10B NVFP4 (10B active, 51 tok/s confirmed).\nSM121 hard rules # Never use --enforce-eager — disables CUDA graphs, ~55% throughput loss # Never set --gpu-memory-utilization above 0.90 — OOM on SM121 # MoE backend must be marlin — default produces garbage tokens on SM121 # Stay on driver 580.x — 590.x has a regression on this chip # Never use CUTLASS FP4 — silent garbage output Checkpoints to avoid Checkpoint Why nvidia/Qwen3-235B-A22B-NVFP4 vLLM parsing bug #22906; TRT-LLM only Any BF16 HF model with --quantization=mxfp4 mxfp4 loader only handles gpt-oss uint8 format FP8 model with FlashInfer FlashInfer crashes on SM121; use TRITON_ATTN Any model requiring --enforce-eager 55% throughput loss Useful community resources eugr/spark-vllm-docker — Ray GPU resource fix + SM121 patches jleighfields/vllm-dgx-spark — Qwen3-Coder-30B-A3B confirmed on DGX Spark NVIDIA \u0026ldquo;Stacked Sparks\u0026rdquo; guide — build.nvidia.com/spark/vllm/stacked-sparks NVIDIA DGX Spark Playbooks (DeepWiki) — Ray cluster, NCCL config, UMA tuning ","permalink":"https://conselara.dev/notes/dgx-spark-model-selection/","summary":"\u003cp\u003eThe DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don\u0026rsquo;t. This is a practical guide based on what we\u0026rsquo;ve tested in production.\u003c/p\u003e\n\u003ch2 id=\"the-key-constraint-sm121-kernel-compatibility\"\u003eThe key constraint: SM121 kernel compatibility\u003c/h2\u003e\n\u003cp\u003eNot all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eMarlin kernel\u003c/strong\u003e — stable, fast, supports GPTQ-Int4 and mxfp4\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCUTLASS FP4\u003c/strong\u003e — broken on SM121, produces garbage outputs silently; never use\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGDN (GatedDeltaNet)\u003c/strong\u003e — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003ePrefer \u003cstrong\u003epure MoE models\u003c/strong\u003e over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121.\u003c/p\u003e","title":"vLLM Model Selection for DGX Spark (SM121)"},{"content":"If you\u0026rsquo;re running SearXNG as a self-hosted search backend for an automated pipeline, the default engine selection will cause you problems quickly. Here\u0026rsquo;s what we\u0026rsquo;ve found running SearXNG 24/7 for a federal procurement intelligence pipeline.\nWhat doesn\u0026rsquo;t work Google — returns 403 Forbidden for bot-detected requests. Happens immediately on most self-hosted instances without aggressive Cloudflare bypass configuration. Don\u0026rsquo;t rely on it for automated queries.\nStartpage — CAPTCHAs after a few queries. Fine for occasional manual searches, unusable for scheduled pipelines.\ntime_range parameter — setting time_range=month or time_range=week causes Bing to return 0 results. The parameter appears to be handled inconsistently between engines; Bing\u0026rsquo;s implementation simply returns empty when it\u0026rsquo;s set. 
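Easy to reproduce from the shell before changing any code (host placeholder matches the Python below; assumes jq):\ncurl -s \u0026#39;http://your-searxng-host/search?q=test\u0026amp;format=json\u0026amp;engines=bing\u0026#39; | jq \u0026#39;.results | length\u0026#39; curl -s \u0026#39;http://your-searxng-host/search?q=test\u0026amp;format=json\u0026amp;engines=bing\u0026amp;time_range=month\u0026#39; | jq \u0026#39;.results | length\u0026#39; The first prints a normal count; the second prints 0.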
## What works reliably

```python
import urllib.parse, urllib.request, json, time

SEARXNG = "http://your-searxng-host/search"

def search(query: str) -> list[dict]:
    params = urllib.parse.urlencode({
        "q": query,
        "format": "json",
        "engines": "duckduckgo,bing",
        # Do NOT set time_range — breaks Bing
    })
    req = urllib.request.Request(
        f"{SEARXNG}?{params}",
        headers={"User-Agent": "Mozilla/5.0"}
    )
    try:
        data = json.loads(urllib.request.urlopen(req, timeout=15).read())
        return data.get("results", [])
    except Exception as e:
        print(f"Search failed: {e}")
        return []

# Always sleep between queries
results = search("your query here")
time.sleep(3)  # mandatory — skipping this triggers CAPTCHA within minutes
```

engines=duckduckgo,bing is the reliable combination. DuckDuckGo handles the bulk of results; Bing covers gaps. Together they're stable across thousands of queries per day.

## Rate limit in practice

From our experience running 20–40 queries per 3-hour cron window:

- Under 20 queries: stable, no CAPTCHA
- 20–40 queries with 3 s sleep between: generally stable
- Over 40 queries or sleep < 2 s: CAPTCHA within the session

Cap your query count per run slot. We use MAX_QUERIES = 20 as a hard limit (see the runner sketch at the end of this note).

## intitle: operator

intitle: works in DuckDuckGo, not Bing. But since you're running both engines, you still get value from Bing results alongside the intitle:-filtered DuckDuckGo results.

For federal procurement monitoring, intitle: queries are the highest-signal pattern:

```text
intitle:"sources sought" AHRQ OR ONC OR NIH
intitle:"request for information" "health IT"
intitle:"industry day" CMS OR FDA
```

Flag hits from intitle: queries as high priority — these are active pre-solicitations, not news articles.

## site: operator

site: does not work on most default SearXNG configurations. Queries like site:fda.gov contract 2026 return 0 results because neither DuckDuckGo nor Bing passes the operator through the SearXNG adapter correctly.

Replace with keyword queries:

| Instead of | Use |
|---|---|
| site:fda.gov IT contract 2026 | FDA HHS IT health technology contract award 2026 |
| site:ahrq.gov procurement | AHRQ health outcomes data cloud contract |

## Dynamic year

Never hardcode a year in query templates. Queries go stale on January 1 and you won't notice until you're looking at a month of zero results.

```python
import datetime
year = datetime.date.today().year
query = f'intitle:"sources sought" AHRQ {year}'
```

Permalink: https://conselara.dev/notes/searxng-engine-selection/
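Putting the pieces together, a minimal runner sketch that enforces the hard cap and the mandatory sleep. MAX_QUERIES and run_queries are our names; search() is the function defined above:

```python
import time

MAX_QUERIES = 20  # hard cap per run slot (see "Rate limit in practice")

def run_queries(queries: list[str]) -> dict[str, list[dict]]:
    """Run at most MAX_QUERIES searches with the mandatory 3 s gap."""
    out = {}
    for q in queries[:MAX_QUERIES]:
        out[q] = search(q)  # search() as defined above
        time.sleep(3)       # skipping this triggers CAPTCHA within minutes
    return out
```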
---

# Running Qwen3.5-122B on a Single DGX Spark

The NVIDIA DGX Spark (GB10 SoC, 128 GB unified LPDDR5X memory) can run Qwen3.5-122B-A10B — a 122B-parameter MoE model — at usable throughput for production workloads. Here's what it actually takes.

## The key constraint: NVFP4 only

Qwen3.5-122B-A10B at full precision is ~250 GB. In NVFP4 quantization it's ~75 GB, which fits comfortably in 128 GB unified memory. There is no other quantization path that both fits and runs correctly on the GB10.

The only verified checkpoint we've found: bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 on HuggingFace, which includes 15 patches for the SM121 architecture. Use this; don't try to quantize the base model yourself unless you're prepared to debug SM121-specific kernel failures.

## SM121 vs SM120

The DGX Spark GB10 is SM121. Most documentation (NVIDIA's included) targets SM120 (DGX Spark B200). They're close but not identical — several vLLM optimizations that work on SM120 either fail silently or crash on SM121.

Hard rules for SM121:

```text
# Never add --enforce-eager — it disables CUDA graphs and tanks throughput
# Never set --gpu-memory-utilization above 0.90 — OOM above this on SM121
# MoE backend must be marlin — default produces garbage tokens on SM121
# Stay on NVIDIA driver 580.x — 590.x has a regression on this chip
```

```bash
--moe-backend marlin \
--gpu-memory-utilization 0.85 \
--max-model-len 8192
```

## Docker run command

```bash
docker run --rm -it \
  --gpus all \
  --shm-size=16g \
  -v /path/to/model:/model \
  -p 8000:8000 \
  bjk110/vllm-spark:latest \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --served-model-name qwen3.5-122b \
    --dtype auto \
    --moe-backend marlin \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --port 8000
```

A Python smoke test against this endpoint is at the end of this note.

## Performance

On a single DGX Spark (no clustering):

| Metric | Result |
|---|---|
| Throughput | ~51 tok/s (generation, single stream) |
| TTFT | ~2–4 s (8K context) |
| Memory at load | ~92 GB / 128 GB |
| Stable uptime | Yes — runs indefinitely, no OOM |

51 tok/s is fast enough for agentic workloads where the bottleneck is tool calls, not token generation.

## Known warnings at startup

These appear in logs and are safe to ignore:

```text
UserWarning: flashinfer is not available, falling back to xformers
The model config specified num_hidden_layers=94 but the actual number...
```

The num_hidden_layers warning is a display artifact from the MoE architecture — the model loads and runs correctly.

## What doesn't work: Qwen3.6

Qwen3.6 was released in April 2026 with two variants — 27B dense and 35B-A3B MoE. There is no Qwen3.6-72B or Qwen3.6-122B. If you're looking for a 70B+ model in the Qwen3 family for a single DGX Spark, Qwen3.5-122B-A10B NVFP4 is currently the only option.

Qwen3.6-27B (dense) introduces a GDN hybrid attention architecture with a kernel gap on the GB10 SoC — as of May 2026, you'll get ~14–21 tok/s on stock NGC images. Qwen3.6-35B-A3B (pure MoE) runs at 100+ tok/s but tops out at 35B parameters.
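Once the container is up, it speaks vLLM's OpenAI-compatible API on port 8000. A minimal smoke-test sketch; the host and prompt are placeholders, and the model name matches --served-model-name from the run command above:

```python
import json, urllib.request

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps({
        "model": "qwen3.5-122b",  # matches --served-model-name
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 64,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
resp = json.loads(urllib.request.urlopen(req, timeout=60).read())
print(resp["choices"][0]["message"]["content"])
```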
Permalink: https://conselara.dev/notes/dgx-spark-qwen35-122b/

---

# USASpending API: How to Actually Query HHS Sub-Agencies

If you're building anything that pulls contract award data from USASpending.gov and you need to filter by HHS sub-agency (NIH, FDA, CDC, CMS, AHRQ), you've probably hit this: the subtier name filter returns nothing.

## What doesn't work

```python
body = {
    "filters": {
        "agencies": [
            {"type": "awarding", "tier": "subtier", "name": "Food and Drug Administration"}
        ]
    }
}
```

Returns {"results": [], "page_metadata": {"total": 0}} — every time, for every HHS component, regardless of how you spell the name. The API accepts the request without error.

This isn't a typo issue. We tried "FDA", "Food and Drug Administration", and the exact strings that appear in USASpending's own agency endpoint. Zero results every time.
## What works

Query at the toptier level (Department of Health and Human Services), then identify the sub-agency by parsing a 4-digit code out of the generated_internal_id field that comes back in each award record.

```python
body = {
    "filters": {
        "agencies": [
            {
                "type": "awarding",
                "tier": "toptier",
                "name": "Department of Health and Human Services"
            }
        ],
        "award_type_codes": ["A", "B", "C", "D"],
    },
    "fields": ["Award ID", "Recipient Name", "Award Amount", "End Date",
               "Description", "generated_internal_id"],
    "page": 1,
    "limit": 100,
    "sort": "End Date",
    "order": "desc"
}
```

Then parse the sub-agency from generated_internal_id:

```python
SUBTIER_CODE_MAP = {
    "7523": "CDC",
    "7524": "FDA",
    "7528": "AHRQ",
    "7529": "NIH",
    "7530": "CMS",
}

def agency_from_gid(gid: str) -> str:
    """Parse the 4-digit subtier code from generated_internal_id."""
    # Format: CONT_AWD_{award_id_parts}_{4digit_subtier_code}_{...}
    parts = gid.split("_")
    code = next((p for p in parts if len(p) == 4 and p.isdigit()), None)
    return SUBTIER_CODE_MAP.get(code, "hhs-other")

for award in results:
    gid = award.get("generated_internal_id", "")
    award["_agency"] = agency_from_gid(gid)
```

## Pagination warning

USASpending pagination is slow — each page takes 10–30 seconds. With 5 agencies and multiple pages, a full run can take 30+ minutes. For most use cases, page 1 is sufficient: you get the 100 most recent awards sorted by end date, which is what you want for recompete signal tracking. (A Python version with the same timeout-and-skip behavior is at the end of this note.)

```bash
# Always set --max-time; skip and continue on timeout
curl -sk --max-time 30 -X POST \
  "https://api.usaspending.gov/api/v2/search/spending_by_award/" \
  -H "Content-Type: application/json" \
  -d "$body"
```

Don't loop through hasNext. You'll time out before you finish, and the marginal signal in pages 2+ doesn't justify it.

## ONC note

ONC (Office of the National Coordinator for Health IT) has no direct contract awards under IT professional services NAICS codes. ONC contracts in this space flow through AHRQ (subtier code 7528). If you're tracking ONC recompetes, look for AHRQ awards with ONC-relevant descriptions.
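The Python equivalent of the curl call, with the same hard timeout and skip-on-failure behavior. fetch_hhs_awards is our name for it; body is the toptier payload defined above:

```python
import json, urllib.request

URL = "https://api.usaspending.gov/api/v2/search/spending_by_award/"

def fetch_hhs_awards(body: dict) -> list[dict]:
    """POST the toptier query; on timeout or error, skip and return nothing."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        data = json.loads(urllib.request.urlopen(req, timeout=30).read())
    except Exception as e:
        print(f"USASpending request failed, skipping: {e}")
        return []
    return data.get("results", [])

results = fetch_hhs_awards(body)  # then tag with agency_from_gid as above
```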
Permalink: https://conselara.dev/notes/usaspending-hhs-subtier-queries/

---

# SAM.gov Opportunities API: The NAICS Filter Does Nothing

The SAM.gov Opportunities API v2 accepts a naics query parameter. The documentation implies it will filter results to matching NAICS codes. It doesn't.

## What actually happens

```text
GET https://api.sam.gov/prod/opportunities/v2/search?api_key=...&naics=541511&limit=100
```

Returns opportunities with NAICS codes 541330, 561210, 711510, and anything else currently open — not just 541511. The filter is parsed, no error is returned, but the result set is unfiltered.

This appears to be a persistent bug in the v2 API. It isn't mentioned in the documentation.
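You can confirm the behavior on your own key by counting the distinct NAICS codes that come back with the filter set. A quick check, reusing the same endpoint and response fields as the fix below:

```python
import json, urllib.request

SAM_KEY = "your_api_key"
url = (f"https://api.sam.gov/prod/opportunities/v2/search"
       f"?api_key={SAM_KEY}&naics=541511&limit=100&postedFrom=01/01/2026")

req = urllib.request.Request(url, headers={"Accept": "application/json"})
opps = json.loads(urllib.request.urlopen(req, timeout=30).read()).get("opportunitiesData", [])

codes = {str(o.get("naicsCode", "")) for o in opps}
print(f"{len(opps)} results, NAICS codes returned: {sorted(codes)}")
# If the filter worked, this set would be {'541511'}. It won't be.
```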
## Fix: filter client-side

Fetch without the naics parameter and apply your own filter on the naicsCode field in the response:

```python
import urllib.request, json

SAM_KEY = "your_api_key"
TARGET_NAICS = {'541511', '541512', '541513', '541519', '518210'}

url = f"https://api.sam.gov/prod/opportunities/v2/search?api_key={SAM_KEY}&limit=100&postedFrom=01/01/2026"
req = urllib.request.Request(url, headers={"Accept": "application/json"})
data = json.loads(urllib.request.urlopen(req, timeout=30).read())

all_opps = data.get("opportunitiesData", [])
filtered = [o for o in all_opps if str(o.get("naicsCode", "")) in TARGET_NAICS]
print(f"Total returned: {len(all_opps)}, matching NAICS: {len(filtered)}")
```

The naicsCode field in the response payload is accurate — the problem is only with the server-side query filter.

## Agency and deadline fields are also unreliable

While we're here: fullParentPathName (the agency name) is frequently empty even when the opportunity clearly belongs to a specific agency, and responseDeadLine is missing on pre-solicitations.

If you need accurate agency/deadline data, enrich after the fact by querying for the specific notice ID:

```python
def enrich_opp(opp_id, api_key):
    url = f"https://api.sam.gov/prod/opportunities/v2/search?api_key={api_key}&noticeid={opp_id}&limit=1"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    data = json.loads(urllib.request.urlopen(req, timeout=10).read())
    opps = data.get("opportunitiesData", [])
    if not opps:
        return {}
    o = opps[0]
    return {
        "agency": o.get("fullParentPathName") or o.get("departmentName", ""),
        "deadline": o.get("responseDeadLine") or o.get("archiveDate", ""),
        "naics": str(o.get("naicsCode", "")),
    }
```

Stay under 10 requests/second. A time.sleep(0.15) between calls keeps you safely within the rate limit.
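For example, enriching the client-side-filtered list while staying under the rate limit. A sketch: filtered, enrich_opp, and SAM_KEY come from the snippets above, and the noticeId field name is our assumption about the v2 result shape:

```python
import time

enriched = []
for opp in filtered:                     # from the client-side filter above
    notice_id = opp.get("noticeId", "")  # assumption: v2 results expose noticeId
    extra = enrich_opp(notice_id, SAM_KEY) if notice_id else {}
    enriched.append({**opp, **extra})
    time.sleep(0.15)                     # stays under 10 requests/second
```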
Permalink: https://conselara.dev/notes/samgov-naics-filter-broken/

---

# About

## What is this

Conselara Labs is the R&D arm of Conselara, LLC — a federal IT consulting firm focused on health IT modernization and AI adoption for HHS agencies. It's where we work out hard problems before they become products or contract deliverables.

Federal AI has a real gap between what demos can do and what works in production. Systems have to operate at scale, under compliance requirements, without relying on third-party infrastructure that can't be vetted or controlled. A lot of AI tooling that performs well in commercial settings doesn't translate to federal environments. The lab exists to close that gap — to understand what actually works, on real hardware, under realistic constraints, before it touches a federal program.

## What we're working on

**On-prem AI inference** — running large open-weight models on NVIDIA DGX hardware via vLLM. Models in active use include Qwen3-235B-A22B, gpt-oss-120b, and Qwen3.5-122B. We benchmark throughput, latency, and quantization tradeoffs against real production constraints — air-gap capable, data-sovereign, no third-party inference APIs.

**Autonomous agents** — building agent systems that ingest federal data sources, run continuous analysis pipelines, and deliver structured intelligence outputs entirely on-prem. Tooling covers web search, knowledge base retrieval, document ingestion, and structured report generation.

**AI integration into federal platforms** — integrating LLMs into production health research platforms: publication discovery, AI-assisted development, and evaluation of AI-powered CMS modules. Includes head-to-head model evaluation under federal compliance constraints.

**AWS cloud** — designing and operating federally compliant AWS environments: FedRAMP-aligned architectures, ATO support, serverless and managed services (Lambda, S3, CloudFront, RDS), AI/ML services (Bedrock, SageMaker), AIOps (DevOps Guru, Amazon Q), and legacy migration to AWS.

## What we publish

The notes on this site are working findings — real API bugs, model performance numbers, infrastructure tradeoffs. If we hit a wall with an undocumented API behavior or spent three days on a configuration that should have taken twenty minutes, we write it up so the next person doesn't have to. The posts aren't marketing. They're documentation of what we ran into.

Prototypes built in the lab move into production on existing federal health IT contracts. We're also producing research and white papers drawing on lab findings and contract experience, with subject matter experts from the federal health IT space contributing to that work.

## The business

Conselara, LLC holds a GSA Multiple Award Schedule (MAS) contract (47QTCA22D0051) covering IT professional services and health IT consulting. We work with federal agencies on health data modernization, system integration, and emerging technology adoption.

Questions or corrections: info@conselara.com

Permalink: https://conselara.dev/about/