Conselara Labs

DeepSeek-V4-Flash on 2× DGX Spark with the stock NGC container: how far we got

Everyone running DeepSeek-V4-Flash on DGX Spark hardware today is doing it with community forks: patched vLLM builds, custom images, source-built kernels. The received wisdom is that the stock NVIDIA NGC container cannot serve this model on GB10 (SM121) because upstream vLLM rejects every sparse-MLA attention backend on that architecture (vllm-project/vllm#45317, still open). We ran the experiment with only nvcr.io/nvidia/vllm:26.06-py3 on a two-node DGX Spark cluster (TP=2 over the 200G QSFP link). Policy constraint: no forks, no nightlies, no patched kernels. Here is what actually happens. ...

Open vs. Closed: Choosing AI Models for the Data Boundary

We run frontier open-weight models on a desktop-class supercomputer. People assume that means we think local models have caught up to Claude, GPT, and Gemini. They haven’t, and pretending otherwise is the fastest way to make a bad architecture decision. This is how we actually compare models, open and closed, local and cloud, and how that comparison plays out for sensitive domains like federal health. The honest starting point: closed frontier still leads As of June 2026, the most capable general models are closed and cloud-hosted. On the Artificial Analysis Intelligence Index, the leaders are Claude Opus 4.8 (~61), GPT-5.5 (~60), and Gemini 3.1 Pro (~57). No open-weight model you can self-host reaches that tier. ...

Running gpt-oss-120b on a Single DGX Spark

gpt-oss-120b, a 117B-parameter / 5.1B-active MXFP4 mixture-of-experts model, runs comfortably on a single NVIDIA DGX Spark (GB10, 128 GB unified memory). At ~63 GB of weights it leaves room for a 131K context window, and its reasoning quality makes it a useful deep-reasoning node alongside a faster mid-size daily driver. The catch is that the SM121 configuration is unforgiving. Several defaults and forum recommendations either silently corrupt output or refuse to start. This is the setup that actually works on the stock NGC container, and the traps that cost us the most time. ...

Migrating to Claude Opus 4.8? Drop the temperature Parameter

We moved an LLM backend from Claude Sonnet 4.6 to Opus 4.8. The model is configurable through a single environment variable, so this should have been a one-line change. Instead, every call started returning HTTP 400. The error, once we read the full response body: anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': '`temperature` is deprecated for this model.'}} Opus 4.8 no longer accepts the temperature parameter. Send it, even temperature=0.2, and the API rejects the whole request. Our code passed temperature on every call (a habit, for slightly more deterministic synthesis), so the model swap broke everything until we stopped sending it. ...

The MCP Tool That Timed Out at Five Seconds

We expose our internal knowledge base over both REST and MCP using fastapi-mcp, which mounts existing FastAPI routes as MCP tools: no second server, no protocol-translation proxy. Two tools matter here: search_kb (semantic retrieval) and ask_kb (retrieval plus an LLM synthesis call that returns a cited answer). search_kb worked flawlessly everywhere. ask_kb failed, intermittently, and the failure was maddeningly opaque. In the MCP client it surfaced as nothing more than: Command failed with no output No stack trace. No error payload. Just silence. And only sometimes. ...

Our FastAPI MCP Server Now Works in Claude Teams, claude.ai, and ChatGPT

In a previous post we rebuilt our company knowledge base server from an MCP SSE endpoint to a plain FastAPI REST server because MCP client support was too fragmented to be reliable. The conclusion was: REST is the pragmatic choice, MCP can come back when the ecosystem matures. The ecosystem has matured faster than expected. We added MCP back on top of the FastAPI server and it now works across every client simultaneously: claude.ai, Claude Teams (company-wide), Claude desktop, ChatGPT, and OpenWebUI. This is what we learned in the process. ...

Building a Two-Node Ray Cluster for Distributed LLM Inference on DGX Spark

Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory, enough in theory, but once CUDA overhead and KV cache are factored in, it’s tight. Running it across two Sparks with TP=2 gives headroom for real workloads. Each DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly. ...

Deploying a Hugo Site to S3 + CloudFront: What Actually Bit Us

We migrated a Hugo static site from a self-hosted nginx container on a local server to S3 + CloudFront. The motivation was simple: a static site has no business running on a server we have to patch. The migration took a few hours and involved four gotchas that aren’t obvious from the AWS documentation. This is a record of what we did and what tripped us up. The setup Hugo static site (PaperMod theme) S3 bucket with all public access blocked: Origin Access Control (OAC) only CloudFront distribution with ACM SSL cert Cloudflare DNS, gray cloud (DNS-only) Gitea self-hosted repo with a webhook-triggered deploy container on-prem The deploy flow on push: Gitea fires a webhook → container on saturn pulls the repo, runs hugo --minify, syncs to S3, invalidates CloudFront. ...

DGX Spark GB10 Hardware Reference: SM121 Architecture, Memory, and Networking

Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here. Some actively break things. SM121 is not datacenter Blackwell Feature DGX Spark (GB10 / SM121) Datacenter Blackwell (B100/B200) TMEM No Yes WGMMA No Yes DSMEM No Yes NVSwitch No Yes CUTLASS FP4 Broken: silent garbage output Supported Memory type Unified LPDDR5X (shared CPU+GPU) HBM3e (GPU-only) Memory per unit 128 GB 192 GB GPUs per unit 1 logical GPU 1 GPU When you see forum recommendations or vLLM flags that say “for Blackwell,” verify they’re for SM121 specifically before using them. ...

vLLM on DGX Spark: What the SM121 Architecture Actually Requires

The DGX Spark GB10 runs SM121, the Grace Blackwell Superchip. It is not the same silicon as datacenter Blackwell (SM100, H100/H200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121. This is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks: single-node and two-node cluster configurations. ...