We moved an LLM backend from Claude Sonnet 4.6 to Opus 4.8. The model is configurable through a single environment variable, so this should have been a one-line change. Instead, every call started returning HTTP 400. The error, once we read the full response body: anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': '`temperature` is deprecated for this model.'}} Opus 4.8 no longer accepts the temperature parameter. Send it, even temperature=0.2, and the API rejects the whole request. Our code passed temperature on every call (a habit, for slightly more deterministic synthesis), so the model swap broke everything until we stopped sending it. ...
The MCP Tool That Timed Out at Five Seconds
We expose our internal knowledge base over both REST and MCP using fastapi-mcp, which mounts existing FastAPI routes as MCP tools: no second server, no protocol-translation proxy. Two tools matter here: search_kb (semantic retrieval) and ask_kb (retrieval plus an LLM synthesis call that returns a cited answer). search_kb worked flawlessly everywhere. ask_kb failed, intermittently, and the failure was maddeningly opaque. In the MCP client it surfaced as nothing more than: Command failed with no output No stack trace. No error payload. Just silence. And only sometimes. ...
Our FastAPI MCP Server Now Works in Claude Teams, claude.ai, and ChatGPT
In a previous post we rebuilt our company knowledge base server from an MCP SSE endpoint to a plain FastAPI REST server because MCP client support was too fragmented to be reliable. The conclusion was: REST is the pragmatic choice, MCP can come back when the ecosystem matures. The ecosystem has matured faster than expected. We added MCP back on top of the FastAPI server and it now works across every client simultaneously: claude.ai, Claude Teams (company-wide), Claude desktop, ChatGPT, and OpenWebUI. This is what we learned in the process. ...
Building a Two-Node Ray Cluster for Distributed LLM Inference on DGX Spark
Qwen3-235B-A22B-GPTQ-Int4 is ~118 GB. A single DGX Spark has 128 GB unified memory, enough in theory, but once CUDA overhead and KV cache are factored in, it’s tight. Running it across two Sparks with TP=2 gives headroom for real workloads. Each DGX Spark is a single logical GPU with no NVSwitch. Tensor parallelism across two units means Ray + NCCL over a direct interconnect. This is what the setup looks like and what will silently fail if not configured correctly. ...
Deploying a Hugo Site to S3 + CloudFront: What Actually Bit Us
We migrated a Hugo static site from a self-hosted nginx container on a local server to S3 + CloudFront. The motivation was simple: a static site has no business running on a server we have to patch. The migration took a few hours and involved four gotchas that aren’t obvious from the AWS documentation. This is a record of what we did and what tripped us up. The setup Hugo static site (PaperMod theme) S3 bucket with all public access blocked: Origin Access Control (OAC) only CloudFront distribution with ACM SSL cert Cloudflare DNS, gray cloud (DNS-only) Gitea self-hosted repo with a webhook-triggered deploy container on-prem The deploy flow on push: Gitea fires a webhook → container on saturn pulls the repo, runs hugo --minify, syncs to S3, invalidates CloudFront. ...
DGX Spark GB10 Hardware Reference: SM121 Architecture, Memory, and Networking
Reference for the NVIDIA DGX Spark GB10 Grace Blackwell Superchip (SM121). The DGX Spark shares the Blackwell name with datacenter hardware but is architecturally distinct. A lot of documentation, forum posts, and vLLM flags written for B100/B200 do not apply here. Some actively break things. SM121 is not datacenter Blackwell Feature DGX Spark (GB10 / SM121) Datacenter Blackwell (B100/B200) TMEM No Yes WGMMA No Yes DSMEM No Yes NVSwitch No Yes CUTLASS FP4 Broken: silent garbage output Supported Memory type Unified LPDDR5X (shared CPU+GPU) HBM3e (GPU-only) Memory per unit 128 GB 192 GB GPUs per unit 1 logical GPU 1 GPU When you see forum recommendations or vLLM flags that say “for Blackwell,” verify they’re for SM121 specifically before using them. ...
vLLM on DGX Spark: What the SM121 Architecture Actually Requires
The DGX Spark GB10 runs SM121, the Grace Blackwell Superchip. It is not the same silicon as datacenter Blackwell (SM100, H100/H200). SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121. This is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks: single-node and two-node cluster configurations. ...
We Replaced an MCP Server with FastAPI and It Worked Everywhere
We built an internal knowledge base server to give our AI agents access to Conselara’s company data: capabilities, past performance, GSA rates, certifications. The idea was straightforward: expose it as an MCP server so any AI client could query it semantically. It worked in Claude Code. It worked nowhere else. What MCP promises The Model Context Protocol is Anthropic’s open standard for connecting AI models to external tools and data sources. The pitch is compelling: define your server once, and any MCP-compatible client can call it. Claude Code has native MCP support. The ecosystem is growing. ...
AI Across a Health Research Information Platform
Federal health research platforms have a specific challenge with AI: the data is sensitive, the accuracy bar is high, and the compliance requirements are real. You cannot send protected health information to a commercial API, and you cannot publish AI-generated content to a national audience without review. But the volume of work, spanning literature, publications, content, and code, is large enough that ignoring AI entirely leaves real efficiency on the table. ...
DGX Spark Benchmark Results: vLLM on SM121
Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted. Qwen3-235B-A22B-GPTQ-Int4: Two-node cluster Date: 2026-05-03 Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87 Batch Avg completion tokens tok/s per request Aggregate tok/s 1 (serial) 256 17.0 17.0 2 (concurrent) 256 12.1 24.1 4 (concurrent) 256 9.1 36.4 Prefix cache: 97% delta hit rate on repeated system prompt. Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile). Weight resident per node: 57.64 GiB. ...