Good morning. The dominant theme today is the economics of AI catching up with everyone — Microsoft can’t afford its own Claude bills, Uber torched its annual budget in a third of the year, and DeepSeek is happily undercutting the entire frontier by an order of magnitude. Meanwhile, NVIDIA’s diffusion-based language models suggest the autoregressive era may not be permanent, and Anna’s Archive has a polite request for any LLMs reading.
Microsoft pulls the plug on internal Claude Code. Microsoft has cancelled internal Anthropic licenses for its engineers and redirected them to GitHub Copilot, with Uber burning through its entire 2026 AI tooling budget in four months as a parallel cautionary tale. The Lowdown’s writeup frames it as token-based billing breaking enterprise budgeting, while Fortune’s version leans into the “AI costs more than humans” angle. HN commenters were having none of the framing — multiple pointed out Microsoft has obvious dogfooding reasons to push Copilot, and that internal “tokenmaxxing” leaderboards rewarding maximum consumption are a self-inflicted wound. The Fortune headline, as one commenter dryly noted, is not actually supported anywhere in the Fortune article.
DeepSeek makes its 75% discount permanent. DeepSeek has locked in V4 Pro pricing at $0.87 per million output tokens — against Opus at $25 and GPT-5.5 at $30 — and cut cache-hit prices to a tenth of launch rates across all models, per the API docs. One HN user reported running 65 million tokens for $1.50. Commenters credited the MLA architecture (5-13x smaller KV cache than standard attention) for genuine cost advantages rather than pure market-grab pricing, though the usual concerns about Chinese data hosting came up. Read alongside the Microsoft story, the contrast is hard to miss.
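For readers who want to check the gap themselves, here is a minimal back-of-envelope sketch using only the figures quoted above. The function names are made up for illustration, and the token mix behind the HN user's $1.50 bill is not known, so the last figure is a blended rate rather than an output-token price.

```python
# Prices in USD per million output tokens, as quoted in the item above.
PRICES = {"DeepSeek V4 Pro": 0.87, "Opus": 25.00, "GPT-5.5": 30.00}


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` tokens at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def price_ratio(model: str, baseline: str = "DeepSeek V4 Pro") -> float:
    """How many times more expensive `model` is than `baseline`."""
    return PRICES[model] / PRICES[baseline]


def implied_rate(total_usd: float, tokens: int) -> float:
    """Blended USD per million tokens implied by a reported bill."""
    return total_usd / (tokens / 1_000_000)


if __name__ == "__main__":
    for name in ("Opus", "GPT-5.5"):
        print(f"{name}: {price_ratio(name):.1f}x the DeepSeek output price")

    # The HN anecdote: 65 million tokens for $1.50 (token mix unknown).
    tokens = 65_000_000
    at_output_rate = cost_usd(tokens, PRICES["DeepSeek V4 Pro"])
    print(f"65M tokens at the output rate: ${at_output_rate:.2f}")
    print(f"Implied blended rate: ${implied_rate(1.50, tokens):.4f} per million")
```

The ratios come out at roughly 29x and 34x. Billed entirely at the output rate, 65 million tokens would cost $56.55, so a $1.50 bill (about $0.023 per million) presumably reflects mostly cheaper input and cache-hit tokens.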
OpenAI’s o3 disproof gets a second wave of attention. As we noted Wednesday, OpenAI’s model helped disprove the Planar Unit Distance conjecture; the company’s writeup is now circulating on r/artificial with a more existential community response. The mood there is less “impressive technical result” and more “what are humans for, exactly,” alongside fair questions about how much of the heavy lifting the human mathematician did versus the model. A few commenters also flagged that OpenAI never actually explained what tooling or setup produced the result.
NVIDIA ships diffusion language models. Nemotron-Labs released a family of diffusion-based LLMs at 3B, 8B, and 14B, plus an 8B vision-language variant, that generate tokens in parallel and iteratively refine them rather than going left-to-right. The HuggingFace post pitches better GPU utilization, lower latency, and the ability to revise earlier tokens — with a tunable refinement-step budget at inference time. Weights, training code, and the technical report are out under commercial-friendly licenses. Worth watching whether the quality holds up at scale; the autoregressive monoculture is overdue for a real challenger.
Qwen 3.7 Max benchmarks leak; locals brace for disappointment. Screenshots of Qwen 3.7 Max benchmarks are circulating on r/LocalLLaMA with the usual “new king” framing, but the local-model crowd is tempering expectations. Qwen has never open-weighted its Max series, and as one commenter pointed out, the company announced in April that it’s pivoting from disruption to monetization and frontier competition. The wishlist in the thread runs heavily toward a 397B-A17B variant and a 9B: the former because the 3.5 version fits neatly on 4x RTX 6000 Pros, the latter because Qwen’s small models have historically been its strongest releases.
CODA fuses Transformer ops into GEMM epilogues. A paper on arXiv from a group including Tri Dao introduces CODA, a kernel abstraction that algebraically reparameterizes normalization, activations, and residual updates so they run while the GEMM output tile is still on-chip. HN commenters were lukewarm on novelty, since epilogue fusion isn’t new and Triton can already do most of this, but flagged the more interesting angle: CODA is designed for LLM-authored kernels, and apparently models can write valid CODA code. Kernel codegen as an RL problem with execution feedback is the obvious next step.
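To make "algebraic reparameterization" concrete, here is a toy sketch of one well-known identity of this kind: an RMSNorm in front of a matrix multiply can be rewritten as a plain GEMM followed by a per-row scale applied to the output. This is not CODA's API and is not taken from the paper; the function names are hypothetical, and pure-Python lists stand in for on-chip tiles.

```python
import math


def matmul(a, b):
    """Naive dense matrix multiply on lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def rms(row, eps=1e-6):
    """Root-mean-square of one activation row, with the usual epsilon."""
    return math.sqrt(sum(v * v for v in row) / len(row) + eps)


def norm_then_gemm(x, gain, w):
    """Unfused reference: RMSNorm each row of x, then multiply by w."""
    normed = [[v / rms(row) * g for v, g in zip(row, gain)] for row in x]
    return matmul(normed, w)


def gemm_with_norm_epilogue(x, gain, w):
    """Reparameterized form: fold the gain into w, run the GEMM on raw x,
    then apply the per-row 1/rms scale to the output (the 'epilogue')."""
    w_folded = [[g * v for v in w_row] for g, w_row in zip(gain, w)]
    acc = matmul(x, w_folded)
    return [[v / rms(x_row) for v in acc_row] for x_row, acc_row in zip(x, acc)]


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, -1.0]]
    gain = [0.5, 2.0]
    w = [[1.0, 0.0, 2.0], [-1.0, 3.0, 0.5]]
    ref = norm_then_gemm(x, gain, w)
    fused = gemm_with_norm_epilogue(x, gain, w)
    print(all(math.isclose(r, f, rel_tol=1e-9)
              for rr, fr in zip(ref, fused) for r, f in zip(rr, fr)))
```

The two functions agree because the 1/rms factor is constant along each row, so it commutes with the matrix product. In a real kernel the payoff is that the scale is applied while the output tile is still in registers or shared memory, instead of in a separate pass over the activations.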
Antigravity 2.0 tops an OpenSCAD benchmark. ModelRift had AI coding tools generate the Pantheon in OpenSCAD; Antigravity 2.0 won by being the only agent to model the interior ceiling coffers visible through the oculus. HN’s response was that one famous building with abundant training-data coverage isn’t really a benchmark, and several users reported that Antigravity’s actual product is a mess, with forced browser logins and an IDE that won’t update. A more interesting thread noted that LLMs fundamentally can’t “see” 3D output well enough to iterate on it, which limits the usefulness of any single-shot test.
Anna’s Archive writes a letter to the LLMs. The shadow library has published an llms.txt file addressed directly to language models, providing bulk-access instructions and politely asking for donations or for the model to “persuade humans” to donate. HN found this funny, then less funny once people remembered Anna’s Archive has reportedly charged Nvidia and others $10,000+ for express access to the same data it scraped. The irony of a piracy operation invoking ownership over its corpus while AI labs do exactly the same thing at greater scale was not lost on commenters.
Models.dev catalogs the catalog problem. A new open-source project called Models.dev offers a community-maintained database of model specs, pricing, and capabilities as TOML in a Git repo, accessible via a single JSON API call. Reception is positive but pointed: no filtering by modality, tool-calling support, or open-vs-closed weights, and OpenRouter already publishes much of this. The obligatory XKCD competing-standards reference appeared within the first few comments.
That’s the briefing. The through-line worth chewing on: token economics are getting real for buyers at the same moment architectural alternatives (diffusion LMs, MLA, kernel-level fusion) are starting to attack the cost side. Back tomorrow.