AI News — June 08, 2026: DeepSeek Finishes $1 Benchmark GPT-5.5 Hit at $22, Copilot Goes Per-Token

Good morning. The economics of inference are quietly becoming today’s main storyline — DeepSeek’s cost gap is back in the discourse, Microsoft is moving Copilot to per-token billing, and a fresh arXiv paper says agentic coding burns most of its tokens on review loops. Meanwhile, designers and engineers keep wrestling with what to do as Claude starts eating their workflows.

DeepSeek V4 Pro “beats” GPT-5.5 Pro, sort of. A post claiming DeepSeek V4 Pro outperforms GPT-5.5 Pro on precision got picked apart on Hacker News as thin, possibly auto-generated clickbait — four single-run tests with no temperature accounting or reproducible methodology. The genuinely interesting bit came from the comments: one engineer running a vulnerability scanning benchmark reported GPT-5.5 Pro burning roughly $22 per case before hitting a $100 budget cap, while DeepSeek V4 Pro finished the whole benchmark for about a dollar. Others said they’d switched to DeepSeek through Opencode and can’t really tell the difference for daily work, reserving GPT or Claude for the trickier 10%.
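For a sense of scale, here is a back-of-the-envelope sketch of the commenter's figures. It assumes the $22 per case was roughly constant and that a run stops once the next case would no longer fit under the cap; the thread states neither, and the function name is purely illustrative.

```python
def cases_within_budget(cost_per_case: float, budget_cap: float) -> int:
    """Whole cases that fit under the cap, assuming a constant per-case cost."""
    return int(budget_cap // cost_per_case)


# Figures as reported in the HN comment (all approximate).
GPT_COST_PER_CASE = 22.0   # USD per benchmark case
BUDGET_CAP = 100.0         # USD cap on the run
DEEPSEEK_TOTAL = 1.0       # USD for the entire benchmark

cases = cases_within_budget(GPT_COST_PER_CASE, BUDGET_CAP)
ratio = GPT_COST_PER_CASE / DEEPSEEK_TOTAL

print(f"GPT-5.5 Pro cases that fit under the cap: {cases}")
print(f"One GPT-5.5 Pro case vs. the whole DeepSeek run: {ratio:.0f}x")
```

Under those assumptions the capped run gets through only a handful of cases, which is why a single anecdote like this says more about cost than about quality.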

The Tokenpocalypse arrives. Right on cue, TechCrunch’s Equity podcast is calling Microsoft’s shift to per-token pricing for GitHub Copilot the start of a broader subsidy rollback. The discussion connects to the Uber budget blowups and Anthropic IPO questions we’ve been tracking all week — the open question is whether labs can drive costs down fast enough to meet what customers will actually pay, and how IPO-bound companies are supposed to disclose risk around a cost structure that changes month to month.

Where the tokens actually go. A new arXiv paper on tokenomics measured token consumption across 30 ChatDev tasks with GPT-5 and found that the Code Review stage alone consumed 59.4% of tokens, with input tokens making up 53.9% of the total. HN commenters said the numbers match their experience: one reported a roughly 10:1 input-to-output ratio in real agent use, with the agent reading a million tokens to patch a single line. Another said a GitHub Copilot allotment that used to last a month under flat pricing now runs out in two days under the new scheme.
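A minimal sketch of why input-heavy agent loops hurt under per-token billing. The 10:1 ratio and the million input tokens come from the HN comment; the per-million-token prices are hypothetical placeholders rather than any vendor's actual rates, and both function names are illustrative.

```python
def input_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of all tokens in a request that are input tokens."""
    return input_tokens / (input_tokens + output_tokens)


def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD given separate prices per million input and output tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# Commenter's scenario: about a million tokens read at a roughly 10:1 ratio.
INPUT_TOKENS = 1_000_000
OUTPUT_TOKENS = 100_000

# HYPOTHETICAL prices, chosen only to make the arithmetic easy to follow.
USD_PER_M_INPUT = 1.0
USD_PER_M_OUTPUT = 5.0

share = input_share(INPUT_TOKENS, OUTPUT_TOKENS)
total = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, USD_PER_M_INPUT, USD_PER_M_OUTPUT)
input_only = request_cost(INPUT_TOKENS, 0, USD_PER_M_INPUT, USD_PER_M_OUTPUT)

print(f"Input share of tokens: {share:.1%}")
print(f"Input share of cost:   {input_only / total:.1%}")
```

Even with output priced at five times input in this made-up example, reading still accounts for most of the bill, which fits the complaints about review-heavy agent loops.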

A fintech engineer watches their moat evaporate. A 10-year payments and PCI specialist wrote a widely shared post about how LLMs are eroding their software engineering career, describing how the domain expertise that once justified their title now feels redundant as models adequately connect the dots on complex financial systems. The HN thread split between engineers who say LLMs still fail on their specific local regulations and business logic, and others warning that today’s failure modes are a poor guide to the trajectory. One DevOps commenter described their firm cycling through “lots of LLM, even more, multi-agent, back to single agent” before landing on a measured workflow.

A Jane Street designer drops Figma for Claude. In a related vein, a designer at Jane Street wrote about building working prototypes in code with Claude instead of mocking them in Figma, treating prototypes as living proposal docs that engineers later reimplement. The author admits a worry that Claude pushes them toward iterative rather than exploratory design. Commenters noted Jane Street is an Anthropic investor, and several engineers grumbled that interpreting design intent from generated code adds cognitive overhead they didn’t have with written specs.

OpenAI’s super app, still cooking. TechCrunch reports OpenAI is preparing a revamped ChatGPT positioned as a “super app” bundling coding tools and agents, with one senior employee bluntly declaring “Chat is dead.” The plan reportedly involves funneling free users toward paid products like Codex and sidelining standalone bets like Sora — a consolidation play heading into the IPO window.

Speculative KV coding draws skepticism. A blog post introduced speculative KV coding, a lossless KV cache compression technique that uses a smaller predictor model and arithmetic coding on residuals to get up to ~4× compression, or ~8× combined with fp8. HN’s reaction was mostly “why bother” — KV cache can be offloaded to RAM or disk for milliseconds of latency, and recomputing even a draft cache stays quadratic in context length. One commenter pointed out the logical endpoint of the approach: use the original model as the predictor and get infinite compression, at infinite cost.
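The core idea is easier to see in a toy example. The sketch below is not the blog post's implementation: it uses random integers as a stand-in for quantized cache entries, fakes the predictor with small random errors, and reports empirical entropy (the size an ideal arithmetic coder would approach) instead of running a real coder. Every name in it is hypothetical.

```python
import math
import random
from collections import Counter


def empirical_entropy_bits(symbols):
    """Average bits per symbol an ideal entropy coder would need."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


random.seed(0)

# Toy stand-in for quantized KV cache entries (ASSUMPTION: 8-bit integers).
true_values = [random.randrange(256) for _ in range(10_000)]

# Hypothetical predictor: usually exact, occasionally off by one.
predicted = [v + random.choice([-1, 0, 0, 0, 1]) for v in true_values]

# Store only the residuals; the decoder reruns the predictor and adds them back.
residuals = [t - p for t, p in zip(true_values, predicted)]
reconstructed = [p + r for p, r in zip(predicted, residuals)]
assert reconstructed == true_values  # lossless round trip

raw_bits = empirical_entropy_bits(true_values)
residual_bits = empirical_entropy_bits(residuals)

print(f"Raw values: {raw_bits:.2f} bits/entry")
print(f"Residuals:  {residual_bits:.2f} bits/entry")
```

The better the predictor, the more the residuals cluster around zero and the fewer bits they cost. That is also the commenter's joke: a perfect predictor leaves nothing to store, but only because the decoder has redone all the work.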

Three more worth a look. An interactive piece on how LLMs do arithmetic without numbers uses probes on frozen activations to show how GCDs and similar operations get encoded in vector space, though the author is careful to note that readable doesn’t mean causal. Lathe, a Show HN, uses LLMs to generate hands-on tutorials you then work through manually in a local UI, framing the model as a teacher rather than a shortcut, which the HN crowd appreciated. Finally, The Verge has a piece on how AI-generated influencers are getting hard to spot, with personas like Aitana Lopez blending into normal lifestyle feeds while platforms fumble for a labeling policy.

That’s it for today. The pricing story is the one to watch — once Copilot’s per-token model has been running a few weeks, we’ll see whether developers stick around or whether DeepSeek’s dollar-per-benchmark math starts pulling real volume.