AI News — July 05, 2026: GPT-5.5 Cuts Reasoning at 516 Tokens, Opus 4.8 Breaks Tool Calls

Good morning. Today’s theme: things quietly falling apart at the seams. GPT-5.5 Codex may be short-circuiting its own reasoning, newer Claude models are getting worse at tool calls, and someone thinks their enterprise Claude session bled into a stranger’s Minecraft project. Also: Alibaba is banning Claude Code, Midjourney wants to see Hollywood’s homework, and a Lovable engineer burned $85K in tokens to tell you human code review is dead.

GPT-5.5 Codex is stopping mid-thought. An analysis posted to GitHub of 390,000+ Codex responses shows GPT-5.5 terminating reasoning at exactly 516 tokens, or at fixed 518-token steps above that (1034, 1552), in 44% of responses, versus 1.3% for other models. The pattern emerged sharply in May 2026 and correlates with wrong answers on harder tasks; users on the HN thread independently reproduced the short-circuit and reported quality drops. The leading theory is a batching optimization at 512-token boundaries, and as one commenter put it: “I swear some days ago someone here claimed OpenAI succeeded cutting down their compute cost by half with a breakthrough optimization. So this is it?”
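If you want to check your own logs for the same pattern, a rough sketch follows. It assumes you already have a list of per-response reasoning token counts; the function names are hypothetical, and extending the 516, 1034, 1552 series in 518-token steps beyond the three reported values is our assumption, not something the analysis states.

```python
# Values reported in the item above. The 518-token step is inferred from
# 516 -> 1034 -> 1552 and is an assumption beyond those three values.
SUSPECT_BASE = 516
SUSPECT_STEP = 518


def is_suspect(reasoning_tokens: int) -> bool:
    """Return True if the count is 516, 1034, 1552, ... (516 + 518 * k)."""
    if reasoning_tokens < SUSPECT_BASE:
        return False
    return (reasoning_tokens - SUSPECT_BASE) % SUSPECT_STEP == 0


def suspect_share(token_counts) -> float:
    """Fraction of responses whose reasoning length lands on a suspect value."""
    counts = list(token_counts)
    if not counts:
        return 0.0
    return sum(is_suspect(n) for n in counts) / len(counts)
```

A share far above what other models show on comparable prompts would be worth a closer look; on its own, a single number proves nothing about cause.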

Better models, worse tool calls. Armin Ronacher documented that Anthropic’s newer Opus 4.8 and Sonnet 5 invent extra, invalid fields when calling tools in his Pi editor, where older models complied cleanly. His theory: these models were trained in more forgiving harnesses, so the runtime has quietly become part of the model’s interface. HN commenters were pragmatic: Aider has had forgiving-formatter code for years, and good error messages usually get models to self-correct in a second or two. One commenter wondered why anyone still lets models emit tool calls as raw text when grammar-constrained decoding exists.
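For readers wondering what "forgiving" versus "good error message" looks like in a harness, here is a minimal sketch. It is not taken from Pi or Aider; `validate_tool_args` and its parameters are hypothetical, and a real harness would likely validate types as well as field names.

```python
def validate_tool_args(args, allowed, required, forgiving=True):
    """Check a tool call's arguments against a set of known field names.

    Returns (cleaned_args, message). cleaned_args is None when the call
    should be rejected; message is text to send back to the model, or None.
    """
    extra = sorted(set(args) - set(allowed))
    missing = sorted(set(required) - set(args))

    # Missing required fields cannot be repaired by the harness.
    if missing:
        return None, f"Missing required fields: {', '.join(missing)}."

    # Strict mode: reject and tell the model exactly what is allowed,
    # so it can retry with a corrected call.
    if extra and not forgiving:
        return None, (
            f"Unknown fields: {', '.join(extra)}. "
            f"Allowed fields: {', '.join(sorted(allowed))}."
        )

    # Forgiving mode: silently keep only the fields the tool understands.
    cleaned = {k: v for k, v in args.items() if k in allowed}
    note = f"Dropped unknown fields: {', '.join(extra)}." if extra else None
    return cleaned, note
```

The forgiving path keeps the session moving but hides the drift; the strict path costs a retry but makes the model's mistake visible.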

“Confident this is a hallucination.” A Claude Code user with an Enterprise Zero Data Retention workspace filed a bug claiming their session suddenly started asking about building a Minecraft temple — a task they had no connection to. Anthropic’s Thariq responded that the team is confident it’s a hallucination but is investigating. The HN discussion is split, with several users reporting similar cross-session weirdness on Gemini and other providers, and one throwaway account claiming knowledge of two prior incidents where intermediate infrastructure “swapped” responses between users on Claude and GPT.

Alibaba bans Claude Code. Alibaba will prohibit employees from using Claude Code starting July 10, classifying it as high-risk and steering staff toward its in-house Qoder tool. The move follows Anthropic’s own policy blocking Chinese companies from its models, plus the recent flap over a Claude Code feature that flagged Chinese users — which Anthropic described as an anti-abuse experiment against unauthorized resellers and distillation attempts.

Midjourney wants to see the studios’ AI stack. In its ongoing copyright fight with Disney, Universal, and Warner Bros., Midjourney is asking the court to compel discovery on the studios’ own internal AI usage — storyboarding tools, model training, anything not consumer-facing. The argument: if studios are quietly training on unlicensed copyrighted content themselves, their market-harm claims weaken and Midjourney’s fair use defense strengthens. The studios’ attorney called it a fishing expedition.

$85,000 in tokens to say “let the AI review it.” A Lovable engineer wrote up scaling from $600/month to $85K in personal token spend, now running 6–7 agents that ship 150+ merged PRs a week, with AI reviews replacing human line-by-line review. The HN thread was not kind: commenters called it marketing dressed as a case study, questioned raw PR count as a metric, and pushed back hard on the “solve AI-generated problems with more AI” logic for anything business-critical. One reply: “Instead of employing an engineer for a year we burned an obscene amount of resources to generate code which will enable vibe coders to burn more resources.”

Fanfic detectives, and an append-only agent. AO3 writers are waging a detection war on suspected Claude-generated fanfic, with an anonymous X account distributing a browser skin that claims to spot coding artifacts — while collaterally flagging human writers for the crime of using em dashes. On the more constructive end, BabyAGI creator Yohei Nakajima published ActiveGraph, an agent framework where an append-only event log is the source of truth and the working graph is a deterministic projection, enabling replay, forking, and full lineage — properties standard memory/retrieval setups can’t offer.
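The append-only idea is easy to picture in a few lines. The sketch below is not ActiveGraph's API; the `Event`, `EventLog`, and `project` names and the two event kinds are invented here purely to illustrate a log as source of truth with the graph derived from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str       # "add_node" or "add_edge" (hypothetical event kinds)
    payload: tuple  # (node_id, label) or (src_id, dst_id)


class EventLog:
    """Append-only list of events; nothing is ever edited or removed."""

    def __init__(self, events=()):
        self._events = list(events)

    def append(self, event: Event) -> None:
        self._events.append(event)

    def events(self) -> tuple:
        return tuple(self._events)

    def fork(self, upto=None) -> "EventLog":
        """New log sharing history up to index `upto` (whole log if None)."""
        return EventLog(self._events[:upto])


def project(events):
    """Deterministically rebuild the working graph from the events."""
    nodes, edges = {}, set()
    for event in events:
        if event.kind == "add_node":
            node_id, label = event.payload
            nodes[node_id] = label
        elif event.kind == "add_edge":
            src, dst = event.payload
            edges.add((src, dst))
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return nodes, edges
```

Because `project` depends only on the events, replaying the same log always yields the same graph, and a fork can diverge without touching the original history.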

That’s a lot of frontier models behaving oddly in one morning. If your Codex output feels dumber this week, you’re probably not imagining it.