AI News — June 06, 2026: Gemma 4's 12B Fits 8GB VRAM, Uber Burned Its 2026 Coding Budget by April

Good morning. Google keeps the Gemma drumbeat going with quantization-aware checkpoints that put the 12B in reach of an 8GB GPU, while two ICLR/ICML papers offer rare bits of genuine theory amid the product churn. Meanwhile, the bill for all this is coming due — both literally, in the form of runaway token costs, and structurally, as New York moves to freeze data center construction.

Gemma 4 gets QAT checkpoints. Google released Quantization-Aware Training versions of the Gemma 4 family, with the E2B model squeezing down to 1GB on mobile and the 12B now fitting in 8GB of VRAM. Because the quantization is baked into training rather than applied after, quality holds up better than typical post-hoc compression. HN commenters were largely happy — one was running E2B on a Mac within minutes via litert-lm — though Gemma’s persistent extreme activation ranges still force bf16 KV caches and short context windows for local users. One speculative thread: the timing lines up suspiciously with WWDC, where Apple is expected to demo a Siri rebuilt on Google models.
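
For a rough sense of why those sizes are plausible, here is a back-of-envelope sketch of weight storage at different precisions. The item above does not state the bit width of the QAT checkpoints, so 4 bits per weight is an assumption here, as is treating E2B as roughly 2B effective parameters. The estimate covers weights only and ignores KV cache, activations, and runtime overhead.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (weights only, no KV cache or overhead)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Parameter counts are illustrative assumptions, not figures from the release.
models = {"12B": 12.0, "E2B (assumed ~2B effective)": 2.0}

for name, params in models.items():
    for bits in (16, 8, 4):
        print(f"{name:>28} @ {bits:>2}-bit: {weight_gib(params, bits):6.2f} GiB")
```

Under these assumptions the 12B weights drop from well over 8 GiB at 16-bit to under 6 GiB at 4-bit, which leaves a little headroom on an 8GB card for the KV cache and explains why context length stays tight.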

The token bill comes due. TechCrunch reports that AI spending is blowing past budgets across the industry: Uber burned through its 2026 coding budget by April, Microsoft revoked internal Claude Code licenses, and one unnamed company reportedly ran up a $500 million Claude bill. The Linux Foundation responded by launching the Tokenomics Foundation, meant to build a FinOps-style discipline around token spend. Per-token prices have fallen, but agent workloads consume enough additional tokens to more than make up the difference.

Microsoft’s malaise. Wired’s piece on whether Microsoft has lost its mojo connects to that same Claude Code cancellation — Microsoft pulled the licenses partly to push developers onto Copilot, which has struggled against Anthropic’s tools. The Build conference didn’t move the stock, GitHub reliability complaints are mounting, and a VP blamed unprecedented bot traffic for the downtime. Separately, MIT Tech Review covers a Meta AI customer support agent that was tricked into reassigning Instagram accounts — including the dormant Obama White House handle — to attacker-controlled emails, with only a location-matched VPN needed to evade detection.

Meta’s smart glasses have a dormant face-recognition pipeline. A researcher reverse-engineered the Ray-Ban companion app and found a fully functional facial recognition system — three on-device models, biometric embeddings, a local database, “Person Recognized” notifications — gated but runnable end-to-end. There’s no evidence Meta is pushing identity data to user devices, but HN commenters flagged likely exposure under Illinois’ Biometric Information Privacy Act. One commenter with prosopagnosia made the obvious counterpoint: an offline version of exactly this would be genuinely useful, if it didn’t come attached to Meta.

New York moves to pause new data centers. The state legislature passed a one-year moratorium on new data centers above 20 MW, which would be the first statewide pause of its kind. Governor Hochul has until December to sign, and the recent precedent isn’t encouraging: Maine’s governor vetoed a similar bill over the lack of an exemption for an in-flight project.

Two papers worth your time. An ICML 2026 ablation study asks whether transformers need separate Q, K, and V projections, finding that sharing K and V halves the KV cache at a cost of only 3.1% in perplexity, and that combining the trick with grouped-query attention (GQA) can shrink the cache by 96.9%. HN commenters noted the 10B-token training runs are well short of modern overtrained regimes, and several asked the obvious question: if Q and K only ever appear as a product, why learn them separately at all? Separately, an ICLR 2026 outstanding paper, Transformers are Inherently Succinct, proves transformers can be exponentially more compact than RNNs for certain computations and, as a consequence, that emptiness and equivalence checks on them are EXPSPACE-complete. The HN takeaway: if you need formal verification, don’t use an LLM as the component being verified.
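
On the KV-sharing numbers from the ICML paper, the arithmetic is easy to sanity-check. A 96.9% reduction is what a 32x smaller cache looks like (1 - 1/32 = 96.875%), and one way to get 32x is a 2x saving from storing a single shared K/V tensor on top of a 16x cut in KV heads from GQA. Whether that is the configuration the paper actually used is an assumption; the layer count, head count, head dimension, and sequence length below are hypothetical values chosen only to make the sketch concrete.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_elem: int = 2, shared_kv: bool = False) -> int:
    """Per-sequence KV cache size. With shared_kv, one tensor serves as both K and V."""
    tensors_per_layer = 1 if shared_kv else 2
    return tensors_per_layer * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (not taken from the paper): 32 layers, 32 query heads,
# head_dim 128, 8192-token context, bf16 cache entries (2 bytes each).
shape = dict(layers=32, head_dim=128, seq_len=8192)

baseline = kv_cache_bytes(kv_heads=32, **shape)                     # full multi-head attention
shared = kv_cache_bytes(kv_heads=32, shared_kv=True, **shape)       # shared K/V only
gqa_shared = kv_cache_bytes(kv_heads=2, shared_kv=True, **shape)    # 16x GQA + shared K/V

for label, size in [("baseline", baseline), ("shared K/V", shared),
                    ("GQA + shared K/V", gqa_shared)]:
    saved = 1 - size / baseline
    print(f"{label:>18}: {size / 2**20:8.1f} MiB  ({saved:.1%} smaller)")
```

The percentages depend only on the ratios, so they hold for any model shape; the absolute sizes are specific to the made-up configuration.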

Did Claude make rsync buggier? A statistical analysis argues the post-Claude rsync releases aren’t significantly buggier than historical norms. HN was unconvinced: the analysis doesn’t control for commit complexity or bug severity, the buggiest release came right before Claude’s involvement (raising the possibility of undisclosed earlier use), and one commenter pointed out the irony that the defense itself appears to be AI-written and includes statistical errors. The more pragmatic concern: pressuring maintainers over disclosed AI use mostly teaches everyone else to stop disclosing.

A small-model multi-agent sim. A Build Small Hackathon entry, Thousand Token Wood, runs a five-creature trading economy on Qwen2.5-3B via vLLM. The takeaway: small models nail structured output (100% valid JSON) but need scarcity, spoilage, and tight prompting to behave like rational economic agents — naive prompting had them buying goods they already produced.

That’s it for Friday. Enjoy the weekend, and if you’re running Gemma locally, the new QAT checkpoints are probably the most fun thing to try today.