
AI News — May 17, 2026: llama.cpp MTP Merge Lifts DeepSeek Speeds 1.8×, Orthrus Claims 7.8× on Qwen3


Good morning. The big story today is happening at the systems level: llama.cpp landed Multi-Token Prediction support overnight and local-model enthusiasts are losing their minds over it, while a separate paper claims even bigger speedups on Qwen3 through a different mechanism. Elsewhere, AI is officially breaking CTF competitions, OpenAI is handing out ChatGPT Plus to an entire country, and Nvidia put out a 2.6B “world model” that mostly looks like Unreal Engine footage.

llama.cpp gets MTP, community celebrates. A long-awaited pull request adding Multi-Token Prediction support was merged into llama.cpp, letting models with MTP layers — DeepSeek being the obvious one — use them for speculative decoding. Reported speedups land in the 1.5×–1.8× range for token generation, with AMD APU users on the Vulkan backend seeing around 30%. The enthusiasm spilled into multiple parallel threads, with one commenter calling it “probably the biggest speedup we’ll see in llama.cpp until Eagle3 or DFlash become available.” Prompt processing speed, the perennial complaint, remains unfixed.
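For readers who haven’t followed the speculative-decoding literature, the reason an MTP head helps is mechanical: the cheap head drafts a few tokens ahead, the full model verifies the whole draft in one batched forward pass, and only tokens the full model would have produced anyway are kept, so generation gets faster without changing output. Below is a rough sketch of that verify loop in Python, with hypothetical draft_next and target_logits callables standing in for the real machinery (this is not llama.cpp’s API; the merge wires the MTP layer in as the draft model internally):

```python
# Minimal sketch of the draft-and-verify loop that MTP heads plug into.
# `draft_next(ids)` returns one drafted token id; `target_logits(ids)` returns
# one logits row per position (numpy arrays assumed). Both are hypothetical
# stand-ins for illustration, not llama.cpp functions.

def speculative_step(prompt_ids, draft_next, target_logits, k=4):
    """Draft k tokens cheaply, then verify them with one batched target pass."""
    # 1. Draft: the cheap head proposes k tokens autoregressively.
    draft, ctx = [], list(prompt_ids)
    for _ in range(k):
        t = draft_next(ctx)              # one cheap forward per drafted token
        draft.append(t)
        ctx.append(t)

    # 2. Verify: the full model scores prompt + draft in a single forward pass.
    logits = target_logits(prompt_ids + draft)   # len(prompt) + k rows

    # 3. Accept the longest prefix where the target (greedy) agrees with the draft.
    accepted = []
    for i, t in enumerate(draft):
        target_choice = int(logits[len(prompt_ids) - 1 + i].argmax())
        if target_choice != t:
            accepted.append(target_choice)   # target's own token replaces the miss
            break
        accepted.append(t)
    else:
        # Every drafted token was accepted; the target's next token comes for free.
        accepted.append(int(logits[-1].argmax()))

    return prompt_ids + accepted
```

The speedup comes from step 2: verifying k drafted tokens costs one batched forward pass of the big model instead of k sequential ones, which is why the reported gains vary so much with how often the draft head guesses right.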

Orthrus claims 7.8× on Qwen3 with identical output. As we noted yesterday, a paper proposing Orthrus — injecting a trainable diffusion attention module into a frozen autoregressive Qwen3 — has continued generating discussion, with HN commenters probing whether the tokens-per-forward speedup actually translates to a compute reduction. The pitch is that both heads share one KV cache, only 16% of parameters get fine-tuned, and output distribution is provably identical to the base model. Several commenters want GGUF support before they’ll take it seriously, which is the right instinct.

Brockman officially runs OpenAI’s products. Greg Brockman’s interim product role has been made permanent, with ChatGPT, Codex, and the developer API consolidating into a single team under him. TechCrunch frames it as a push toward an “agentic future” while Fidji Simo remains on medical leave, and ties the move to Altman’s late-2024 “code red” directive that already led to the shelving of Sora and OpenAI for Science.

Frontier models have broken CTFs. A veteran competitive hacker argues that the open CTF format is effectively dead, with Claude Opus 4.5 making most medium and even some hard challenges trivially solvable through orchestrated AI agents. Scoreboards now reflect who can run the most agents with the most context for the longest, not who has security expertise. HN commenters extended the argument to education and competitive programming generally — one noted that “this will basically become true for everything,” which feels about right.

OpenAI gives Malta a free year of Plus. OpenAI is partnering with the Government of Malta to give every citizen a free year of ChatGPT Plus, gated behind a voluntary two-hour AI literacy course. HN reactions ranged from privacy concerns to the observation that even if every Maltese citizen signed up, ChatGPT’s weekly active users would barely budge — Malta has about 500,000 people. Someone inevitably suggested Anthropic should try Sicily next.

SANA-WM: 2.6B “world model” from Nvidia. Nvidia Labs released SANA-WM, a 2.6B-parameter model that generates one-minute 720p video with 6-DoF camera control, code under Apache 2.0 and weights under Nvidia’s commercial-use license. HN was split: impressed by the quality-to-size ratio, but skeptical about both the “world model” framing (it appears trained largely on Unreal Engine synthetic footage and outputs look like video games) and the “open” claim given initial weight availability issues. The genuinely interesting question one commenter raised: is there an abstract spatial representation in here, or is this just a physically coherent video generator?

δ-mem proposes fixed-size LLM memory. A new paper, δ-mem, compresses historical context into a small fixed-size state matrix — as compact as 8×8 — using delta-rule learning to generate low-rank corrections to attention, no fine-tuning required. The authors report up to 31% gains on MemoryAgentBench. HN was lukewarm, with one commenter dismissing it as “DeltaNet hypernetworks bolted onto existing LLMs” and others wanting evidence it actually helps real coding agents rather than benchmark scores.
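For context on the “DeltaNet hypernetworks” jab: the delta rule is a classic fast-weight update that writes key-value associations into a fixed-size matrix by correcting only the prediction error, which is how a state as small as 8×8 can summarize an arbitrarily long history at constant cost. A toy numpy sketch of the generic rule follows (not δ-mem’s specific parameterization, and without the low-rank attention corrections the paper adds on top):

```python
import numpy as np

# Generic delta-rule fast-weight memory: S is the fixed-size state (e.g. 8x8).
# This is the textbook rule the "DeltaNet" comment refers to, not delta-mem itself.

def delta_rule_update(S, k, v, beta=0.5):
    """Write the association k -> v into S, correcting only the prediction error."""
    k = k / (np.linalg.norm(k) + 1e-8)        # normalized key
    pred = S @ k                              # what the memory currently recalls for k
    return S + beta * np.outer(v - pred, k)   # nudge S @ k toward v

def delta_rule_read(S, q):
    """Query the compressed history with q; cost is independent of history length."""
    return S @ q

# Usage: fold 1,000 tokens' worth of (key, value) pairs into an 8x8 state.
rng = np.random.default_rng(0)
S = np.zeros((8, 8))
for _ in range(1000):
    k, v = rng.standard_normal(8), rng.standard_normal(8)
    S = delta_rule_update(S, k, v)
recalled = delta_rule_read(S, rng.standard_normal(8))
```

The appeal for agent memory is the constant footprint: the state never grows the way a KV cache does, which is also exactly why commenters want evidence it retains what a real coding agent actually needs.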

Qwen 3.6 27B holds its own against frontier models. A side-by-side test asked local Qwen 3.6 (in Q4_K_M, Q8_0, and unquantized BF16) and various frontier models to one-shot a single-file HTML canvas car animation. Community favorites were Kimi K2.6 thinking and — notably — Qwen 3.6 27B at Q4_K_M, with multiple commenters reproducing results nearly identical to the original poster’s on their own hardware. One commenter made a sharp point: these models are doing spatial reasoning blind, and giving them eyes via playwright-mcp dramatically improves their output.

That’s the briefing. The MTP merge is the kind of infrastructure-level win that quietly accelerates everyone running local models — worth checking if your weights of choice have the layers for it.
