AI News — May 30, 2026: Liquid's 8B MoE Trained on 38T Tokens, Tencent's Hy3 Undercuts Claude at $0.066

Good morning. The model release cadence isn’t slowing — Anthropic shipped Opus 4.8 yesterday, Liquid AI dropped a heavily overtrained edge MoE today, and a mysterious Tencent model is somehow eating Claude’s lunch on OpenRouter. On the hardware side, Groq is back asking for money and a Korean startup is betting that memory, not compute, is the wall everyone’s about to hit.

Liquid AI’s 8B-A1B MoE arrives heavily overtrained. Liquid released LFM2.5-8B-A1B, a Mixture-of-Experts edge model with 8B total parameters and 1B active, trained on 38 trillion tokens. That is roughly 1,800x the Chinchilla-optimal token budget for its active parameter count, which several HN commenters flagged as extreme. Benchmarks look strong on instruction following and tool calling, but a user testing it on a bug-fixing benchmark found that Qwen2.5-Coder-3B, a model from two years ago, fixed 50% of bugs versus 12% for the new Liquid model. Day-one support for llama.cpp, MLX, vLLM, and SGLang is the more interesting story for anyone building on-device.
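For a feel of where a figure like that comes from, here is a back-of-envelope sketch. It assumes the common rule of thumb of about 20 training tokens per parameter as "Chinchilla-optimal" and uses the round 1B and 8B parameter counts from the release; the function and constant names are made up for illustration. Against active parameters it lands at about 1,900x, the same ballpark as the quoted figure, and against total parameters it is closer to 240x.

```python
# Rough overtraining estimate. The 20 tokens/parameter ratio is a rule of thumb
# (an assumption here), not a number taken from Liquid's release.
CHINCHILLA_TOKENS_PER_PARAM = 20


def overtraining_factor(train_tokens, params,
                        tokens_per_param=CHINCHILLA_TOKENS_PER_PARAM):
    """Return how many times the rule-of-thumb token budget was used."""
    return train_tokens / (params * tokens_per_param)


TRAIN_TOKENS = 38e12   # 38 trillion tokens, as reported
ACTIVE_PARAMS = 1e9    # ~1B active parameters (rounded)
TOTAL_PARAMS = 8e9     # ~8B total parameters (rounded)

vs_active = overtraining_factor(TRAIN_TOKENS, ACTIVE_PARAMS)
vs_total = overtraining_factor(TRAIN_TOKENS, TOTAL_PARAMS)

print(f"vs active params: {vs_active:,.0f}x")
print(f"vs total params:  {vs_total:,.1f}x")
```

Which parameter count is the right denominator for an MoE is itself debatable, so treat both numbers as rough bounds rather than a precise measurement.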

A “mysterious” Tencent model is topping OpenRouter. Max Woolf dug into Hy3, which has somehow leapt past Claude in OpenRouter token volume by more than 50% despite mediocre benchmarks and almost no documentation. The likely explanation is price — $0.066 per million input tokens — but commenters were quick to note the rankings only reflect OpenRouter’s own traffic (most Anthropic users hit the API directly), count tokens rather than users, and could easily be skewed by a single whale. One commenter speculated it might be a recycling or advertising experiment given the single inference provider.

Groq is raising $650M after Nvidia’s $20B not-acqui-hire. TechCrunch reports Groq is going back to existing investors to fund its inference cloud, six months after Nvidia’s December licensing deal paid out shareholders in cash and pulled senior staff over to Nvidia. The company is running on interim executives, and the round is essentially backstopped: Disruptive and Infinitum have committed to absorb whatever existing investors pass on.

XCENA bets $135M that memory is the real bottleneck. Founded by Samsung and SK Hynix veterans, XCENA closed a Series B at a $570M valuation for its MX1 chip, which puts compute near DRAM over CXL to handle KV cache management and preprocessing without shuttling data between CPUs, GPUs, and memory. The pitch is collapsing ten-server workloads to one. The timing fits: Samsung, SK Hynix, and Micron all crossed trillion-dollar valuations this year.

Kog claims 3,000 tokens/sec per request on “standard” GPUs. Kog.ai’s writeup describes a co-designed model-runtime-kernel pipeline that targets memory bandwidth rather than FLOPS, demonstrated on a 2B coding model running on 8× H200 or MI300X. HN commenters pushed back on two things: 8× H200 is not most people’s idea of “standard,” and a 2B model is orders of magnitude smaller than the frontier models they’re implicitly being compared to. The interesting question — whether the same optimizations hold at 27B+ — is unanswered.
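A rough way to see why model size matters so much here: single-request decoding is usually limited by how fast the weights can be streamed from memory, so an upper bound on tokens per second is memory bandwidth divided by bytes read per token. The sketch below assumes 16-bit weights, the H200's published 4.8 TB/s peak HBM bandwidth, perfect scaling across 8 GPUs, and no KV-cache or interconnect cost. None of these assumptions come from Kog's writeup, and the function name is hypothetical.

```python
def decode_ceiling_tokens_per_sec(params, bytes_per_param, bandwidth_bytes_per_sec):
    """Bandwidth-bound ceiling: every weight is read once per generated token."""
    bytes_per_token = params * bytes_per_param
    return bandwidth_bytes_per_sec / bytes_per_token


BYTES_PER_PARAM = 2      # assumption: 16-bit weights
H200_BANDWIDTH = 4.8e12  # bytes/sec, published peak for one H200
NUM_GPUS = 8

# 2B model, as in the demo described above.
one_gpu_2b = decode_ceiling_tokens_per_sec(2e9, BYTES_PER_PARAM, H200_BANDWIDTH)
eight_gpu_2b = decode_ceiling_tokens_per_sec(
    2e9, BYTES_PER_PARAM, NUM_GPUS * H200_BANDWIDTH)

# 27B model under the same idealized assumptions.
eight_gpu_27b = decode_ceiling_tokens_per_sec(
    27e9, BYTES_PER_PARAM, NUM_GPUS * H200_BANDWIDTH)

print(f"2B,  1x H200 ceiling: {one_gpu_2b:,.0f} tok/s")
print(f"2B,  8x H200 ceiling: {eight_gpu_2b:,.0f} tok/s")
print(f"27B, 8x H200 ceiling: {eight_gpu_27b:,.0f} tok/s")
```

Under these idealized numbers, 3,000 tokens/sec on a 2B model sits well below the 8-GPU ceiling, while a 27B model at the same precision could not reach 3,000 tokens/sec per request on this hardware at all without further tricks such as quantization or speculative decoding.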

Mistral pitches itself as Europe’s AI stack. Koen van Gilst’s notes from the Now Summit describe a company pivoting hard toward owning compute (a 40MW Paris data center), specialized small models, and on-prem deployment for regulated European customers like BNP Paribas and ASML. The HN thread was sympathetic but worried: multiple commenters argued Mistral has fallen behind on reasoning at usable context lengths since Q3 2025, and the on-prem strategy doesn’t obviously beat fine-tuning open-source Chinese models for the same customers.

Shift wants to clean your house in exchange for robot training data. The Verge covered Shift’s offer of free cleaning in New York (with London, SF, and Zurich coming) in exchange for letting a camera-equipped “magic hat” record the cleaners, who are contractors, not employees. HN’s response was unenthusiastic, with the top comment pointing out that the byproduct is a 3D map of your home and an inventory of its contents, neither of which has a great track record of staying private. One commenter pointed to a less-invasive alternative: just partner with a hotel chain instead.

Tiny-vLLM is a teaching-grade inference engine. A Show HN for a C++/CUDA implementation of PagedAttention, continuous batching, and FlashAttention-like kernels for Llama 3.2 1B, with a long lesson-style README that the author considers the main artifact. Reception was warm — multiple commenters compared it favorably to the early days of llama.cpp — though one noted the code skips checking CUDA API return values, which is the kind of “tiny” that bites you later.

That’s it for today. The Hy3 mystery is the one I’d watch — if a $0.066/M model can take the top of any leaderboard, even a noisy one, it says something about how price-sensitive the agentic workload market has become.