NLP Newsletter (Elvis) · 06-14 · 15:00

🥇 Top AI Papers of the Week

1. MiniMax Sparse Attention

Ultra-long context is now a core requirement for agents, codebase-scale reasoning, multimodal workflows, and persistent memory, but dense softmax attention still makes million-token deployment expensive. MiniMax Sparse Attention (MSA) tackles this by adding blockwise sparsity on top of Grouped Query Attention, with a lightweight routing branch that chooses which key-value blocks each query group should actually attend to.
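To make the routing idea concrete, here is a minimal sketch of blockwise sparse attention for a single query group. It is an illustration under stated assumptions, not the paper's implementation: the router here scores each key block by the dot product between the query and the block's mean key, and the names `route_blocks`, `sparse_attention`, `block_size`, and `top_k` are hypothetical. The summary above does not say how MSA's routing branch is actually built.

```python
import math


def route_blocks(group_query, keys, block_size, top_k):
    """Pick top_k key blocks for one query group (hypothetical router)."""
    # Split the keys into contiguous blocks.
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    scores = []
    for idx, block in enumerate(blocks):
        dim = len(block[0])
        # Assumption: summarize each block by its mean key vector.
        mean = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scores.append((sum(q * m for q, m in zip(group_query, mean)), idx))
    # Keep the highest-scoring blocks, returned in their original order.
    return sorted(idx for _, idx in sorted(scores, reverse=True)[:top_k])


def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the routed key-value blocks."""
    chosen = route_blocks(query, keys, block_size, top_k)
    positions = [
        p
        for b in chosen
        for p in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[p])) for p in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]


# Toy data: 8 keys in 4 blocks of 2; the query points along the first axis.
keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [2, 0], [2, 0]]
values = [[float(i)] for i in range(8)]
print(route_blocks([1, 0], keys, block_size=2, top_k=2))  # [0, 3]
```

With `top_k` equal to the number of blocks, this reduces to ordinary dense attention, so the sparsity budget is the only knob that changes the cost.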

Paper | Tweet


Message from the Editor

We just released 30 Days of Hermes Agent, a hands-on lab that teaches agent workflows in a real, interactive terminal. Across 30 short labs, you use Hermes Agent to turn a messy Personal Knowledge Vault into a working knowledge operations system with readable notes, searchable context, reusable templates, review workflows, task boards, safety rules, and handoff docs.

Start 30 Days of Hermes Agent


2. Self-Harness

Most agent scaffolds are built once by hand and then frozen, even as the underlying models keep changing. This paper introduces Self-Harness, a paradigm where an LLM agent improves its own operating harness (the prompts, tools, memory, and orchestration around the base model) without help from human engineers or a stronger external agent. Because every model fails in its own way, the system mines those model-specific weaknesses and turns them into concrete, executable harness edits rather than generic advice.

Paper | Tweet


3. Agents’ Last Exam

From Berkeley RDI, Agents’ Last Exam (ALE) is a living benchmark built to measure whether agents can do economically valuable work, not just score well on academic tests. It was assembled with more than 250 industry experts and maps over 1,000 verifiable tasks to the U.S. federal occupational taxonomy, organized as 55 subfields across 13 industry clusters. Every task has an objective, checkable outcome, so there is no subjective human grading, and the pool is designed to keep growing as new workflows are onboarded.

Paper | Tweet


4. How AI Agents Reshape Knowledge Work

This economics paper, drawing on large-scale production data from Perplexity, studies how the shift from conversational assistants to autonomous agents is reshaping knowledge work. It compares Search, a conversational assistant, with Computer, a general-purpose agent system, along three dimensions: autonomy, efficiency, and the scope of tasks people take on. The framing is a cost-structure model in which agents carry higher fixed and delegation costs but lower per-step marginal costs, so they win once tasks are complex enough.
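The cost-structure argument can be illustrated with a small break-even calculation. This is a sketch that assumes total cost is linear in the number of task steps (a fixed cost plus a per-step cost); the paper's actual model may be richer, and all numbers below are made up for illustration.

```python
def total_cost(fixed, per_step, steps):
    """Linear cost model (an assumption): fixed cost plus per-step cost."""
    return fixed + per_step * steps


def break_even_steps(fixed_agent, step_agent, fixed_assistant, step_assistant):
    """Task size at which the agent and the assistant cost the same."""
    if step_agent >= step_assistant:
        raise ValueError("the agent needs the lower per-step cost to ever win")
    return (fixed_agent - fixed_assistant) / (step_assistant - step_agent)


# Made-up numbers: the agent has a higher fixed cost but cheaper steps.
n_star = break_even_steps(
    fixed_agent=10.0, step_agent=0.5, fixed_assistant=1.0, step_assistant=2.0
)
print(n_star)  # 6.0
```

Below the break-even size the assistant is cheaper; above it the agent's lower marginal cost outweighs its fixed and delegation costs.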

Paper | Tweet


5. Agentopia

Agentopia is one of the most ambitious agent-society testbeds yet. The 79-page release drops 100 LLM agents into a persistent world and lets them live, form relationships, and pursue goals over 10 simulated years, a horizon orders of magnitude longer than prior day-level work. Beyond observing emergent social behavior, the authors use the simulation as a training signal, applying rejection sampling to optimize models toward a life reward that reflects human well-being.
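As a rough picture of how rejection sampling can turn a reward into training data, the sketch below scores candidate trajectories and keeps only the best ones. It is a generic illustration, not the authors' pipeline: `life_reward`, the `wellbeing` field, and the thresholds are hypothetical stand-ins for whatever the paper actually measures.

```python
def life_reward(trajectory):
    """Hypothetical reward: read a precomputed well-being score."""
    return trajectory["wellbeing"]


def rejection_sample(candidates, reward_fn, keep_top=1, min_reward=None):
    """Keep the highest-reward candidates, optionally above a floor."""
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    kept = ranked[:keep_top]
    if min_reward is not None:
        kept = [c for c in kept if reward_fn(c) >= min_reward]
    return kept


# Made-up candidate trajectories sampled for the same agent and situation.
candidates = [
    {"plan": "a", "wellbeing": 0.2},
    {"plan": "b", "wellbeing": 0.9},
    {"plan": "c", "wellbeing": 0.6},
]
accepted = rejection_sample(candidates, life_reward, keep_top=2, min_reward=0.5)
print([c["plan"] for c in accepted])  # ['b', 'c']
```

The accepted trajectories would then serve as fine-tuning targets, while the rejected ones are simply discarded.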

Paper | Tweet


6. The Geometry of On-Policy Distillation

On-policy distillation (OPD) has become one of the most discussed post-training recipes of the year, but it has mostly been treated as a black box sitting somewhere between supervised fine-tuning (SFT) and RL. This paper opens it up, characterizing how OPD changes a model's weights at the level of parameter geometry, and argues that OPD is not a midpoint between SFT and reinforcement learning with verifiable rewards (RLVR) but a distinct kind of update.

Paper


7. Lookahead Sparse Attention

Long-context decoding is bottlenecked by the KV cache, which grows with every token and quickly dominates memory at extreme context lengths. This work, branded around DeepSeek-V4, introduces Lookahead Sparse Attention (LSA), which avoids storing the full KV cache by predicting which parts of the context future decoding will actually need and retaining only those query-critical chunks.

Paper


8. Latent Spatial Memory

Video world models struggle to stay consistent over long horizons because explicit 3D memory usually requires an expensive pixel-space loop. The proposed method, Mirage, instead stores scene information directly in diffusion latent space, using depth-guided back-projection and latent-space warping to maintain persistent spatial memory. The approach reports up to 10.57 times faster end-to-end generation and 55 times lower memory use than explicit 3D-memory baselines while improving long-horizon spatial consistency.
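For readers unfamiliar with depth-guided back-projection, the sketch below shows the standard pinhole-camera version for a single pixel: given a depth value and camera intrinsics, it recovers the 3D point in camera coordinates. This is the textbook pixel-space operation, shown only to ground the term; how Mirage applies it to latent features is not described in the summary above, and the intrinsics used here are made-up values.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D point in camera space.

    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Inverse operation: map a 3D camera-space point back to a pixel."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Made-up intrinsics for a 640x480 image.
point = back_project(420, 240, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(point)
```

Warping a stored memory into a new view amounts to back-projecting with the old camera, transforming the points into the new camera's frame, and projecting again.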

Paper


9. The Consistency Illusion

Multi-agent debate is often judged by whether the agents end up agreeing, but this paper shows that output-level consensus can hide deep disagreement in the reasoning that produced it. The authors abstract agents’ reasoning traces and decisions into four states along two axes, reasoning similarity and conclusion agreement, and flag divergent agreement, where agents reach the same answer through very different paths. Across 600 content-moderation items, divergent agreement appeared in 118 cases and separated cleanly from genuine disagreement states with a Cohen’s d of 0.80, and routing on these categories beat divergence-only methods at flagging high-disagreement cases.
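The two-axis scheme and the effect-size statistic can be sketched in a few lines. Only "divergent agreement" is named in the summary above; the other three state labels, the 0.5 similarity threshold, and the function names are assumptions made for illustration. The `cohens_d` helper uses the usual pooled-standard-deviation definition.

```python
import math
import statistics


def debate_state(reasoning_similarity, same_conclusion, threshold=0.5):
    """Map one debate outcome to one of four states (labels hypothetical)."""
    similar = reasoning_similarity >= threshold
    if same_conclusion:
        return "convergent agreement" if similar else "divergent agreement"
    return "convergent disagreement" if similar else "divergent disagreement"


def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_var = (
        (len(a) - 1) * statistics.variance(a)
        + (len(b) - 1) * statistics.variance(b)
    ) / (len(a) + len(b) - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Same answer reached through dissimilar reasoning traces.
print(debate_state(0.2, same_conclusion=True))  # divergent agreement
# Toy samples: means 4 and 3, both with sample variance 4.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

By the common rule of thumb, a d of 0.80 like the one reported counts as a large effect.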

Paper | Tweet


10. Beyond Scalar Rewards

Reward models usually compress a judgment into a single scalar, but this paper argues human preferences are better captured as score distributions, and proposes Z-Reward, which internalizes reasoning into a predicted distribution before scoring. A large vision-language teacher does the reasoning-heavy judgment and is distilled into a compact student for efficient deployment, with the 27B teacher reaching 89.6% human-preference accuracy and the 9B student nearly matching it at 88.6%. Used as a reinforcement learning signal, it delivers a 41.3% net preference improvement over a supervised baseline, beating GRPO and other reward methods.
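A toy example shows what a scalar reward throws away. The sketch below assumes a five-point rating scale and represents a judgment as a probability distribution over those scores; the scale, the function names, and both example distributions are made up, and nothing here reflects how Z-Reward itself predicts or uses its distributions.

```python
import math


def expected_score(probs):
    """Mean of a distribution over scores 1..len(probs)."""
    return sum(p * s for s, p in enumerate(probs, start=1))


def score_std(probs):
    """Standard deviation of the same distribution."""
    mean = expected_score(probs)
    return math.sqrt(sum(p * (s - mean) ** 2 for s, p in enumerate(probs, start=1)))


# Two made-up judgments that collapse to the same scalar score of 3.
consensus = [0.0, 0.0, 1.0, 0.0, 0.0]  # every rater says 3
polarized = [0.5, 0.0, 0.0, 0.0, 0.5]  # raters split between 1 and 5

print(expected_score(consensus), score_std(consensus))  # 3.0 0.0
print(expected_score(polarized), score_std(polarized))  # 3.0 2.0
```

A scalar reward model would treat these two cases identically, while a distribution keeps the disagreement visible.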

Paper
