研究Microsoft Research· 06-14 · 07:17

关于我们近期 AI 委派与长程可靠性研究的补充说明

Further Notes on Our Recent Research on AI Delegation and Long-Horizon Reliability

Our recent paper, “LLMs Corrupt Your Documents When You Delegate”, has generated discussion about the reliability of AI systems in delegated workflows. We appreciate the interest in this work and want to clarify several important points about what the paper does—and does not—claim.

The research aims to develop robust evaluation methods for long-horizon delegated and collaborative tasks. More broadly, this work reflects an ongoing effort to better understand the gap between strong benchmark performance and certain real-world tasks. Using a controlled evaluation methodology, we examine how well information is preserved across these extended workflows. Within this constrained setting, we observe that models can accumulate fidelity degradation over repeated edits. Note however, that current production systems can mitigate these effects through verification loops, orchestration, and domain-specific tooling.

Our goal is not to argue against the use of AI systems in professional workflows, but rather to identify where current systems need further research and engineering to help make them more trustworthy collaborators. This benchmark is intended as a diagnostic tool for examining delegation patterns, not a measure of overall model capability, task success, or user outcomes.

Main results

The paper evaluates a specific interaction pattern we call delegated work—situations where a user entrusts an AI system to carry out multi-step modifications to important artifacts such as documents, spreadsheets, code, or structured files with limited human verification between steps.

We use chained transformation-and-inversion tasks that evaluate whether semantic content is preserved accurately across extended delegated workflows. Our evaluation uses domain-specific semantic parsing to focus on meaningful changes to the underlying artifact rather than superficial formatting or stylistic differences. The errors we report thus correspond to degradation in the underlying semantic content but, our measure of “corruption” did not include task completion or user satisfaction.

Using this methodology, we find that current frontier models can introduce sparse but consequential errors during long-horizon workflows, and that these errors may accumulate over repeated interactions. Across the evaluated settings, strong state-of-the-art models showed roughly a 19–34% degradation in artifact fidelity over 20 delegated iterations. Notably, Python workflows generally exhibited stronger robustness under extended delegated interactions, with less than 1% degradation on average.

Azure AI Foundry Labs

Get a glimpse of potential future directions for AI, with these experimental technologies from Microsoft Research.

Methodological limitations

DELEGATE-52 was intentionally designed as a stress test for long-horizon delegated execution. The benchmark evaluates whether systems preserve artifact integrity across extended sequences of transformations and inversions.

The study focuses specifically on delegated execution with limited human intervention between steps. It does not attempt to measure the full range of real-world AI deployments, many of which involve substantially more oversight, verification, and workflow structure.

The paper also evaluated a simplified agentic harness with tool use capabilities such as Python execution and file operations. While this setup did not eliminate the observed degradation, it should not be interpreted as representative of production-grade systems optimized for specific workflows or enterprise domains.

Implications

We believe the primary implication of this work is that reliable long-horizon delegation remains an important open research and engineering challenge.

The results suggest that strong short-horizon benchmark performance alone may not guarantee dependable delegated execution over extended workflows. At the same time, the findings should not be interpreted as evidence that AI systems lack practical value in real-world work today.

In practice, many deployed AI systems combine models with specialized harnesses, orchestration layers, retrieval systems, verification procedures, memory mechanisms, and human oversight designed to improve reliability and deliver useful user outcomes despite underlying model limitations. We expect continued improvements in models, workflow-aware training, memory systems, and production-grade agentic harnesses to further reduce these failure modes over time.

这篇还没有中文全文

该条目暂未提供中文翻译。标题/摘要已自动中译;本系统只对人工挑选的内容生成全文翻译。

挑中后 → markitdown 取正文 → 精翻 → 此处切换为译文