News · Hacker News · 07-04 · 21:51

GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

GPT-5.5 Codex reasoning-token clustering at 516/1034/1552 may be leading to degraded performance on complex tasks #30364

Description

Summary

I found an aggregate pattern in Codex token_count metadata: gpt-5.5 responses disproportionately land at exactly reasoning_output_tokens = 516, with additional fixed-boundary spikes around 1034 and 1552.
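A minimal sketch of how such a spike could be measured. It assumes the token_count events have already been exported as records with `model` and `reasoning_output_tokens` fields; that record shape and the sample values are hypothetical, not the actual Codex log schema or real telemetry.

```python
from collections import Counter

def exact_value_share(records, model, value=516):
    """Fraction of one model's responses whose reasoning token count equals `value`."""
    counts = [r["reasoning_output_tokens"] for r in records if r["model"] == model]
    if not counts:
        return 0.0
    return Counter(counts)[value] / len(counts)

# Illustrative records only (hypothetical shape and values).
sample = [
    {"model": "gpt-5.5", "reasoning_output_tokens": 516},
    {"model": "gpt-5.5", "reasoning_output_tokens": 516},
    {"model": "gpt-5.5", "reasoning_output_tokens": 203},
    {"model": "gpt-5.5", "reasoning_output_tokens": 1034},
    {"model": "other-model", "reasoning_output_tokens": 487},
    {"model": "other-model", "reasoning_output_tokens": 611},
]

print(exact_value_share(sample, "gpt-5.5"))      # 0.5
print(exact_value_share(sample, "other-model"))  # 0.0
```

Running the same function with `value=1034` or `value=1552` would check the other two boundaries.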

This appears model-specific and coincides with lower overall reasoning-token intensity, which may help explain degraded performance on complex/high-stakes Codex tasks.

This is related to #29353, which reported a task-level reproduction where gpt-5.5 runs ending at exactly 516 reasoning tokens returned the wrong answer. This issue adds aggregate evidence across a larger Feb-Jun window.

I am not claiming this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior.

Environment

Evidence

At the same time, overall reasoning-token intensity decreased:

Why this looks suspicious

The anomaly is not simply higher reasoning-token usage overall. Mean and P90 reasoning-token intensity fell from February-April to May-June, while exact-516 clustering rose sharply.
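For reference, the period comparison can be sketched as below. This is an assumption about method rather than the exact computation behind the figures: it uses a nearest-rank P90, and the input list is made-up illustration data.

```python
from statistics import mean

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]

def summarize(token_counts):
    """Mean and P90 of reasoning token counts for one time window."""
    return {"mean": mean(token_counts), "p90": nearest_rank_percentile(token_counts, 90)}

# Hypothetical window of reasoning token counts.
window = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(summarize(window))  # {'mean': 550, 'p90': 900}
```

Calling `summarize` once on the February-April counts and once on the May-June counts gives the two pairs being compared.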

The clustering is also not evenly distributed across models. gpt-5.5 accounts for only 19.3% of responses but 82.0% of exact-516 events. Its exact-516 / >=516 ratio is about 33.6x higher than the non-GPT-5.5 baseline.
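The ratio comparison above can be sketched as follows. The function names and sample lists are hypothetical; only the 19.3% and 82.0% figures come from this issue.

```python
def exact_over_tail_ratio(token_counts, threshold=516):
    """Share of responses at exactly `threshold` among those at or above it."""
    tail = [t for t in token_counts if t >= threshold]
    if not tail:
        return 0.0
    return sum(1 for t in tail if t == threshold) / len(tail)

def relative_ratio(target_counts, baseline_counts, threshold=516):
    """How many times higher the target's exact/tail ratio is than the baseline's."""
    base = exact_over_tail_ratio(baseline_counts, threshold)
    if base == 0:
        return None  # undefined when the baseline never lands on the threshold
    return exact_over_tail_ratio(target_counts, threshold) / base

# Made-up counts for illustration only.
target = [516, 516, 516, 700]
baseline = [516, 600, 700, 800, 900, 1000, 1100, 1200]
print(relative_ratio(target, baseline))  # 6.0

# Over-representation implied by the reported shares: 82.0% of events from 19.3% of responses.
print(round(0.820 / 0.193, 2))  # 4.25
```

Conditioning on `>= 516` keeps short responses, which could never reach the boundary, from diluting the comparison.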

The fixed values are also notable: 516, 1034, and 1552 look like repeated threshold boundaries rather than a naturally varying reasoning-token distribution.
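A quick arithmetic check on the three values shows they are evenly spaced. This is only an observation about the numbers and says nothing about the cause.

```python
boundaries = [516, 1034, 1552]

# Gap between consecutive boundaries.
gaps = [b - a for a, b in zip(boundaries, boundaries[1:])]
print(gaps)  # [518, 518]

# Each boundary can be written as k * 518 - 2 for k = 1, 2, 3.
reconstructed = [k * 518 - 2 for k in (1, 2, 3)]
print(reconstructed == boundaries)  # True
```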

Expected behavior

Reasoning-token counts for complex Codex tasks should vary naturally with task complexity and should not disproportionately cluster at exact fixed values for one model family.

Actual behavior

gpt-5.5 responses cluster heavily at exactly 516 reasoning tokens, with related spikes around 1034 and 1552. This pattern is much weaker or absent in several other models.

Ask

Could the Codex team investigate whether gpt-5.5 has a reasoning-budget, routing, truncation, fallback, or scheduler behavior that causes responses to terminate around 516/1034/1552 reasoning tokens?

If this is expected behavior, it would be useful to know whether exact 516 indicates a normal stopping point, a budget cap, a degraded tier, or another internal threshold.
