Lenny's Newsletter · 06-30 · 23:22

Sonnet 5 review: I ran 64 generations to find out if it's worth it

I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. The results were not what I expected.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:

What Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up
How I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history
Why I combined human vibe scoring (70%) with LLM-as-judge scoring (30%) instead of trusting either alone
How to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON
Which model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily
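
To make the 70/30 split concrete, here is a minimal sketch of how exported human scores might be blended with LLM-as-judge scores. Only the weights come from the post. The JSON shape, the function names (`blend`, `rank_models`), the 1-10 scale, and the placeholder model names and scores are all assumptions for illustration, not the actual How I AI Bench code or results.

```python
import json

HUMAN_WEIGHT = 0.7  # share given to human vibe scores (from the post)
JUDGE_WEIGHT = 0.3  # share given to LLM-as-judge scores (from the post)


def blend(human: float, judge: float) -> float:
    """Weighted blend of two scores that share the same scale."""
    return HUMAN_WEIGHT * human + JUDGE_WEIGHT * judge


def rank_models(human_json: str, judge_scores: dict[str, float]) -> list[tuple[str, float]]:
    """Combine exported human scores with judge scores and sort best-first."""
    # Assumed export format from the scoring page: {"model_name": score}
    human_scores = json.loads(human_json)
    combined = {
        model: round(blend(human_scores[model], judge_scores[model]), 2)
        for model in human_scores
        if model in judge_scores  # skip models missing a judge score
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores on a 1-10 scale, purely to show the arithmetic
exported = '{"model_a": 8.0, "model_b": 6.0}'
judge = {"model_a": 6.0, "model_b": 9.0}
ranking = rank_models(exported, judge)
print(ranking)
```

With these placeholder numbers, the model the human preferred still ranks first even though the judge scored the other one higher, which is the practical effect of weighting gut feel at 70%.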