Why Claude Felt Dumber (2026 Postmortem)

On this page · 10 sections▾

TL;DR
What users reported
What the postmortem found
Why LLM quality regresses
Why evals missed it
Regression vs. prompt drift
How to report a regression
Takeaways for Claude Code users
FAQ
Sources

TL;DR — three harness bugs, not a nerf

The complaints were real. Anthropic’s April 23, 2026 postmortem — “An update on recent Claude Code quality reports” — confirms that roughly six weeks of quality complaints traced to actual technical causes, not user error.
The model was not changed. In Anthropic’s words, “The API was not impacted.” The model weights behind Claude Code stayed the same. The bugs were in the surrounding product layer — the harness.
Three overlapping causes. A default reasoning-effort downgrade (March 4, “the wrong tradeoff”), a thinking-history cache bug that erased Claude’s memory every turn (March 26), and a verbosity-capping system prompt that hurt coding quality (April 16). Each hit a different slice of traffic on its own schedule, which is why the symptom was so slippery.
Resolved. All three were fixed by April 20 in v2.1.116; usage limits were reset for all subscribers on April 23.
The durable lesson. An LLM’s perceived quality is a product of the whole serving stack, not just the weights. The rest of this post explains the mechanism and how to tell a real regression from your own context drift.

What users reported

The complaints clustered around the same handful of symptoms. Claude Code seemed to forget its own plan mid-task, repeating steps it had already done. It cut answers short where it used to be thorough. On harder engineering work, the output quality dropped enough that some users questioned whether they could trust it. The reporting in Fortune captured the frustration: one user wrote that Claude Code had “been barely usable for me in the past couple of days,” and a senior AI executive at AMD described the tool as “unusable for complex engineering tasks.”

Part of what made the episode sting was timing. Before the postmortem, the prevailing public response treated the reports as subjective — the familiar “the model feels worse” refrain that surfaces around every release and rarely survives a controlled test. That skepticism is usually healthy. Here it was misplaced: the reports were pointing at a genuine, if hidden, set of bugs.

“The models themselves were not to blame, but three separate issues in the Claude Code harness caused complex but material problems which directly affected users.”

Simon Willison · Blog

Willison's summary of the April 23 postmortem — the cleanest statement of the harness-vs-model distinction that frames the whole episode.

Source

What the postmortem found

Anthropic’s account names three changes that shipped between March and April 2026. Each affected a different portion of traffic, on a different timeline, which is exactly why no single “it broke on this date” story ever fit.

Cause 1 — Reasoning effort lowered (Mar 4 → reverted Apr 7)

Anthropic dropped Claude Code’s default reasoning effort from high to medium to fix long thinking pauses that made the UI look frozen. The postmortem calls it plainly: “This was the wrong tradeoff.” Less thinking effort meant weaker answers on hard tasks for everyone on the default.

Cause 2 — Thinking-history cache bug (Mar 26 → fixed Apr 10)

A cache optimisation was meant to clear stale thinking once from idle sessions. A bug made it fire on every turn instead. In Anthropic’s words, “Claude would continue executing, but increasingly without memory of why it had chosen to do what it was doing.” That is the forgetful, repetitive behaviour users described.

Cause 3 — Verbosity cap in the system prompt (Apr 16 → reverted Apr 20)

A system-prompt instruction told Claude to “keep text between tool calls to ≤25 words” and “keep final responses to ≤100 words unless the task requires more detail.” Combined with other prompt changes, it hurt coding quality — terse where the task needed depth.

The throughline: not one of these touched the model. They touched the settings, the cache, and the prompt that surround it. Anthropic resolved all three by April 20 in v2.1.116, and the post notes “we’re resetting usage limits for all subscribers” as of April 23. The remediation commitments were process-level: broader per-model evaluations, soak periods and gradual rollouts for prompt and harness changes, and ensuring internal staff run the exact build the public runs.

Why an LLM can genuinely regress without a model change

The 2026 episode is a clean illustration of a point that is easy to miss: the answer you get from “Claude” is the output of a whole stack, and the weights are only one layer of it. Several other layers can change quality on their own.

Reasoning-effort and decoding settings. How much the model is allowed to think, and the sampling parameters used to decode its output, are tunable knobs outside the weights. Cause 1 above was exactly this — a single default flipped from high to medium made every default session reason less.

System prompts and harness instructions. The hidden instructions wrapped around your prompt shape tone, length, and tool behaviour. A verbosity cap, a changed tool description, or a reordered instruction can shift output quality with no model change at all. Cause 3 was this category.

Context and cache management. Long sessions accumulate files and tool output that crowd the context window; idle sessions can lose warm prompt cache and reload everything cold. A bug in how that context is pruned — like Cause 2 — can quietly delete the model’s working memory. This is the same cost surface as ordinary MCP context bloat: every connected server’s tool definitions, every file read, and every stale session eats budget the model needs to do useful work. When the context-management layer misbehaves, the model looks dumber even though it is identical.

Serving and quantisation. Inference providers routinely change how a model is served — quantisation level, batching, the specific GPU kernels — to manage cost and load. These changes are invisible to the user and can move output quality in either direction. This was not a cause in Anthropic’s April postmortem (they were explicit the model and API were untouched), but it belongs in the mental model: serving changes are a real, separate way quality can move under a fixed set of weights.

Load-based routing. Some platforms route requests across model variants or capacity tiers under load. When that happens, two users with the same prompt can get materially different answers depending on where they land. Add the inherent non-determinism of sampling — the same prompt yields a slightly different answer each run — and you have several independent reasons a model can feel inconsistent without anyone shipping a new model. Choosing the right tier deliberately is its own discipline; see our model routing guide.

Why internal evals missed it for weeks

The most uncomfortable part of the story is the gap between user reports and Anthropic’s own monitoring. The postmortem acknowledges that “neither our internal usage nor evals initially reproduced the issues identified.” Three structural reasons explain that gap, and each generalises beyond this one incident.

State-dependent bugs evade single-shot tests. The cache bug only manifested in a specific state — resumed idle sessions, where stale thinking was being cleared. Most eval suites fire fresh single-turn prompts and never reach that state. Simon Willison flagged exactly why this matters for real usage:

“I frequently have Claude Code sessions which I leave for an hour (or often a day or longer) before returning to them.”

Simon Willison · Blog

Willison on why the idle-session cache bug hit heavy users hardest — the state the bug needed was his normal workflow, not an edge case.

Source

Eval suites can be too narrow. Independent reporting on the postmortem noted that the suite was “too narrow to detect a 3% quality drop from prompt changes.” A few percentage points is hard to see in aggregate numbers but very visible to a developer doing the same task daily — the human is a higher-resolution detector than a coarse benchmark.

Internal builds drifted from public builds. When staff run a different build than customers, internal dogfooding can look clean while the shipped product is broken. Anthropic’s remediation commitment to run the exact public build internally is a direct response to this.

The takeaway for users is not that evals are useless — it is that aggregate evals and your lived experience measure different things. Trust your reproducible observations, and make them reproducible so others can confirm them.

How to tell a real regression from prompt drift

Most of the time a model “feeling worse” is your session, not the platform. Before you conclude there’s a regression, run this check. It takes two minutes and resolves the question most of the time.

Step 1 — Start a fresh session

Open a new Claude Code session with an empty context. A long or stale session is the single most common cause of apparent degradation. If the fresh session is fine, your old context was the problem.

Step 2 — Run a known-good prompt

Keep a few fixed test prompts whose good output you remember. Run one. Comparing against a known baseline removes the “am I imagining this?” problem — you are measuring against yourself, not a vibe.

Step 3 — Check version, model, and status

Run claude --version and confirm the model in use. A recent update or a model switch explains a lot. Then check the Anthropic status page and release notes for the dates in question.

Step 4 — Look for corroboration

If a clean, minimal prompt is still bad and others report the same symptom on the same dates, that points to a platform issue worth reporting. One person’s bad day is noise; a dated, reproducible pattern across users is signal.

The logic is simple. Prompt drift and context bloat are local — they live in your session and vanish in a fresh one. A real regression is global — it survives a clean session and shows up for other people on the same build. The fresh-session test is what separates the two.

How to report a regression usefully

If you’ve done the check above and it still looks like a platform problem, a good report makes it actionable. The difference between “it feels worse” and a fixable bug report is reproducibility.

Use the built-in channel. The /bug command inside Claude Code attaches session context automatically. For tracked issues, file on the anthropics/claude-code GitHub repo.
Pin the version. Include the exact output of claude --version and the model in use. The 2026 fixes landed in a specific build (v2.1.116); version awareness is what lets a maintainer map your report to a known change.
Give a minimal repro. The smallest prompt that shows the problem, plus whether a fresh session reproduces it. State-dependent bugs — like the idle-session cache bug — are only findable when the report says “happens after the session sits idle for an hour.”
Date it. Note when you first saw the change. Because the 2026 causes each started on different days, precise dates were essential to untangling them.

Takeaways for Claude Code users

The episode is resolved, but the operating habits it teaches are evergreen.

Stay current, and read release notes. The fixes shipped in v2.1.116. Running an old build means living with old bugs. When quality shifts, the changelog is the first place to look.
Manage your context deliberately. A long session degrades on its own. Start fresh for new tasks, and keep MCP overhead lean — the same discipline covered in our context-bloat guide directly affects how much room the model has to reason.
Keep a small baseline. A handful of fixed test prompts turns “feels worse” into a measurement you can act on. This is the single highest-leverage habit from the whole story.
Separate the model from the harness. When Claude underperforms, the cause is more often a setting, a prompt, or your context than the model itself. The same weights served through the API behaved normally throughout the 2026 incident. Solid prompting and session hygiene — the ground covered in Claude Code best practices — insulate you from most of the day-to-day variance.
Report, don’t just rage. Reproducible, dated, version-pinned reports are what let Anthropic correlate complaints to a build and ship a fix. The postmortem exists because enough people reported precisely.

The headline question — “did Anthropic nerf Claude?” — has a clear answer for this episode: no. The model held steady; three layers in front of it slipped. That is reassuring and instructive at once. The model you rely on is only as good as the stack serving it, and the best defence is the boring one: stay current, keep a baseline, manage your context, and report what you can reproduce.

Frequently asked questions

Did Claude get worse in early 2026?

Yes, for Claude Code users — and the cause was documented, not imagined. Anthropic's April 23, 2026 postmortem ("An update on recent Claude Code quality reports") traces roughly six weeks of complaints to three separate changes in the Claude Code harness, shipped between March 4 and April 16. The underlying model and the API were not affected; the bugs lived in the layers around the model.

Did Anthropic nerf Claude or change the model weights?

No. Anthropic states plainly: "The API was not impacted." The model weights behind Claude Code did not change during the degradation window. The three issues were a default reasoning-effort downgrade, a thinking-history cache bug, and a verbosity system-prompt instruction — all in the Claude Code product layer, not the model. Identical model weights served through the API behaved normally throughout.

What did Anthropic's postmortem actually say caused it?

Three overlapping changes. (1) On March 4 they lowered the default reasoning effort from high to medium to fix UI latency; the post calls it "the wrong tradeoff" and it was reverted April 7. (2) On March 26 a cache optimisation meant to clear stale thinking once instead cleared it every turn, so Claude lost track of why it was doing things; fixed April 10. (3) On April 16 a system-prompt instruction capping response length hurt coding quality; reverted April 20.

Is Claude Code degraded right now?

The three documented issues were resolved as of April 20, 2026 in v2.1.116, and Anthropic reset usage limits for all subscribers on April 23. If Claude Code feels off today, first rule out the everyday causes — a bloated context window, a stale long-running session, or prompt drift — before assuming a platform regression. Check the Anthropic status page and recent release notes before concluding it is a server-side problem.

Why does Claude feel dumber some days but fine others?

Day-to-day variance usually comes from your session state, not the model. A long session fills its context with old files and tool output; an idle session can lose warm prompt cache; a context near its limit degrades reasoning. Non-determinism also means the same prompt can produce a weaker answer once in a while. The 2026 postmortem adds a third possibility: an infrastructure bug that affects only a specific state, such as resumed idle sessions.

Can an AI model regress without the model itself changing?

Yes — that is the core lesson of the 2026 episode. A model is served through a harness: reasoning-effort settings, system prompts, context-management caching, routing, and serving infrastructure. A bug or a tuning change in any of those layers can make output measurably worse while the weights are byte-for-byte identical. Quantisation or serving changes can do the same. The model is only one variable in the quality you experience.

How do I tell a real regression from my own prompt drift?

Reproduce in a fresh session with a minimal, known-good prompt. If a clean session on the latest build produces a good answer, the problem was your context (drift, bloat, or a stale session), not the platform. If a clean, minimal prompt is also bad — and others report the same on the same dates — that points to a platform regression worth reporting. Keeping a small set of fixed test prompts makes this comparison fast.

How should I report a Claude Code quality regression?

Use the /bug command inside Claude Code, or file a GitHub issue on anthropics/claude-code. Include your exact version (claude --version), the date and time, the model in use, a minimal reproduction prompt, and whether a fresh session reproduces it. Concrete, reproducible reports are far more useful than "it feels worse," and they are what let Anthropic correlate complaints to a specific build and date.

Sources

Primary — Anthropic

anthropic.com/engineering/april-23-postmortem — “An update on recent Claude Code quality reports.” The three causes, the dates, “the wrong tradeoff,” “the API was not impacted,” v2.1.116, and the usage-limit reset (Apr 23, 2026).

Community / corroborating

simonwillison.net — recent Claude Code quality reports — the harness-vs-model framing and the idle-session usage pattern (Apr 24, 2026).
infoq.com — Anthropic traces six weeks of complaints — “three unrelated product-layer changes” framing; the narrow-eval and internal-build analysis (May 14, 2026).
fortune.com — Anthropic explains the performance decline — user-reported symptoms and reaction (Apr 24, 2026).
Hacker News 47878905 — the postmortem discussion thread.

Related on mcp.directory

/blog/mcp-context-bloat-fix-2026-tool-search-code-mode-progressive-disclosure — how context and tool definitions eat the budget the model needs.
/blog/claude-code-best-practices — prompting and session hygiene.
/blog/fable-5-claude-code-model-routing-guide-2026 — choosing the right model and effort tier.

Why Claude felt dumber in early 2026 — and what the postmortem revealed