MCP context bloat: the 2026 fixes that actually work
Connect seven MCP servers to Claude Code and you’ll burn 67,300 tokens before you type a character. Three big interventions shipped between November 2025 and February 2026 — Anthropic’s Tool Search Tool, Cloudflare’s Code Mode, and the MCP-code-execution pattern. Each ships a different version of the same idea: stop loading tool definitions you aren’t using. This is the neutral walk-through, with the verbatim numbers from the primary sources, plus what to do today in Claude Code, Cursor, and Cline.

TL;DR — the 67k number, and the three fixes
- The bloat is real. davidmoneil’s measurement on claude-code issue #11364 (filed November 10, 2025): seven MCP servers consume 67,300 tokens of tool definitions — 33.7% of a 200k context window — before the user types anything. GitHub MCP alone is ~18k tokens for 27 tools.
- Three families of fix shipped in 2026. Anthropic’s Tool Search Tool (Nov 24, 2025): “an 85% reduction in token usage while maintaining access to your full tool library.” Cloudflare’s Code Mode (Feb 20, 2026): “reduces the number of input tokens used by 99.9%.” Anthropic’s code execution with MCP (Nov 4, 2025): “150,000 tokens to 2,000 tokens — a time and cost saving of 98.7%.”
- They’re all the same idea. Progressive disclosure: stop loading tool schemas until the agent needs them. Tool Search loads on demand. Code Mode wraps the API surface in a typed SDK. Code execution presents servers as filesystem-like discoverable APIs.
- What to do today. If you’re on Claude Code v2.1.7 or later, Tool Search is already on. If you’re on Cursor or Cline, audit and trim — see What to do today.
We’ll cover each fix below, with verbatim quotes and the evaluation methodology behind each percentage. Then a reproducible script for measuring your own setup, a decision matrix, and the practical to-do list for today.
Why MCP burns tokens — the mechanism
The Model Context Protocol spec says nothing about lazy loading. When a client like Claude Code, Cursor, or Cline connects to an MCP server, it calls tools/list over JSON-RPC and gets back the server’s full tool catalogue: every tool name, the Markdown description, the JSON Schema for each parameter, and often inline examples. The client concatenates that into the system prompt and ships it to the model on every turn. Caching helps — Anthropic’s prompt caching can preserve those tokens across consecutive turns inside the same session — but the cost recurs whenever the cache misses, which on idle sessions or new sessions is always.
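Concretely, the exchange looks like this on the wire. A sketch with one illustrative GitHub-style tool; a real catalogue repeats this pattern dozens of times per server:

// The request a client sends over JSON-RPC after connecting:
{ "jsonrpc": "2.0", "id": 1, "method": "tools/list" }

// The shape of the response. Every entry lands verbatim in the system
// prompt, for every tool on every connected server:
{
  "tools": [
    {
      "name": "create_issue",
      "description": "Create a new issue in a GitHub repository. Supports labels, assignees, ...",
      "inputSchema": {
        "type": "object",
        "properties": {
          "owner": { "type": "string" },
          "repo": { "type": "string" },
          "title": { "type": "string" }
        },
        "required": ["owner", "repo", "title"]
      }
    }
  ]
}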
Three pieces of evidence make this concrete.
1. Anthropic’s own number. The advanced tool use post opens with: “That’s 58 tools consuming approximately 55K tokens before the conversation even starts.” And later: “At Anthropic, we’ve seen tool definitions consume 134K tokens before optimization.” Those are headline figures from the team that designed the protocol — not adversarial benchmarks.
2. The community measurement. A Claude Code user under the handle davidmoneil filed issue #11364 with the precise breakdown: seven servers (GitHub, Docker, Filesystem, Git, SSH, MCP Gateway with Memory/Fetch/Playwright) consuming 67,300 tokens. GitHub alone — “~18k tokens (27 tools, even in sessions that never use GitHub).” The issue title: “Lazy-load MCP tool definitions to reduce context usage.”
3. Simon Willison’s shorthand. In his widely-cited Claude Skills piece (October 16, 2025):
“The most significant is in terms of token usage: GitHub's official MCP on its own famously consumes tens of thousands of tokens of context, and once you've added a few more to that there's precious little space left for the LLM to actually do useful work.”
Simon Willison · Blog
Willison's framing of why Claude Skills matter, October 2025 — the canonical popular-press articulation of the bloat problem.
The mechanism behind these numbers is the same in every case. The MCP server’s tool catalogue is descriptive prose plus schema — for a well-documented tool, easily 200–800 tokens each. At 50 tools per server and 4–7 connected servers, that’s 40,000 tokens at the conservative end and well into six figures at the top. None of that overhead does work; it’s the price of letting the model know what’s available.
That’s the problem. The interesting part of 2026 is what three teams shipped to fix it.
The three families of fix
The interventions look different on the surface — different vendors, different APIs, different language — but underneath they’re three points on the same spectrum.
Family 1: Search-then-load
Anthropic’s Tool Search Tool. Tool descriptions are tagged defer_loading: true and replaced with a semantic search tool. The model searches first, loads only the matching tools’ full schema, then calls them. Headline: 85% token reduction.
Family 2: Code-as-API
Cloudflare’s Code Mode and Anthropic’s code execution with MCP. Replace many MCP tools with a typed SDK that lives outside the model. The agent writes JavaScript / TypeScript that calls the SDK in a sandbox, only reading back what it needs. Headline: 99.9% (Cloudflare) and 98.7% (Anthropic).
Family 3: Disclosure on demand
The architectural pattern, not a product. Tool catalogues expose only metadata up front; full descriptions arrive when the agent reads a specific tool. Anthropic’s Agent Skills are an example: each skill consumes ~100 tokens during metadata scanning and only loads its full body when relevant.
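For a sense of what that ~100-token metadata looks like, here is a hypothetical skill’s SKILL.md frontmatter (name and wording invented for illustration); only the part between the --- markers is scanned at session start:

---
name: release-notes
description: Draft release notes from merged PRs since the last tag. Use when the user asks for a changelog or release announcement.
---
(Instructions, helper scripts, and reference files below this line load
only when the agent decides the skill applies.)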
The right way to think about these is: family 1 is the conservative fix (keep MCP, just lazy-load), family 2 is the radical fix (replace MCP tool calls with code), family 3 is the principle that explains why both work. Most teams will use all three.
Anthropic Tool Search — what 85% means
Anthropic shipped the Tool Search Tool on November 24, 2025 as part of the broader “advanced tool use” release. The headline:
“This represents an 85% reduction in token usage while maintaining access to your full tool library.”
A second number circulating on Medium and X is 46.9% — 51K tokens down to 8.5K, attributed to Anthropic’s Tool Search. That figure traces to a single third-party article and doesn’t appear in Anthropic’s official copy, so we cite the sourced 85% throughout. If you see “46.9%”, it’s plausible but under-sourced — one workload’s measurement, not the headline metric.
How it works mechanically. Tools you mark with defer_loading: true aren’t injected into Claude’s context at session start. Instead, Claude sees the Tool Search Tool, which exposes a search query interface over your tool catalogue. When the agent needs a tool, it queries Tool Search, gets the matching tool name(s), and only then does Anthropic load the full schema for those specific tools. From the same post: “the Tool Search Tool discovers tools on-demand. Claude only sees the tools it actually needs for the current task.”
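In request terms, deferral is one flag per tool plus the search tool itself. A hedged sketch of the request body follows; the search tool’s version-dated type string is from memory, so verify it against Anthropic’s advanced tool use docs before copying. The defer_loading flag is verbatim from the post.

// Hedged sketch: one always-loaded search tool, every other tool deferred.
const body = {
  model: "claude-opus-4-7",
  tools: [
    // The search interface Claude sees up front. The type string is
    // version-dated; check Anthropic's advanced tool use docs.
    { type: "tool_search_tool_regex_20251119", name: "tool_search" },
    {
      name: "create_issue",
      description: "Create a new issue in a GitHub repository ...",
      input_schema: { type: "object", properties: {} },
      // From the post: deferred tools stay out of context until a
      // tool_search query surfaces them.
      defer_loading: true,
    },
  ],
  messages: [{ role: "user", content: "File a bug about the flaky login test" }],
};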
Methodology behind the percentages. Anthropic reports two things. The 85% number is overall token reduction in tool definitions. They also publish accuracy improvements on MCP evaluations: “Opus 4 improved from 49% to 74%, and Opus 4.5 improved from 79.5% to 88.1% with Tool Search Tool enabled.” That’s arguably the more important number — fewer tokens isn’t worth much if accuracy drops, and Anthropic claims it actually rises.
Trade-offs. Two real ones. First, latency: every Tool Search step adds an extra model turn before the actual tool call. For one-shot prompts that fire a single tool, you pay an extra round-trip to save tokens you weren’t going to spend much time accumulating. Second, false negatives: the agent can fail to retrieve the right tool from search because the description doesn’t match the model’s paraphrase of intent. Both are mitigated by writing tighter descriptions and naming tools well.
Claude Code rolled this out as MCP Tool Search starting with v2.1.7. The behaviour is automatic: when your active MCP tool descriptions exceed 10% of the context budget, the client switches to deferred loading. You don’t configure it. You can verify it’s active by running /context inside Claude Code — the “MCP tools” line should drop sharply.
Cloudflare Code Mode — what 99.9% means
Cloudflare’s Matt Carey published “Code Mode: give agents an entire API in 1,000 tokens” on February 20, 2026. The headline metric:
“For a large API like the Cloudflare API, Code Mode reduces the number of input tokens used by 99.9%.”
That number specifically applies to Cloudflare’s own API — over 2,500 endpoints. The control case Cloudflare benchmarks is “an equivalent MCP server without Code Mode would consume 1.17 million tokens — more than the entire context window of the most advanced foundation models.” The Code Mode version reduces that to about 1,000 tokens.
How it works. Two MCP tools replace the entire surface area: search() — “the agent calls search(). It writes JavaScript against a typed representation of the OpenAPI spec.” And execute() — “When the agent is ready to act, it calls execute(). The agent writes code that can make Cloudflare API requests, handle pagination, check responses, and chain operations together.”
The code runs in a Cloudflare Workers Dynamic Worker Loader — a V8 isolate sandbox that has no internet access except through the typed SDK Cloudflare exposes. From the original September 26, 2025 Code Mode announcement by Kenton Varda: “The code is then executed in a secure sandbox. The sandbox is totally isolated from the Internet. Its only access to the outside world is through the TypeScript APIs representing its connected MCP servers.” That’s the key safety property — the model can’t exfiltrate data, only call the SDK methods you’ve granted.
Why it works — Varda’s opening claim: “LLMs are better at writing code to call MCP, than at calling MCP directly.” Code is a more compact, more familiar representation than free-text JSON tool calls. Models train on billions of TypeScript files; they’ve seen API calls wrapped in retries, pagination loops, and error handling. Given a typed SDK, the model can chain calls in a single block — pagination, filtering, error handling — without the client having to round-trip every intermediate result back through the model.
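What that looks like in practice: a hypothetical snippet an agent might hand to execute(), written against a typed SDK of the shape the post describes. The cf client and its method names are illustrative, not Cloudflare’s actual SDK surface.

// Hypothetical agent-written code for execute(): find deprecated DNS
// records across every zone. In classic MCP this is several tool calls,
// each response round-tripped through the model; here the loops run in
// the sandbox and only `stale` re-enters context.
const stale: { zone: string; record: string }[] = [];
for (const zone of await cf.zones.list()) {
  // Pagination handled in code, not in extra model turns.
  for await (const rec of cf.dns.records.list({ zoneId: zone.id })) {
    if (rec.comment?.includes("deprecated")) {
      stale.push({ zone: zone.name, record: rec.name });
    }
  }
}
return stale; // the only data the model ever sees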
Trade-offs. Code Mode requires sandboxing infrastructure. Cloudflare ships theirs as a Workers feature; if you’re self-hosting, you need an equivalent (Deno Deploy isolates, Lambda + container, Firecracker, etc.). It’s also more sensitive to model quality — generating correct code is harder than picking a tool. The 99.9% figure applies to a specific control (an MCP wrapper of the entire Cloudflare API). For smaller APIs the gap shrinks; the value of Code Mode is largest when the underlying tool surface is vast.
Where it’s relevant for everyone else. Cloudflare deploys Code Mode for their own API, but the pattern is replicable. Any platform with hundreds of endpoints — AWS, GCP, Azure, Stripe’s full surface, GitHub’s full surface — benefits more from Code Mode than from MCP-style tool catalogues.
Anthropic code execution with MCP — the 98.7% case
Anthropic shipped a complementary post on November 4, 2025: “Code execution with MCP: Building more efficient agents”. The headline:
“This reduces the token usage from 150,000 tokens to 2,000 tokens — a time and cost saving of 98.7%.”
The approach is structurally identical to Cloudflare’s: expose MCP servers as TypeScript files in a directory the agent explores, let the agent write code that imports and chains their methods, execute that code in a sandbox, return only the final result to the model.
Anthropic’s own framing of why this matters, paraphrased from the post: most MCP clients load all tool definitions upfront, exposing them with direct tool-calling syntax; tool results and definitions can consume 50,000+ tokens before an agent reads a request. With code execution, agents can load only the tools they need and process data in the sandbox before passing results back to the model.
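The shape of that directory, sketched here with invented server and file names; the agent discovers tools by listing files and reads a definition only when it needs one:

// Sketch of the servers-as-code layout (names illustrative):
//
//   servers/
//     github/
//       createIssue.ts        // export async function createIssue(input): Promise<Issue>
//       listPullRequests.ts
//     gdrive/
//       getDocument.ts
//
// Instead of 50 schemas in the system prompt, the agent imports exactly
// what the task needs:
import { getDocument } from "./servers/gdrive/getDocument";
const doc = await getDocument({ documentId: "abc123" });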
Simon Willison followed up the same day with a sharper articulation of the underlying problem:
“all of those tool descriptions take up a lot of valuable real estate in the agent context even before you start using them”
Simon Willison · Blog
Willison's response to Anthropic's code-execution-with-MCP post — the clearest one-sentence statement of the bloat tax.
And on the chaining penalty — why it’s not enough to just cut the upfront cost:
“chaining multiple MCP tools together involves passing their responses through the context, absorbing more valuable tokens and introducing chances for the LLM to make additional mistakes”
Simon Willison · Blog
The second-order cost: even after fixing tool definitions, every intermediate result still passes through the model. Code execution removes that too.
That’s the part the percentage doesn’t capture. When you chain three MCP tool calls in the traditional way, each tool’s output becomes the next tool’s input, round-tripped through the model. With code execution, the code in the sandbox handles the chaining and only returns the final answer. That removes both the tool-definition cost (upfront) and the response-token cost (per-call).
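A sketch of that chaining win, reusing the invented servers-as-code modules from above; the sheet payload and every intermediate result stay in the sandbox, and one short string returns to the model:

// Hypothetical sandbox code chaining two MCP-backed servers.
import { getSheet } from "./servers/sheets/getSheet";
import { createIssue } from "./servers/github/createIssue";

const sheet = await getSheet({ sheetId: "qa-results" }); // large payload, never enters context
const failing = sheet.rows.filter((r: any) => r.status === "fail");
for (const row of failing) {
  await createIssue({ title: `Failing check: ${row.name}`, body: row.notes });
}
return `Filed ${failing.length} issues`; // a few tokens back to the model, not tens of thousands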
Other techniques worth knowing
The three big-vendor fixes get the headlines, but they’re not the whole story. Five complementary techniques that come up in practice:
1. Subagents (separate context). Hand a sub-task to a subagent — Claude Code’s Task tool, Claude Agent SDK, or your own orchestrator — and the subagent gets a fresh context window. You pay the tool-definition cost once per subagent, and only the final result returns to the parent. This is the technique that powers most production long-running-agent setups, frequently in combination with deferred loading inside the subagent. See the Claude Agent SDK announcement.
2. CLI shell-out. Simon Willison’s position. From the same Claude Skills post:
“Almost everything I might achieve with an MCP can be handled by a CLI tool instead. LLMs know how to call cli-tool --help, which means you don't have to spend many tokens describing how to use them.”
Simon Willison · Blog
The 'just use the shell' position — token-efficient because CLI tools document themselves at runtime via --help.
The CLI shell-out approach lets a coding agent call gh, jq, kubectl, or aws directly. The agent learns each tool’s signature on demand from --help. Cost: a couple of round-trips of latency. Benefit: zero upfront token tax, and the same skills compose with everything you already have in $PATH. See also Vercel’s AGENTS.md vs Skills evals — same family of argument, applied to docs rather than tools.
3. Tool subset selection. The blunt fix. Most MCP servers expose toolsets you can disable in the client config. The official GitHub MCP server lets you opt into repos, issues, pull_requests, actions, code_security, experiments independently. If you only use the issues toolset, disabling the rest reclaims most of the 18k tokens. Read the README of each server you’ve installed and audit aggressively; a config sketch follows this list.
4. Lazy loading via Skills. Anthropic’s Agent Skills are a different shape of progressive disclosure. Each skill is a folder; only the metadata (name, description, ~100 tokens) is scanned at session start. The skill’s full body — instructions, scripts, files — loads only when the agent decides the skill is relevant. For task-specific automation this is often a better fit than installing yet another MCP server. Read What are Claude Code Skills for the full pattern.
5. MCP gateways. Composio’s MCP Gateway (300+ integrations behind one URL), Glama’s registry-as-a-gateway, and Strata’s 22-ecosystem aggregator each ship a single MCP endpoint that fans out to many backends. They solve auth and credential management cleanly. Whether they fix tokens depends on whether the gateway implements deferred loading itself. MetaMCP on the directory is one example of a gateway-style server. Verify by running the count-tokens script (next section) before and after connecting the gateway.
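The config sketch promised in technique 3: trimming the official GitHub MCP server to two toolsets in an mcp.json, based on the server’s documented GITHUB_TOOLSETS variable (check its README for the current flags and toolset names):

{
  "mcpServers": {
    "github": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "GITHUB_PERSONAL_ACCESS_TOKEN",
        "-e", "GITHUB_TOOLSETS=repos,issues",
        "ghcr.io/github/github-mcp-server"
      ],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_your_token_here" }
    }
  }
}

Re-run the count-tokens script (next section) before and after the trim; the delta tells you exactly what the disabled toolsets were costing.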
Measure your own MCP — a reproducible script
The most valuable thing you can do is stop arguing about percentages and measure your own setup. Anthropic’s Messages API has a count_tokens endpoint that’s free to use and accepts the same tools array as a real call. The endpoint is https://api.anthropic.com/v1/messages/count_tokens; it returns input_tokens.
The simplest measurement: run a one-line user prompt against the count_tokens endpoint twice, once with no tools and once with the MCP server’s full tools/list response in the tools array. The delta is your tool-definition cost.
// measure-mcp.ts — count tool-definition tokens for any MCP server
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// 1. Spawn your MCP server (stdio) and call tools/list.
//    Substitute the command for whatever server you're measuring,
//    e.g. ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
//    or ["docker", "run", "-i", "ghcr.io/github/github-mcp-server"].
const tools = await listToolsFromMcp([
  "npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp",
]);

// 2. Baseline: count tokens for a one-line prompt with no tools.
const baseline = await client.messages.countTokens({
  model: "claude-opus-4-7",
  messages: [{ role: "user", content: "hi" }],
});

// 3. With tools: same prompt, with the MCP server's tools attached.
const withTools = await client.messages.countTokens({
  model: "claude-opus-4-7",
  tools: tools.map((t: any) => ({
    name: t.name,
    description: t.description,
    input_schema: t.inputSchema,
  })),
  messages: [{ role: "user", content: "hi" }],
});

console.log("baseline:", baseline.input_tokens);
console.log("with tools:", withTools.input_tokens);
console.log(
  "tool-definition cost:",
  withTools.input_tokens - baseline.input_tokens, "tokens",
);
Run this against each server you’re considering and you’ll have hard numbers in five minutes. Sample expected output (rough): filesystem ~3.5k, Notion ~9k, GitHub ~18k, Stripe ~25k+, Cloudflare API ~1.17M (per Cloudflare’s own count). The listToolsFromMcp helper just speaks JSON-RPC to the spawned process; see the MCP spec or the TypeScript SDK README.
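If you’d rather not hand-roll the JSON-RPC, here is a minimal sketch of that helper using the official @modelcontextprotocol/sdk client, which handles the spawn and the handshake:

// listToolsFromMcp: a sketch using the official MCP TypeScript SDK.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function listToolsFromMcp([command, ...args]: string[]) {
  const transport = new StdioClientTransport({ command, args });
  const client = new Client({ name: "measure-mcp", version: "1.0.0" });
  await client.connect(transport);            // spawns the server and completes the handshake
  const { tools } = await client.listTools(); // the tools/list call from the section above
  await client.close();
  return tools;
}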
If you don’t want to wire up a stdio MCP client, the shortcut is to copy the tools/list response from a packet capture or from a hosted server’s OpenAPI spec and paste it into the tools array directly. That’s how Cloudflare arrived at the 1.17M figure for their own API — they translated the OpenAPI spec to MCP tool shape, called count_tokens, read the number.
Decision matrix — which fix for your shape
The answer depends on the shape of your tool surface. Five common shapes:
| Shape | Symptom | Use |
|---|---|---|
| Small (1–3 servers, <15 tools) | MCP overhead under 10k tokens — bloat is not your real problem | Skip the fixes. Audit, trim toolsets, focus on prompt design. |
| Medium (4–10 servers, 30–80 tools) | Overhead 30k–80k tokens; sessions feel slow on first turn | Tool Search (Claude Code 2.1.7+). Trim aggressively. Skills for repeated tasks. |
| Large (massive single API, 100s–1000s of endpoints) | MCP wrapper would exceed context window; team builds internal API agents | Code Mode pattern (Cloudflare or Anthropic style). Wrap your API as a typed SDK; sandbox the execution. |
| Enterprise (multi-team, audit, RBAC) | Tool sprawl across teams; security wants central allow-list | MCP gateway (Composio, Strata, MetaMCP). Verify the gateway itself does deferred loading or you've shifted the bloat without fixing it. |
| Long-running agent (1+ hour sessions) | Idle-cache misses, context fills with intermediate results | Subagents for parallel work + Skills for recurring patterns + code execution for chained MCP calls. |
Two heuristics. Under 30 tools: fix nothing first. The bloat narrative dominates Twitter, but most personal Claude Code setups never breach 10k of overhead, so audit before optimising. Over 100 tools: traditional MCP won’t fit; either move to Code Mode or split into subagents with disjoint tool sets. The middle band, 30 to 100, is where Tool Search earns its keep.
What to do today — Claude Code, Cursor, Cline
Concrete steps for the three big editor clients, today.
Claude Code
- Run claude --version. If you’re on v2.1.7 or later, MCP Tool Search is already on automatically when active tool descriptions exceed 10% of context. There’s nothing to enable.
- Run /context to see the current MCP overhead. If the “MCP tools” line is >15% of your context, audit your ~/.claude/settings.json and comment out servers you don’t actively use.
- For each remaining server, check its README for toolset flags. The official GitHub MCP server has --toolsets; opt in to only what you need.
- Replace recurring task-specific MCP servers with Claude Code Skills. Skills consume ~100 tokens per skill at scan time vs. the full schema cost of an MCP server.
Cursor
- Cursor scopes MCP servers per workspace via .cursor/mcp.json. Don’t install org-wide servers globally; install them only in projects that need them.
- Cursor doesn’t yet ship deferred loading natively (as of May 2026). The interventions available are: trim, scope per-project, and prefer hosted remote MCP endpoints (which let the server choose what to expose).
- For very large surfaces, run a Cloudflare-style Code Mode wrapper as your only MCP server. Cursor sees two tools; the worker handles the rest.
Cline
- Cline supports per-server enable/disable in the marketplace UI. Disable everything you don’t use today; re-enable on demand.
- Cline’s “auto-approve” settings amplify cost — every auto-approved tool fires unmoderated. If you’ve got expensive tools (anything that scans a filesystem or does network I/O), keep them in approval prompts, not auto-approve.
- Cline + Code Mode works well: the model writes JS, Cline executes, only the final output returns to context.
Community signal
“Idle sessions face full cache misses, consuming significant tokens.”
Anthropic (engineering) · Hacker News
Boris Cherny, on the April 23, 2026 Claude Code postmortem HN thread (942 points). Establishes the cost mechanism: caching helps inside warm sessions; the moment a session goes idle, every tool definition reloads into context.
“I don't use MCP at all any more when working with coding agents - I find CLI utilities and libraries like Playwright Python to be a more effective way of achieving the same goals”
Simon Willison · Blog
The hard-line position — that for some workloads MCP's cost just isn't worth it. Worth taking seriously even if you ultimately decide otherwise.
“Currently, all tool definitions for all active MCP servers are loaded into the conversation context for every session, consuming significant token budget regardless of whether the tools are actually used.”
davidmoneil · GitHub
The opening of claude-code issue #11364 (Nov 10, 2025), the artifact most often cited as 'the 67k tokens issue'. Closed as duplicate after MCP Tool Search shipped.
Frequently asked questions
Why is my Claude using so many tokens before I type?
Every connected MCP server's full tool schema — tool names, descriptions, parameter types, JSON Schemas — gets injected into the system prompt at session start. With 7 MCP servers active, one Claude Code user (issue #11364) measured 67,300 tokens consumed before the first user message, 33.7% of a 200k context budget. That's the canonical "MCP context bloat" problem.
How many tokens does the GitHub MCP use?
Roughly 18,000 tokens for the official GitHub MCP server with 27 tools enabled, per the same davidmoneil measurement in claude-code issue #11364. Simon Willison summarised this in October 2025: "GitHub's official MCP on its own famously consumes tens of thousands of tokens of context." The exact number depends on which toolsets you enable — repos, issues, pull requests, actions, code security each add tools.
Does Anthropic's Tool Search really cut MCP tokens 46.9%?
The number Anthropic published is 85%, not 46.9%. From the November 24, 2025 engineering post: "This represents an 85% reduction in token usage while maintaining access to your full tool library." The 46.9% figure circulating on Medium/Twitter is a single benchmark from one third-party article (51K tokens down to 8.5K) — it's plausible but not Anthropic's headline. Cite the 85% figure with the source.
What is Cloudflare Code Mode and is the 99.9% claim real?
Yes — for the specific scenario Cloudflare benchmarked. From the February 20, 2026 post by Matt Carey: "For a large API like the Cloudflare API, Code Mode reduces the number of input tokens used by 99.9%." The control case is the Cloudflare API's 2,500+ endpoints exposed as MCP tools, which Cloudflare estimates would consume 1.17 million tokens. Code Mode replaces all that with two tools (search() and execute()) plus a TypeScript SDK — about 1,000 tokens.
What is MCP progressive disclosure?
Progressive disclosure is the architectural pattern underneath both Tool Search and Code Mode: don't load tool definitions until the agent actually needs them. Instead of injecting all 50+ tools at session start, the agent first sees only tool names (or a search index, or a typed SDK), and the full schema loads on-demand. Anthropic's defer_loading: true flag, Claude Code's MCP Tool Search, and Cloudflare's search()/execute() pair are all instances of the same pattern.
How do I count how many tokens an MCP server is using?
Anthropic's Messages count_tokens endpoint accepts the same tools array as a real call and returns input_tokens. Pass the MCP server's full tool schema in the tools array along with a one-line user message; the difference between that count and the count without tools is your tool-definition cost. Endpoint: POST https://api.anthropic.com/v1/messages/count_tokens — free to use, separate rate limit from message creation.
Why is Claude Code expensive?
Three structural reasons, in order of impact: (1) tool definitions from connected MCP servers stay in context every turn, so a 67k-token bloat compounds across a long session; (2) every file Claude reads stays in context permanently for that session; (3) prompt caching mitigates the cost only when sessions stay warm — idle sessions over an hour incur full cache misses. Anthropic's April 23, 2026 postmortem documents the third issue specifically.
Should I use MCP Tool Search or Code Mode?
Different problems. Tool Search is the right fix if your servers expose 30–200 tools and you want to keep using MCP-style tool calls (Anthropic's pattern). Code Mode is the right fix if your servers expose hundreds-to-thousands of tools (a large API surface) and you can run code in a sandbox (Cloudflare's pattern). Most Claude Code, Cursor, and Cline users will hit Tool Search first; teams wrapping internal APIs at scale will reach for Code Mode.
How do I reduce MCP tokens in Cursor or Cline today?
Three steps. (1) Audit: list every connected MCP server and disable any you haven't invoked in the past two weeks. (2) Trim: most servers expose toolsets — turn off the ones you don't need (e.g. GitHub MCP can run with just repos+issues, dropping ~10 tools). (3) Adopt deferred-loading where the client supports it: Claude Code v2.1.7+ ships MCP Tool Search; Cline supports per-server enable/disable; Cursor lets you scope MCP servers per workspace.
Is progressive disclosure the same as a subagent?
No, but they're complementary. A subagent (Task tool, Claude Agent SDK) gets its own context window — so giving a refactor task to a subagent isolates its tool-definition cost from the parent. Progressive disclosure happens within one agent's context: tools load on-demand instead of upfront. Real production setups use both: subagents for parallelism, deferred loading inside each subagent.
What about CLI tools instead of MCP — does that help?
Yes, sometimes a lot. Simon Willison's argument: "Almost everything I might achieve with an MCP can be handled by a CLI tool instead. LLMs know how to call cli-tool --help, which means you don't have to spend many tokens describing how to use them." gh, jq, kubectl, and similar tools document themselves at runtime. The trade-off is the agent has to discover behaviour through --help, which costs latency.
Are MCP gateways like Composio, Strata, or Glama a real fix?
They help operationally (one endpoint, one auth, one allow-list) but only fix tokens if they implement deferred loading themselves. Composio's MCP Gateway (300+ integrations behind one URL) and Strata's 22-ecosystem aggregator both publish dynamic tool selection in 2026. Verify by counting tokens: if connecting the gateway adds the same N tokens for each underlying integration, the gateway hasn't fixed bloat — it's just hidden it.
Sources
Primary — Anthropic
- anthropic.com/engineering/advanced-tool-use — Tool Search Tool, Programmatic Tool Calling, the 85% number, the 49→74 / 79.5→88.1 accuracy claims (Nov 24, 2025)
- anthropic.com/engineering/code-execution-with-mcp — the 150K → 2K (98.7%) example, presenting MCP servers as code APIs (Nov 4, 2025)
- platform.claude.com — Token counting — the count_tokens API used in the measurement script
- platform.claude.com — Agent Skills overview — the ~100-tokens-at-scan claim for Skills
Primary — Cloudflare
- blog.cloudflare.com/code-mode-mcp — the 99.9% number, 1.17M tokens control, search() / execute() architecture (Matt Carey, Feb 20, 2026)
- blog.cloudflare.com/code-mode — the original Code Mode announcement, V8 isolate sandboxing (Kenton Varda, Sep 26, 2025)
Community / corroborating
- claude-code issue #11364 — davidmoneil’s 67,300-token measurement, GitHub MCP at ~18k (Nov 10, 2025)
- simonwillison.net — Claude Skills are awesome — the “tens of thousands of tokens” line, the CLI position (Oct 16, 2025)
- simonwillison.net — Code execution with MCP — the chaining-tax framing (Nov 4, 2025)
- vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals — Jude Gao’s eval (Jan 27, 2026); related passive-context argument
- HN 47878905 — April 23 Claude Code postmortem — 942 points; the cache-miss / idle-session mechanism
- paddo.dev — Claude Code’s Hidden MCP Flag — 31.7k tokens recovered with the ENABLE_TOOL_SEARCH flag (Dec 29, 2025; updated Feb 2026)
- layered.dev — MCP tool schema bloat — 91% savings (54,604 → 4,899 tokens) reference implementation (Jan 16, 2026)
Internal links
- /blog/what-is-mcp — protocol primer
- /blog/what-are-claude-code-skills — Skills as the lazy-loading cousin
- /blog/claude-sonnet-4-6-mcp-whats-new — the 1M context window
- /blog/anthropic-launches-claude-agent-sdk — subagent / orchestrator pattern
- /blog/revolution-in-mcp-how-code-changes-the-game-with-ai-agents — the code-as-API view
- /blog/context7-vs-deepwiki-vs-ref-vs-docfork-2026 — docs-RAG MCP servers (a different token-cost angle)
- /blog/best-mcp-servers-for-coding-development
- /blog/most-popular-mcp-tools-2026
- /blog/mcp-security-checklist-for-teams
- /servers/github, /servers/notion, /servers/context7, /servers/ref-tools, /servers/filesystem, /servers/metamcp
- /best-mcp-servers — curated roundup
- /servers — browse all 3,000+