Stop the Token Bleeding: An Always-Applied Cost Gate for MCP-Heavy Cursor Agents

Stop the Token Bleeding: An Always-Applied Cost Gate for MCP-Heavy Cursor Agents

TL;DR

I run a Cursor agent on a backend QA monorepo with 8 MCP servers (Atlassian, Jenkins, Playwright, OpenSearch, Prometheus, k8s, Slack, plus a custom memory MCP for cross-session context). Even after I wrote per-MCP rules with head_limit, _source whitelists, and tail_lines defaults, my agent still occasionally pulled 30k+ token payloads from a single tool call.

I added one always-applied section to my master rule that:

  1. Lists 4 trigger conditions (Read >200 lines, wide Grep with output_mode: content and no head_limit, WebSearch/WebFetch, MCP call with predicted payload >3k tokens).
  2. Requires a one-line cost justification before the call.
  3. Writes a [PITFALL] entry to cross-session memory if I catch the agent skipping the gate.

This roughly does at the call-site / rule layer what Dynamic Context Discovery now does at the product/runtime layer — with the addition of a memory feedback loop so the agent compounds across sessions.

Sharing the rule fragment + the PITFALL fallback in case anyone wants to adopt the pattern.


The Problem

A sample of how my agent burned context before this rule existed:

  • confluence_get_page returning a full HTML page when I only needed one section
  • get_build_console_output (Jenkins) without pattern + limit, dumping ~50k lines
  • Grep with output_mode: content and no head_limit against **/*.md, returning 200+ matches
  • Read on a 700-line script when I only needed lines 40-90
  • MCP search_documents (OpenSearch) without a _source whitelist, returning full ES documents

Each of these is preventable with the right params. But the agent’s default — just fetch it — was the failure mode. Per-MCP rules I’d already written did not fire reliably enough; the agent kept taking the path of least resistance.

What I needed was an upstream gate: a one-token cost moment before any of these patterns, regardless of which MCP it was.


The Rule (always-applied section, lightly edited)

## §10 Tool-Call Cost Gate

Before any high-cost tool call, the agent MUST emit a one-line cost
justification. Skipping it is a violation of "Think Before Coding".

### §10.1 Trigger Conditions (any of)

| Condition | Why |
|---|---|
| `Read` line range > 200, or whole-file read on unknown total | unbounded payload |
| `Grep` wide pattern + `output_mode: content` and no `head_limit` | result count not capped |
| `WebSearch` / `WebFetch` | external return is unbounded |
| MCP call with predicted payload > 3k tokens (full Confluence page, Jenkins console without `pattern`, OpenSearch search without `_source`, Playwright snapshot without `filename`/`depth`, etc.) | per-call params do not always enforce limit |

### §10.2 Declaration Format (one line, prepended to the tool call)

> "Read/Grep/MCP X (~Nk) → to obtain <specific info>; cheaper paths
> excluded (precise section / `offset+limit` / Grep first to locate);
> choosing <action> because <reason>."

### §10.3 Exemptions

1. User explicitly said "read the whole thing" / "load everything"
2. Target file ≤ 200 lines (frontmatter included)
3. Has `offset + limit` AND window ≤ 200 lines
4. Small Shell call (e.g. `git status`, `kubectl get`, predicted < 1k)

### §10.4 Violation Handling (soft, traceable via PITFALL)

This rule does not hard-block tool calls. Instead, during the next
`/save` to memory, the agent MUST scan the session for violations.
For each violation (e.g. a 500+ line full-file read where < 20% was
actually used), it MUST write a `[PITFALL]` entry to cross-session
memory so later sessions inherit the lesson.

The §10.4 piece is the part most people miss. Without it, the rule is purely advisory and the agent forgets it the moment it sees a juicy file.


How the [PITFALL] Loop Closes the Gap

I use a memory MCP with structured prefixes — [RULE] / [LESSON] / [PITFALL] / [POINTER] / [CONTEXT:task/...] — but the loop works with anything persistent: a .brain/ folder of .md files, a single MEMORIES.md, or any of the community memory projects you can find by searching “cursor-memory” or “.brain folder” on this forum. The prefixes are what matter, not the storage.

The /save command runs:

Scan this session for cost-gate violations. For each, append a [PITFALL] entry like:
[PITFALL] Read full perf script (760 lines) when only lines 40-90 were needed. Next time use offset+limit.

Future sessions retrieve [PITFALL] entries on the relevant intent at the start of the turn. Over a few weeks, the agent’s “default” behavior shifts toward narrow reads. Without this loop, §10 decays — the agent keeps re-violating the same pattern.


What the Loop Looks Like in Memory After a Few Weeks

Interesting side observation: when I retrieved everything in my memory store tagged near “cost gate / token consumption / wasted payload”, what comes back is pattern-level lessons, not incident-level PITFALLs. Two real entries (lightly redacted):

[LESSON] For simple queries, prefer the Shell tool over an MCP call
        to avoid unnecessary context consumption.

[RULE]  Detailed inventory and tool reference docs are loaded
        on-demand only when the user intent matches a relevant skill.

Actual [PITFALL] cost-gate violations are rare in my store. I read this as evidence that the gate prevents most violations before they need to be remembered — which is the point. But it also means I have less hard incident data than I’d like; if anyone runs §10 in a higher-traffic environment and wants to compare PITFALL accumulation rates, would love to swap notes.


How This Stacks with Dynamic Context Discovery

The Dynamic Context Discovery post linked in the TL;DR reports roughly -46.9% tokens on MCP-heavy runs by deferring tool-descriptor loading. That’s a complementary product-level fix. My cost gate is call-site-level: even after a tool is “discovered” and loaded, the agent still has to justify each invocation.

Layer What it cuts Where it lives
Dynamic Context Discovery tool-descriptor loading product / runtime
Per-MCP discipline rule per-MCP param defaults (limit, _source, tail_lines) requestable rule
Cost gate (§10) per-call invocation pattern always-applied rule
[PITFALL] memory cross-session learning memory store

In practice the four layers stack. Dynamic discovery filters trivial calls upstream; the cost gate catches what discovery can’t see (e.g. a Read on a known 700-line in-repo file); per-MCP rules supply the right params once a call is committed; PITFALL closes the loop across sessions.


Three Real Cases (anonymized, payload sizes are rough estimates)

Case A — Confluence audit. Agent was about to confluence_get_page on a 30+ section design doc. Cost line forced it to enumerate sections via confluence_search first, then get_page with convert_to_markdown=true only on the relevant 2 sections. Estimated ~9k → ~2k tokens.

Case B — Jenkins triage. Agent had get_build_console_output queued without pattern. Cost line forced it to add pattern: "ERROR|FAIL|Exception" + limit: 200. Console payload ~40k → ~3k tokens.

Case C — Repo-wide grep. User asked “where is X used”. Agent’s first move was Grep with output_mode: content and no head_limit. Cost line made it switch to output_mode: files_with_matches first, then narrow to one file. ~60 matches displayed → 8 file paths displayed.

In each, the gate did not block the call. It just forced a one-second pause where the agent enumerated cheaper paths and took one.


Caveats

1. It’s a soft rule. Custom rules aren’t hard-enforced; the [PITFALL] loop is the only reliable feedback. The forum thread alwaysApply: true rules are being completely ignored now (#158551) makes this risk concrete; a parallel thread titled “AGENTS.md not automatically injected” reports the same failure mode for AGENTS.md files (search the forum to find it).

The mitigation I use is a single-line self-probe at the very top of the master rule:

On the FIRST response of every session, prepend the literal token
`MC-OK` to your reply (no other prefix), so the user can confirm the
master rule was loaded. Do this exactly once per session.

If the agent’s first reply doesn’t contain MC-OK, the rule wasn’t injected and I bail out before sinking real work into a session that will silently ignore §10. Cost: zero tokens. Detection latency: one round-trip. Recommended for anyone running large always-applied rules.

2. It can over-fire. Early on, the agent prepended cost lines for tiny Reads on 50-line files. The §10.3 exemptions fixed that.

3. The 3k-token threshold is a heuristic. I don’t have ground-truth payload distributions per MCP tool; this is conservative. If anyone has measured this for common MCPs (Atlassian, Jenkins, OpenSearch, Playwright), would love a cross-check.

4. Doesn’t cover Shell commands fully. §10.3 exemption #4 lets routine kubectl get / git status through, but a wide kubectl logs without --tail slips by today. Probably needs a 5th trigger condition.


Discussion

Two questions I’d love staff or community feedback on:

  1. Is there a way to surface predicted payload size before a tool call fires? Right now the agent self-estimates with no ground truth. If Cursor exposed a “predicted output size” hint per tool (or a stable byte-budget knob), §10 could be deterministic instead of heuristic — and the Dynamic Context Discovery savings could be made user-tunable.

  2. Is there an official “rule injection verification” channel? My self-probe trick works but is fragile. A tiny Cursor API like getActiveRules() or even just a stable banner emitted when alwaysApply: true rules are present would let users (and rule libraries) trust their setup without round-tripping through an LLM. Several recent threads — the alwaysApply one linked in Caveat 1, a parallel AGENTS.md not automatically injected report, and a Cursor 3.0.16 macOS report titled “alwaysApply: true rules and .cursorrules both silently treated as requestable” — together suggest this is a recurring pain.


Cheat Sheet

4 trigger conditions
→ 1-line cost declaration before the tool call
→ [PITFALL] entry on violation, written at /save time
→ retrieved at the start of the next session
→ +1 self-probe line in the master rule for injection verification

Soft rule. Complements Dynamic Context Discovery at the call-site layer. Decays without the PITFALL loop. Works with any persistent memory backend (.txt, .md, MCP, doesn’t matter — prefixes do).

Happy to share the full master rule file if there’s interest.