Context Window (Must Know if You Don't Know)

TL;DR A context window is an LLM’s short-term working memory. A “128 k-token” model (≈ 96 k words, or roughly a 300-page book) can see at most 128 000 tokens at once, including every previous user message, assistant reply, the system prompt, and the new answer it is about to write. Tokens are counted and billed every turn, so the history grows linearly until you trim or summarize it.

1. Tokens & Context Windows
  • Token ≈ ¾ of an English word (1 token ≈ 4 characters); see the counting sketch below

  • The context window is the total input + output the model can process in one go—its “working memory,” separate from its long-term pre-training data.
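To make the ¾-word rule of thumb concrete, here is a minimal counting sketch. It assumes the tiktoken package (mentioned in the tips below) is installed; cl100k_base is only an example encoding, so substitute whatever your model actually uses.

```python
import tiktoken

# Load a tokenizer; cl100k_base is the encoding used by many recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "The context window is the model's short-term working memory."
tokens = enc.encode(text)

print(len(text.split()), "words")  # rough word count
print(len(tokens), "tokens")       # actual token count, typically ~4/3 the word count
```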

2. What “128 k Context” Means in Plain English

  • OpenAI’s GPT-4o exposes a 128 000-token limit; Anthropic’s Claude 3 goes to 200 k.

  • 128 k tokens ≈ 96 k words ≈ 300 paperback pages.

  • You can reserve any slice of that window for the reply: e.g., if you want a 4 096-token completion, your prompt (system + history + current user message) must stay ≤ 123 904 tokens (a budgeting sketch follows this list).
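That budgeting arithmetic is simple enough to put in code. A minimal sketch, assuming the 128 000-token window and the 4 096-token reply reservation from the example above (the constant and function names are just illustrative):

```python
CONTEXT_WINDOW = 128_000   # total tokens the model can see in one call
MAX_COMPLETION = 4_096     # tokens reserved for the reply

def prompt_budget(context_window: int = CONTEXT_WINDOW,
                  max_completion: int = MAX_COMPLETION) -> int:
    """Tokens left for the system prompt + history + current user message."""
    return context_window - max_completion

print(prompt_budget())  # 123904
```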

3. How the 128 k Is Consumed During Chat

| Turn | What goes in | Running total (tokens) | Notes |
|------|--------------|------------------------|-------|
| 0 | System message (e.g. 500 t) | 500 | Sent every call (community.openai.com) |
| 1 | + User #1 (1 000 t) | 1 500 | |
| 1 | → Assistant #1 (800 t output) | 2 300 | Output also stored as future input (community.openai.com) |
| 2 | + User #2 (600 t) | 2 900 | |
| 2 | → Assistant #2 (800 t) | 3 700 | |

Each API call resends the entire history, so you pay for those tokens again (community.openai.com). If a future prompt would push the sum past 128 k, you must do one (or more) of the following (a trimming sketch follows the list):

  1. Drop or summarize old turns (first-in-first-out).

  2. Compress them (tools like RAG or “middle-out” transforms).

  3. Move fixed, lengthy instructions to an external retrieval step.
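Option 1 is easy to sketch: drop the oldest non-system turns until the conversation fits the budget again. This is a minimal illustration, not production code; the token counter here is a crude 4-characters-per-token estimate, so swap in a real counter such as tiktoken.

```python
def count_tokens(message: dict) -> int:
    # Crude estimate (~4 characters per token); replace with a real tokenizer
    # such as tiktoken, plus the small per-message overhead the API adds.
    return len(message["content"]) // 4

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest user/assistant turns (FIFO) until the total fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and sum(map(count_tokens, system + turns)) > budget:
        turns.pop(0)  # oldest turn goes first
    return system + turns
```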

4. Practical Tips for Token Budgeting

  • Count before you send. Use tiktoken or the OpenAI tokenizer to estimate tokens (cookbook.openai.com).

  • Keep system prompts tight. They cost the same as user tokens (help.openai.com).

  • Cap max_tokens so the model can’t overflow the window even on long answers.

  • Stream & monitor usage to detect overruns early (cookbook.openai.com); a sketch covering the last two tips follows this list.
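Here is a minimal sketch of capping the completion and reading the usage report with the OpenAI Python SDK. The model name and the 4 096-token cap are only examples; adapt them to whatever you budgeted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",    # example model; use the one you budgeted for
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the context-window rules in one paragraph."},
    ],
    max_tokens=4_096,  # hard cap so the reply cannot overflow the window
)

# The usage block reports exactly what this turn consumed (and what you pay for).
u = response.usage
print(u.prompt_tokens, u.completion_tokens, u.total_tokens)
```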

5. Why Bigger Isn’t Always Better

Large windows invite the “lost in the middle” problem: models may ignore or blur details buried deep in the prompt. Retrieval-augmented or hierarchical prompting often beats brute-force stuffing once you approach tens of thousands of tokens.
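To illustrate the retrieval idea with a deliberately tiny toy (real systems use embeddings or a vector store, and every name here is made up for the example): rank candidate chunks by a crude relevance score and send only the top few, instead of pasting the whole corpus into the prompt.

```python
def score(chunk: str, query: str) -> int:
    """Toy relevance score: how many query words appear in the chunk."""
    return sum(word in chunk.lower() for word in query.lower().split())

def top_k_chunks(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Keep only the k most relevant chunks instead of stuffing everything."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]

# Usage idea: context = "\n\n".join(top_k_chunks(doc_chunks, user_question))
```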


Bottom Line

A 128 k-token LLM can juggle a lot of text, but every token counts every round. Treat the window like RAM: budget, monitor, and garbage-collect your history, or you’ll hit the ceiling and pay for it fast.

Best-practice cheat-sheet for a Cursor AI chat session

  1. User prompt (intent)
    • State exactly what you want in one or two sentences; every extra word costs tokens and dilutes signal (Cursor).
    • Reference code surgically with @code, @file, @folder so the model grabs only what matters (Cursor).
    • Open a new chat for each distinct task to avoid dragging old context forward (Cursor).

  2. Files in the context window (state)
    • Attach just the slice you’re working on; Cursor’s default chat cap is ≈ 20 k tokens, Cmd-K ≈ 10 k (Cursor Community Forum).
    • If a file is huge, send the key functions or lines, not the whole thing, or the model will prune unpredictably (Cursor).
    • Use @file path/to/foo.py:100-180 style snippets to keep noise down.

  3. .cursor/rules/*.mdc (persistent guidance)
    • Convert any instruction you find yourself repeating into an MDC rule; Cursor prepends it to every prompt for you (Cursor). A sample rule file appears after this cheat-sheet.
    • Keep each rule atomic (one concern), give it YAML front matter, and use the type that fits: Always, Auto Attached, Agent Requested, or Manual (Cursor; Cursor Community Forum).
    • Name and number files predictably (001-Security.mdc, 100-API.mdc) so humans and the IDE resolve clashes cleanly (Cursor Community Forum).

  4. Thinking & token burn
    • “Thinking” models double request cost and latency; toggle it only for hard reasoning tasks (Cursor).
    • Large-context mode (200 k on Claude 3 MAX, 128 k on GPT-4o MAX) also doubles price; disable when you don’t need the whole codebase (Cursor).
    • Watch the running token meter and start a fresh chat or summarize history before you breach the cap.

  5. Answers, tool calls, diffs
    • Ask the Agent to produce patch-style diffs (git-style) so you can review and apply selectively rather than rewriting whole files.
    • Limit tool-call chains to ≤ 25 per request; Cursor will prompt for confirmation after that threshold (Cursor).
    • Merge small, testable changes; rerun tests; then iterate. This keeps context small and feedback tight.

Follow this flow and you get maximum signal per token, predictable rule behavior, and clean, reviewable diffs every time.
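As referenced in item 3, here is a hypothetical .cursor/rules/001-security.mdc sketch. The front-matter fields shown (description, globs, alwaysApply) follow Cursor’s MDC rule format, but the values and the rule text are placeholders to adapt to your own project.

```
---
description: Baseline security rules for API handler code
globs: src/api/**/*.ts
alwaysApply: false
---

- Never log secrets, tokens, or full request bodies.
- Validate and sanitize all user input at the route boundary.
- Prefer parameterized queries; flag any string-built SQL.
```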
