TL;DR A context window is an LLM’s short-term working memory. A “128 k-token” model (≈ 96 k words, or roughly a 300-page book) can see at most 128 000 tokens at once, including every previous user message, assistant reply, the system prompt, and the new answer it is about to write. Tokens are counted and billed on every turn, so the history grows linearly until you trim or summarize it.
1. Tokens & Context Windows
- A token ≈ ¾ of an English word (1 token ≈ 4 characters).
- The context window is the total input + output the model can process in one go: its “working memory,” separate from its long-term pre-training data.
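To see the ¾-of-a-word rule of thumb in action, here is a minimal sketch using the tiktoken library (assuming the o200k_base encoding used by GPT-4o; exact counts vary by model and text):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding; other models differ

text = "The context window is the model's short-term working memory."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# Typical English prose works out to roughly 4 characters, or 3/4 of a word, per token.
```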
2. What “128 k Context” Means in Plain English
- OpenAI’s GPT-4o exposes a 128 000-token limit; Anthropic’s Claude 3 goes to 200 k.
- 128 k tokens ≈ 96 k words ≈ 300 paperback pages.
- You can reserve any slice of that window for the reply: e.g., if you want a 4 096-token completion, your prompt (system + history + current user message) must stay ≤ 123 904 tokens (see the sketch below).
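A tiny budgeting helper makes that arithmetic explicit (the 128 000-token window is the GPT-4o figure above; the helper name is purely illustrative):

```python
CONTEXT_WINDOW = 128_000  # GPT-4o's advertised limit

def max_prompt_tokens(reserved_for_reply: int, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the system prompt + history + new user message."""
    return window - reserved_for_reply

print(max_prompt_tokens(4_096))  # 123904, matching the example above
```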
3. How the 128 k Is Consumed During Chat
| Turn | What goes in | Running total (tokens) | Notes |
|---|---|---|---|
| 0 | System message (e.g. 500 t) | 500 | Sent with every call |
| 1 | + User #1 (1 000 t) | 1 500 | |
| 1 | → Assistant #1 (800 t output) | 2 300 | Output is stored and re-sent as future input |
| 2 | + User #2 (600 t) | 2 900 | |
| 2 | → Assistant #2 (800 t) | 3 700 | |
Each API call resends the entire history, so you pay for those tokens again. If the next prompt would push the total past 128 k, you must:
- Drop or summarize old turns (first-in-first-out); see the trimming sketch after this list.
- Compress them (tools like RAG or “middle-out” transforms).
- Move fixed, lengthy instructions to an external retrieval step.
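A minimal sketch of the first strategy, first-in-first-out trimming, is shown below. It assumes a plain list of chat messages and uses tiktoken only for counting; the few tokens of per-message overhead in real chat formats are ignored for brevity, and the function names are illustrative:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumption: GPT-4o-style encoding

def count_tokens(message: dict) -> int:
    """Rough token count for one {'role': ..., 'content': ...} message."""
    return len(enc.encode(message["content"]))

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest non-system turns until the history fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: list[dict]) -> int:
        return sum(count_tokens(m) for m in msgs)

    while rest and total(system + rest) > budget:
        rest.pop(0)  # FIFO: the oldest user/assistant turn goes first
    return system + rest
```

In practice you would summarize the dropped turns rather than discard them outright, but the budgeting logic is the same.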
4. Practical Tips for Token Budgeting
- Count before you send. Use `tiktoken` or the OpenAI tokenizer to estimate tokens (see the sketch after this list).
- Keep system prompts tight. They cost the same as user tokens.
- Cap `max_tokens` so the model can’t overflow the window even on long answers.
- Stream & monitor usage to detect overruns early.
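Putting the first and third tips together, a pre-flight check might look like the sketch below (OpenAI Python SDK v1 style; the +4 tokens of chat-format overhead per message is an approximation, so treat the count as an estimate):

```python
import tiktoken
from openai import OpenAI

WINDOW = 128_000
MAX_REPLY = 1_024

enc = tiktoken.get_encoding("o200k_base")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]

# Rough estimate: content tokens plus a few tokens of chat formatting per message.
prompt_tokens = sum(len(enc.encode(m["content"])) + 4 for m in messages)
assert prompt_tokens + MAX_REPLY <= WINDOW, "Trim or summarize the history first"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=MAX_REPLY,  # caps the reply so it cannot overflow the window
)
print(response.choices[0].message.content)
print(response.usage)  # actual prompt/completion token counts for monitoring
```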
5. Why Bigger Isn’t Always Better
Large windows invite the “lost in the middle” problem: models may ignore or blur details buried deep in the prompt. Retrieval-augmented or hierarchical prompting often beats brute-force stuffing once you approach tens of thousands of tokens.
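To make the comparison concrete, here is a deliberately naive retrieval sketch: instead of stuffing every document into the prompt, score chunks against the question and send only the best few. The keyword-overlap scorer stands in for a real embedding search and is purely illustrative:

```python
def score(chunk: str, question: str) -> int:
    """Crude relevance score: number of lowercase words shared with the question."""
    return len(set(chunk.lower().split()) & set(question.lower().split()))

def build_prompt(chunks: list[str], question: str, top_k: int = 3) -> str:
    """Keep only the top_k most relevant chunks instead of the whole corpus."""
    best = sorted(chunks, key=lambda c: score(c, question), reverse=True)[:top_k]
    context = "\n\n".join(best)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```

A production system would use embeddings and a vector index, but the budget effect is the same: the prompt stays small no matter how large the corpus grows.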
Bottom Line
A 128 k-token LLM can juggle a lot of text, but every token counts on every round. Treat the window like RAM: budget, monitor, and garbage-collect your history, or you’ll hit the ceiling and pay for it fast.