Context Window
A context window is the limited amount of information—measured in tokens—that a large language model can consider in a single inference call. For example, GPT-4 Turbo has a 128K-token context window, meaning it can process roughly 96,000 words at once. Anything beyond that limit is invisible to the model.
Context windows constrain what ephemeral context you can provide in a prompt. More importantly, they don't provide persistent memory—once the inference call ends, the model retains nothing. Each new call starts fresh unless external systems (like agent memory) persist context between calls.
The outcome is understanding that large context windows help within a session, but they don't solve multi-session memory, learning, or continuity—those require durable agent memory infrastructure.
Why it matters
- Defines within-call limits: Larger context windows let you fit more documents, messages, or context into one prompt—useful for complex reasoning tasks.
- Highlights persistence gap: Context windows are ephemeral—they reset after each call, making them insufficient for stateful agents or multi-session workflows.
- Affects token costs: Longer context windows consume more tokens, increasing API costs—optimizing what context you send saves money.
- Informs retrieval strategies: Since you can't send everything, you need retrieval systems (RAG, semantic search) to select relevant context within the window.
- Clarifies memory architecture: A big context window doesn't equal agent memory—you still need external storage, entity linking, and temporal tracking.
- Impacts latency: Larger context increases inference time—balancing context relevance vs. performance matters.
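The cost point above is easy to quantify. A minimal sketch, assuming a hypothetical rate of $3.00 per million input tokens (check your provider's actual pricing):

```python
# Hypothetical input-token price for illustration only.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

def estimate_input_cost(token_count: int) -> float:
    """Approximate dollar cost of sending `token_count` input tokens."""
    return token_count / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

# Filling a 128K window costs ~43x more per call than a lean 3K-token prompt.
print(f"${estimate_input_cost(128_000):.4f}")  # $0.3840
print(f"${estimate_input_cost(3_000):.4f}")    # $0.0090
```

At scale, trimming even a few thousand redundant tokens per call compounds into significant savings.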
How it works
Context windows operate through token-based attention mechanisms:
- Input Tokenization → The prompt (instructions + context + question) is converted into tokens. Each token corresponds to a word or word fragment—roughly four characters of English text.
- Attention Mechanism → The model computes attention over all tokens within the context window—every token can "see" every other token.
- Window Limit → If input exceeds the context window (e.g., 130K tokens for a 128K model), the excess is truncated or the call fails.
- Output Generation → The model generates a response based solely on what's in the context window for this call.
- State Reset → After the response, the model retains nothing. The next call starts with a fresh context window unless external systems reintroduce prior context.
This cycle means context windows enable rich single-turn reasoning but don't provide durable memory.
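The window-limit step can be sketched as a pre-flight check. This uses the common ~4-characters-per-token heuristic for English text (real tokenizers vary by model, so treat the count as approximate):

```python
CONTEXT_WINDOW = 128_000  # example limit, in tokens

def approx_token_count(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, reserved_for_output: int = 4_000) -> bool:
    """True if the prompt leaves room for the model's response tokens."""
    return approx_token_count(prompt) + reserved_for_output <= CONTEXT_WINDOW

print(fits_in_window("Summarize the conclusion."))  # True
print(fits_in_window("x" * 600_000))                # False: ~150K tokens exceeds the window
```

Checking before the call lets you truncate or retrieve instead of receiving an API error.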
Comparison & confusion to avoid
- Context window vs. agent memory: The window is ephemeral, per-call working space; agent memory is durable, external storage that survives across sessions.
- Context window vs. session memory: Session memory carries state across turns in one conversation; the window holds only what is sent in the current call.
- Bigger window ≠ no retrieval: Even 128K+ windows can't hold large corpora, so retrieval (RAG, semantic search) is still needed to select relevant context.
Examples & uses
Within-window document Q&A
User uploads a 50-page PDF (50K tokens). The full document fits in a 128K context window. User asks: "Summarize the conclusion." The model processes the entire PDF in one call—no retrieval needed.
Context window exceeded (requires retrieval)
User has 1,000 documents (5M tokens total). A 128K context window can't fit them all. A retrieval system (RAG) selects the most relevant 10 documents (60K tokens) to fit within the window. The model reasons over this subset, not the full corpus.
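A minimal retrieval sketch of this selection step, using keyword overlap as the relevance score (a production system would use embeddings and semantic search instead) and a greedy pack into a token budget:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def select_context(query: str, docs: list[str], budget: int) -> list[str]:
    """Rank docs by keyword overlap with the query, then pack the
    top-ranked ones into the token budget."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(d.lower().split())), d) for d in docs]
    scored = [(s, d) for s, d in scored if s > 0]  # drop irrelevant docs
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for _, doc in scored:
        cost = approx_tokens(doc)
        if used + cost <= budget:
            selected.append(doc)
            used += cost
    return selected

docs = [
    "OAuth 2.0 flows: authorization code, client credentials.",
    "Gardening tips for spring tomatoes.",
    "OAuth token refresh and expiry handling.",
]
print(select_context("how does oauth token refresh work", docs, budget=40))
```

The gardening document is filtered out, and the two OAuth documents are returned in relevance order—the model then reasons over this subset, not the full corpus.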
Multi-session agent (requires persistent memory)
Day 1: User asks agent to refactor code. Agent does it. Day 5: User says "add OAuth to that refactor." Without persistent memory, the agent doesn't know what "that refactor" refers to—the prior session is outside the current context window.
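The fix is an external store the agent writes to on Day 1 and reads from on Day 5. A minimal sketch, assuming a simple JSON key-value file—a real agent memory system (such as an agent memory platform) would add entity linking, temporal tracking, and semantic retrieval:

```python
import json
import os
import tempfile

class MemoryStore:
    """Persists facts to disk so a later session can recall them."""

    def __init__(self, path: str):
        self.path = path

    def remember(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)

    def recall(self, key: str):
        return self._load().get(key)

    def _load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

path = os.path.join(tempfile.gettempdir(), "agent_memory_demo.json")

# Day 1: the agent records what it did.
MemoryStore(path).remember("last_refactor", "extracted auth logic into auth_service.py")

# Day 5: a fresh session (new context window) resolves "that refactor".
print(MemoryStore(path).recall("last_refactor"))
```

Because the store lives outside the model, the Day 5 session can reintroduce the prior context into its window even though the model itself retained nothing.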
Best practices
- Select relevant context: Don't fill the window with noise—use retrieval systems to send only the most pertinent information.
- Monitor token usage: Track tokens per call to optimize costs and latency—remove redundant or low-value context.
- Chunk large documents: If a document exceeds the window, chunk it and process in parts or use retrieval to select relevant sections.
- Combine with agent memory: Use the context window for immediate task details, but store long-term memory externally for cross-session continuity.
- Don't re-send full history: Sending all prior messages in every prompt is wasteful—use session memory or agent memory to manage history efficiently.
- Test edge cases: What happens when context exceeds the window? Implement graceful truncation or retrieval fallback.
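The history-trimming and graceful-truncation practices above can be combined in one sketch: keep the system prompt, then keep the most recent messages that fit the budget, dropping the oldest first (token counts use the ~4-characters-per-token heuristic):

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(system: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the newest messages that fit the budget."""
    remaining = budget - approx_tokens(system)
    kept = []
    for msg in reversed(messages):  # walk newest-first
        cost = approx_tokens(msg)
        if cost > remaining:
            break  # graceful truncation: stop before overflowing
        kept.append(msg)
        remaining -= cost
    return [system] + list(reversed(kept))  # restore chronological order

history = [f"message {i}: " + "x" * 400 for i in range(50)]  # ~103 tokens each
trimmed = trim_history("You are a helpful assistant.", history, budget=1_000)
print(len(trimmed))  # far fewer than the original 50 messages
```

A production variant might summarize the dropped messages into session memory rather than discarding them outright.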
Common pitfalls
- Assuming context window equals memory: A large window helps within a call but doesn't make context persistent—you need agent memory for that.
- Overfilling the window: Cramming in all available context degrades model performance and increases cost—quality over quantity.
- No retrieval strategy: When content exceeds the window, you need retrieval—don't assume everything fits.
- Ignoring token costs: Larger windows are expensive—optimizing what you send reduces costs without sacrificing quality.
- Expecting cross-session memory: Each call is independent—context from a prior session is gone unless you explicitly store and retrieve it.
See also
- Agent Memory — Persistent memory that survives beyond context windows
- Ephemeral Context — Temporary prompt-based state within the window
- Session Memory — Context within a conversation or run
- Semantic Retrieval — Selecting relevant context to fit within the window
- Agent Memory Platform — Infrastructure for durable memory
See how Graphlit enables persistent memory beyond context windows → Agent Memory Platform
Ready to build with Graphlit?
Start building agent memory and knowledge graph applications with the Graphlit Platform.