Context is everything. For the last year, one of the easiest mistakes to make in AI has been to confuse a bigger context window with better memory.
They are not the same thing.
Yes, modern large language models can accept dramatically longer inputs than earlier generations. But “can take in more tokens” is not the same as “can reliably use everything it was given.” Research on long-context behavior has repeatedly shown that model performance drops when the relevant information sits in the middle of a long prompt rather than near the beginning or end. In other words, the context may be technically available while still being practically hard for the model to use well. [1]
That distinction matters because it explains three of the biggest reliability problems we still see in production AI systems:
- context window limits are real, even when the advertised window is large
- hallucinations are still common when the model is under-specified or poorly grounded
- long, multi-step workflows often fail because the model loses state, drops constraints, or makes one small mistake that compounds over time
The practical takeaway is simple: if you want dependable agentic AI, you cannot treat the model itself as the whole system. You need memory design, retrieval design, state management, and verification around it. Official agent-building guidance now frames agents as a composition of models, tools, orchestration, and state or memory rather than “just a prompt.” [2]
Context windows are capacity, not comprehension
A context window is the amount of text a model can process in a single pass. That sounds like memory, but it behaves more like a temporary working set.
The problem is not only that context is finite. The deeper problem is that using long context well is hard. In Lost in the Middle, researchers found a consistent U-shaped pattern: models perform better when relevant information appears near the beginning or the end of the input, and worse when the same information appears in the middle. The paper also found cases where GPT-3.5-Turbo performed worse with the relevant material buried in the middle than with no retrieved documents at all. [1]
That helps explain a lot of what practitioners call “forgetting” in long chats. Sometimes the model has not forgotten in the human sense. It is failing to retrieve, prioritize, or apply the right information at the right moment from a crowded prompt. Recent long-context work continues to treat this as an active research problem, not a solved one, especially when prompts get noisy, tasks become sequential, or the model must reason over long interaction histories. [1][3][4]
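One practical response to the lost-in-the-middle pattern is to control where retrieved material lands in the prompt. Below is a minimal sketch, assuming a retriever that returns (score, text) pairs; the function name and demo data are illustrative, not part of any particular library.

```python
# Minimal sketch: order retrieved passages so the highest-scoring ones sit at
# the edges of the prompt, where long-context models tend to use them best.
# `passages` is assumed to be a list of (score, text) pairs from any retriever.

def order_for_long_context(passages):
    """Place passages in a U-shape: best first, second-best last, and so on."""
    ranked = sorted(passages, key=lambda p: p[0], reverse=True)
    front, back = [], []
    for i, passage in enumerate(ranked):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

if __name__ == "__main__":
    demo = [(0.91, "policy excerpt"), (0.40, "old ticket"), (0.85, "API doc"),
            (0.15, "chat smalltalk"), (0.70, "runbook step")]
    for score, text in order_for_long_context(demo):
        print(f"{score:.2f}  {text}")
```

The lowest-signal passages end up in the middle, which is where a long-context model is most likely to ignore them anyway.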
There is also a hard engineering cost here. Long-context inference remains expensive and slow because attention costs rise sharply with sequence length, and newer long-context papers still describe quadratic attention cost as a major bottleneck for practical deployment. [5][6]
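To make the cost intuition concrete, here is a back-of-the-envelope sketch. The head and layer counts are illustrative placeholders rather than the configuration of any specific model, and real inference stacks use optimizations that change the constants, but the quadratic growth in attention scores is the point.

```python
# Back-of-the-envelope: the self-attention score matrix has one entry per
# (query, key) pair, so its size grows with the square of sequence length.
# Numbers below are illustrative, not measurements of any specific model.

def attention_matrix_entries(seq_len: int, num_heads: int = 32, num_layers: int = 32) -> int:
    return seq_len * seq_len * num_heads * num_layers

for tokens in (8_000, 32_000, 128_000):
    entries = attention_matrix_entries(tokens)
    print(f"{tokens:>7} tokens -> {entries:.2e} score entries per forward pass")

# Going from 8k to 128k tokens (16x longer) means roughly 256x more attention scores.
```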
Hallucination is not just “the model making things up”
Hallucination gets oversimplified in casual discussion. In practice, it covers several different failure modes.
A recent Computational Linguistics survey defines the issue broadly: LLMs can generate content that diverges from the user input, contradicts earlier generated context, or misaligns with established world knowledge. That is important because it means hallucination is not only about factual errors from thin air. It can also be a failure to stay consistent with the prompt, the conversation, or the external evidence the model was supposed to use. [7]
This is why bigger models and better prompting alone have not eliminated the problem. Hallucinations can come from missing knowledge, weak retrieval, context overload, reasoning drift, overconfident decoding, or the model simply not knowing when to abstain. A 2025 TACL survey on abstention argues that refusing to answer when uncertainty is high is increasingly recognized as one path to reducing hallucinations and improving safety. [8]
In other words, hallucination is not one bug. It is a family of reliability failures.
Why multi-step processes still break
Single-turn demos hide a lot.
As soon as you ask a model to complete a long chain of dependent actions, the failure modes multiply. The model has to remember prior constraints, keep track of intermediate outputs, decide when to retrieve more information, avoid repeating bad steps, and recover after mistakes. That is a lot to ask from a system with finite working context and imperfect internal state tracking.
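A simplified way to see why this matters: if each step succeeds independently with some probability, the chances of an error-free run shrink quickly as the chain gets longer. Real agents are not independent step machines, so treat this as intuition rather than a model of any specific system.

```python
# Simplified model: if each step succeeds independently with probability p,
# an n-step workflow succeeds end-to-end with probability p ** n.

per_step_success = 0.98
for steps in (5, 10, 30, 60):
    print(f"{steps:>3} steps at {per_step_success:.0%} each -> "
          f"{per_step_success ** steps:.1%} end-to-end")

# 30 steps at 98% per step already drops below 55% end-to-end.
```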
Recent research on long-horizon evaluation makes this point clearly. METR’s 2025 work proposes measuring models by the duration of tasks they can complete at a given success probability rather than by isolated benchmark questions. The reason is straightforward: long tasks expose reliability limits that short tests miss. [9]
Even more recent work goes further. The Long-Horizon Task Mirage?, a 2026 preprint, reports that as the horizon grows, planning-related and memory-related failures become dominant, and it argues that better base models alone are not enough: method-level improvements in planning, memory, and execution-time control are required. As a preprint it should be treated as emerging evidence rather than settled consensus, but it lines up with what many production teams already see. [10]
So when an AI agent falls apart halfway through a research workflow, a coding task, or a customer support flow, that is usually not one dramatic collapse. It is usually small state errors accumulating until the system no longer knows what matters.
What modern agentic AI systems do instead
This is the part that matters in practice.
The strongest agentic systems today do not try to stuff everything into one giant prompt and hope the model can reason its way through the mess. They reduce the burden on the context window itself.
1) They treat memory as a system, not a transcript
One of the worst defaults in agent design is storing everything and replaying everything.
Recent memory research argues that this is usually the wrong approach. The write path matters: storing every interaction verbatim adds noise, hurts retrieval precision, and makes later reasoning worse. Good memory systems filter low-signal content, normalize important facts, deduplicate overlapping records, and rank stored information by relevance and risk. [3]
That is a more useful mental model for AI memory than “save the whole chat.”
The practical pattern is a hierarchy:
- working memory for the current task
- episodic memory for what happened in prior runs
- semantic memory for durable facts, preferences, rules, or policies
- external storage for documents, logs, code, records, and tool outputs
The memory survey literature describes a real shift from prompt compression, to retrieval-augmented stores, to more deliberate learned and policy-based memory systems. It also notes that evaluation is moving from static recall tests to multi-session agentic benchmarks because remembering a fact is not the same as using it correctly later. [3]
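Here is a minimal sketch of that hierarchy with a filtered write path. The class names, relevance scoring, and deduplication are deliberately naive placeholders; a production system would use embeddings and real ranking, but the shape is the point: decide what to keep before you store it.

```python
# Minimal sketch of a tiered memory store with a filtered write path.
# Names and thresholds are illustrative.

from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    kind: str          # "working" | "episodic" | "semantic"
    text: str
    relevance: float   # 0.0 - 1.0, assigned by the write path

@dataclass
class AgentMemory:
    records: list = field(default_factory=list)
    min_relevance: float = 0.3   # drop low-signal content instead of storing it

    def write(self, kind: str, text: str, relevance: float) -> bool:
        """Filter, deduplicate, then store. Returns True if the record was kept."""
        if relevance < self.min_relevance:
            return False                                  # filter low signal
        if any(r.text == text for r in self.records):
            return False                                  # naive dedup
        self.records.append(MemoryRecord(kind, text, relevance))
        return True

    def recall(self, kind: str, top_k: int = 3) -> list:
        """Return the most relevant records from one tier, best first."""
        tier = [r for r in self.records if r.kind == kind]
        return sorted(tier, key=lambda r: r.relevance, reverse=True)[:top_k]

memory = AgentMemory()
memory.write("semantic", "User prefers responses in French.", 0.9)
memory.write("working", "Currently migrating invoices table.", 0.8)
memory.write("episodic", "Greeted the user.", 0.1)   # filtered out at write time
print([r.text for r in memory.recall("semantic")])
```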
2) They retrieve context just in time
This is one of the biggest changes in how serious teams build agents.
Instead of preloading huge amounts of material up front, modern systems increasingly use just-in-time context. Anthropic’s engineering guidance describes a pattern where agents keep lightweight references such as file paths, saved queries, or links, then load the needed context at runtime through tools. The goal is to keep only the smallest high-signal set of tokens in front of the model at each step. [11]
This is a much better answer to long workflows than blindly increasing prompt length.
Retrieval-augmented generation still matters, but the more mature version is not “dump vector-search results into the prompt.” It is targeted retrieval, structured context selection, and dynamic tool use when the next step actually needs more information. [2][11]
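A small sketch of what just-in-time context can look like in practice. The reference names, file path, and `load_file` helper are hypothetical stand-ins for whatever tools and storage an agent actually has; the idea is that the agent carries cheap pointers and loads content only when a step declares it needs it.

```python
# Sketch of just-in-time context: carry lightweight references (paths, notes,
# queries) and load only what the current step needs.

from pathlib import Path

def load_file(path: str, max_chars: int = 4_000) -> str:
    """Tool: read a referenced file at runtime, truncated to stay token-cheap."""
    return Path(path).read_text(encoding="utf-8")[:max_chars]

def build_step_context(step: dict, references: dict) -> str:
    """Assemble only the context this step declares it needs."""
    parts = [f"Goal: {step['goal']}"]
    for ref_name in step.get("needs", []):
        ref = references[ref_name]
        if ref["type"] == "file":
            parts.append(f"--- {ref_name} ---\n{load_file(ref['path'])}")
        else:
            parts.append(f"--- {ref_name} ---\n{ref['value']}")
    return "\n\n".join(parts)

references = {
    "schema": {"type": "file", "path": "db/schema.sql"},
    "ticket": {"type": "note", "value": "Ticket: migrate invoices table"},
}
step = {"goal": "Draft the migration script", "needs": ["schema", "ticket"]}
# prompt = build_step_context(step, references)  # loaded only when the step runs
```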
3) They separate planning, execution, and verification
A lot of multi-step failures happen because the same model is being asked to plan, act, remember, and judge its own output in one pass.
Modern agentic architectures reduce that burden by separating responsibilities. One component plans. Another executes tool calls. Another checks the result against the original goal, policy, or source material. OpenAI’s current agent documentation reflects this broader design view by emphasizing orchestration, tools, conversation state, compaction, and reasoning best practices as distinct concerns. [2]
This matters because verification is not optional when hallucination or memory drift is costly.
The basic rule is: do not ask the model, “Does this look right?” Ask, “Can this claim be traced to a source, a tool result, or a validated intermediate state?” That shift alone eliminates a surprising amount of brittle behavior. The hallucination literature increasingly frames detection and mitigation as a system problem, not a single-prompt problem. [7]
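A minimal sketch of that separation as a control loop, assuming a placeholder `call_llm` client and `run_tool` dispatcher that you would replace with your own model and tools. Execution goes through tools, and a separate verification pass checks each result against the goal before the loop continues.

```python
# Sketch of a plan / execute / verify loop. The model client and tool runner
# below are placeholders that return canned values so the structure is runnable.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder model client; replace with a real API call."""
    return "1. inspect input\n2. apply change" if role == "planner" else "PASS: grounded in tool output"

def run_tool(action: str) -> dict:
    """Placeholder tool runner; replace with real tool dispatch."""
    return {"action": action, "output": "ok"}

def run_task(goal: str, max_steps: int = 10) -> list:
    trace = []
    plan = call_llm("planner", f"Break this goal into concrete steps: {goal}")
    for step in plan.splitlines()[:max_steps]:
        result = run_tool(step)                      # execution is a tool call,
        verdict = call_llm(                          # not free-form generation
            "verifier",
            f"Goal: {goal}\nStep: {step}\nTool result: {result}\n"
            "Does the result satisfy the step? Answer PASS or FAIL with a reason.",
        )
        trace.append({"step": step, "result": result, "verdict": verdict})
        if verdict.startswith("FAIL"):
            break                                    # stop instead of compounding errors
    return trace

print(run_task("Migrate the invoices table"))
```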
4) They compress state, not just language
There is a bad version of summarization and a good version.
The bad version throws away detail and hopes the summary will be enough later.
The good version preserves the right abstractions: objective facts, open loops, commitments, unresolved questions, user constraints, tool outputs, and next actions. OpenAI’s agent guidance now explicitly includes compaction and conversation-state strategy as core context-management concerns, which reflects the fact that long-running agents need structured state reduction, not endless replay. [2]
In practice, that means turning raw interaction history into a small, updateable state object. The model should not have to rediscover the entire task from scratch every turn.
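A sketch of what such a state object might look like. The field names are illustrative; what matters is that the model receives a small, structured snapshot every turn instead of the full transcript.

```python
# Sketch of a compact, updateable task-state object: carry structure forward
# (facts, constraints, open loops) rather than replaying the whole history.

from dataclasses import dataclass, field

@dataclass
class TaskState:
    objective: str
    constraints: list = field(default_factory=list)    # e.g. "no downtime"
    facts: dict = field(default_factory=dict)           # verified, citable facts
    open_questions: list = field(default_factory=list)
    next_actions: list = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the state as a small block the model sees every turn."""
        return (
            f"Objective: {self.objective}\n"
            f"Constraints: {'; '.join(self.constraints) or 'none recorded'}\n"
            f"Known facts: {self.facts}\n"
            f"Open questions: {'; '.join(self.open_questions) or 'none'}\n"
            f"Next actions: {'; '.join(self.next_actions) or 'decide next step'}"
        )

state = TaskState(objective="Migrate the invoices table without downtime")
state.constraints.append("No schema changes during business hours")
state.facts["row_count"] = 48_200_000
state.next_actions.append("Draft backfill script")
# Each turn, send state.to_prompt() plus only the newest messages to the model.
print(state.to_prompt())
```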
5) They make the model earn confidence
One of the simplest ways to reduce bad answers is to stop forcing answers.
Abstention, uncertainty handling, and evidence requirements are becoming more important precisely because fluent wrong answers are often more dangerous than explicit uncertainty. The 2025 abstention survey argues that knowing when not to answer is a meaningful part of LLM reliability, not an edge feature. [8]
In production, this usually becomes a rule set:
- answer directly when evidence is strong
- retrieve when evidence is missing
- ask a tool when precision matters
- abstain or escalate when uncertainty remains high
That is not glamorous, but it works.
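Expressed as code, the rule set can be as plain as a decision function. The evidence score and thresholds below are placeholders; in a real system they might come from retrieval scores, verifier checks, or the model's own uncertainty estimate.

```python
# Minimal sketch of the rule set above as a decision function.
# Thresholds and the evidence score are illustrative placeholders.

def decide_action(evidence_score: float, needs_precision: bool) -> str:
    if needs_precision:
        return "call_tool"          # exact values: ask a calculator, DB, or API
    if evidence_score >= 0.8:
        return "answer"             # strong evidence: answer directly
    if evidence_score >= 0.4:
        return "retrieve"           # weak evidence: fetch more context first
    return "abstain_or_escalate"    # high uncertainty: say so or hand off

for score, precise in [(0.9, False), (0.9, True), (0.5, False), (0.1, False)]:
    print(score, precise, "->", decide_action(score, precise))
```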
6) They evaluate long-horizon behavior explicitly
A system that looks good on short prompts can still be weak in real workflows.
That is why newer evaluation work is moving toward task duration, multi-session memory, and agent trajectories rather than isolated question-answer benchmarks. METR’s time-horizon framing and newer long-horizon agent studies both point in the same direction: short benchmarks can overstate practical reliability on extended work. [9][10]
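In the spirit of that time-horizon framing, one simple evaluation pattern is to bucket task outcomes by how long the task is and report success per bucket instead of one aggregate score. The sketch below uses made-up data; it is an illustration of the reporting shape, not METR's methodology.

```python
# Sketch of horizon-aware evaluation: group task outcomes by task duration
# and report success per bucket. The data below is made up.

from collections import defaultdict

# (task_minutes, succeeded) pairs from some agent evaluation run
results = [(2, True), (5, True), (8, True), (15, True), (30, False),
           (45, True), (60, False), (90, False), (120, False)]

buckets = defaultdict(list)
for minutes, ok in results:
    label = "<15 min" if minutes < 15 else "15-60 min" if minutes <= 60 else ">60 min"
    buckets[label].append(ok)

for label in ("<15 min", "15-60 min", ">60 min"):
    outcomes = buckets[label]
    rate = sum(outcomes) / len(outcomes)
    print(f"{label:>10}: {rate:.0%} success over {len(outcomes)} tasks")
```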
So if you are building agentic AI, the useful question is not “Did it answer one prompt well?” It is “Can it stay coherent, grounded, and on-task across the full job?”
That is a much higher bar.
The real lesson
The current generation of LLMs is impressive, but the core limitations are still visible in production:
- long context is not the same as robust memory
- hallucination is still a systems problem, not a solved model problem
- complex multi-step workflows expose planning and memory weaknesses fast
The way around that is not magical. It is architectural.
The most reliable agentic AI applications now use a combination of selective memory, just-in-time retrieval, structured state, tool use, verification loops, and explicit long-horizon evaluation. That approach does not remove model limitations. It contains them. [2][3][10][11]
And that is the shift I expect to matter most over the next phase of AI deployment: not who can advertise the largest context window, but who can build the best system around a model that still has limits.
FAQ
What is a context window in an AI model?
A context window is the amount of text or data a model can process in a single pass. It is a working space, not durable memory. A larger context window increases capacity, but it does not guarantee the model will use all of that information well.
Why do large context windows still fail in practice?
Because capacity is not comprehension. Models often struggle to retrieve and prioritize the right information when prompts get long, especially when the most important material sits in the middle of the input rather than near the beginning or end.
What does “lost in the middle” mean?
It refers to a long-context failure pattern where models perform worse when relevant information is placed in the middle of a long prompt. Even if the information is technically inside the context window, the model may not use it reliably.
Are hallucinations just factual mistakes?
No. Hallucinations also include failures where the model contradicts the prompt, drifts away from earlier context, or produces claims that are not grounded in reliable evidence. In practice, hallucination is a broader reliability problem than “making things up.”
Why do multi-step AI workflows break so often?
Long workflows create more chances for failure. The model has to track constraints, remember intermediate results, decide when to retrieve more information, and recover from earlier mistakes. Small state errors can accumulate until the workflow no longer stays coherent.
Why is long-context inference expensive?
Long-context inference is costly because the cost of attention grows sharply with sequence length: the attention computation scales roughly quadratically with the number of tokens. That increases both compute cost and latency, which makes very long prompts slower and more expensive in production.
What is the difference between memory and a transcript in agentic AI?
A transcript is a raw record of what happened. Memory is a system for deciding what to keep, what to discard, and what to retrieve later. Strong agentic systems do not simply replay everything. They filter, structure, deduplicate, and rank information so the model sees what matters when it matters.
What does just-in-time retrieval mean?
Just-in-time retrieval means loading only the context needed for the current step instead of stuffing large amounts of material into the prompt up front. This keeps the prompt smaller, reduces noise, and improves the odds that the model uses the right information at the right moment.
Why should planning, execution, and verification be separated?
Because asking one model to do everything in one pass increases failure risk. More reliable agentic systems split responsibilities across planning, tool use, state tracking, and verification so that each step can be checked against evidence or validated outputs.
What does it mean to compress state instead of compressing language?
It means preserving the important structure of a task instead of just shortening the text. Good state compression keeps goals, constraints, unresolved questions, tool outputs, and next actions. That helps the system continue work without having to rediscover the task from scratch each turn.
How does abstention improve AI reliability?
Abstention improves reliability by allowing the model to decline, escalate, or seek more evidence when uncertainty is high. In many real-world settings, admitting uncertainty is safer and more useful than producing a fluent but unsupported answer.
What is the main takeaway for building dependable AI agents?
The model alone is not enough. Reliable agentic AI depends on system design around the model: selective memory, structured state, targeted retrieval, tool use, verification loops, and evaluation that tests performance across full workflows rather than isolated prompts.
Sources
[1] Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, Liang. Lost in the Middle: How Language Models Use Long Contexts. TACL, 2024.
[2] OpenAI. Building agents and Running agents documentation, including guidance on tools, orchestration, sessions, state, compaction, and context management.
[3] Du et al. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers. arXiv, 2026. Survey and engineering discussion on write paths, retrieval precision, and the shift from static recall to multi-session agent benchmarks.
[4] Recent long-context memory benchmarking work, including Benchmarking and Enhancing Long-Term Memory in LLMs, arXiv, 2026. This is newer evidence and should be read as emerging rather than settled.
[5] TokenSelect: Efficient Long-Context Inference and Length Generalization for LLMs. 2025. Describes long-context degradation and quadratic attention cost as practical bottlenecks.
[6] Star Attention: Efficient LLM Inference over Long Sequences. 2025. Reinforces the cost and latency problem for long-sequence inference.
[7] Zhang et al. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. Computational Linguistics, 2025.
[8] Wen et al. A Survey of Abstention in Large Language Models. TACL, 2025. Frames refusal or abstention as a reliability and hallucination-mitigation strategy.
[9] METR. Measuring AI Ability to Complete Long Tasks. 2025. Introduces task-completion time horizon as a real-world capability measure.
[10] Wang et al. The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break. arXiv, April 2026. Recent preprint; useful but still emerging evidence.
[11] Anthropic. Effective context engineering for AI agents. 2025. Practical guidance on just-in-time context and token-efficient agent design.
