Why AI Models Forget Things - The Simple Truth About Context Windows
When you interact with a large language model (LLM), you’re not chatting with an all-knowing mind. You’re engaging with a system that has a very real memory limit. That limit is called a context window. It's the box that decides how much the model can ‘remember’ and work with at any one time.
If your conversation stays inside the box, you get coherent, accurate responses. If you spill over, the model starts to lose track of earlier points. That's when replies can get shaky.
What Is a Context Window, Really?
Think of it like this: you're carrying a small notepad. You can write down key points from a meeting, but the pad only has a few pages. If the meeting is short, you can keep track of everything. If it drags on, you’ll have to start erasing earlier notes to make room for new ones. Same with LLMs. The context window is their ‘notepad’.
Once it’s full, the model can't see the older parts anymore. It starts relying on guesswork rather than actual memory. That’s why keeping conversations or inputs within the window matters.
Tokens: The Units That Fill the Window
Now, what fills up the context window isn't words, exactly. It's tokens. A token can be a whole word, part of a word, or even a few words combined.
For example:
"Running" might be broken into "run" and "ning."
"The" is usually one token.
"Artificial intelligence" might sometimes be tokenised as two or three units depending on the model.
A rough guide: 100 English words comes to somewhere around 130–150 tokens, depending on the tokenizer.
The tool that chops text into tokens is called a tokenizer. Different models have slightly different ways of slicing up language, but the principle stays the same.
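Want to see it for yourself? Here's a minimal sketch using the open-source tiktoken library (one tokenizer among many; other models slice text their own way, but the idea is identical):

```python
import tiktoken  # pip install tiktoken

# Load a tokenizer (this encoding is used by several OpenAI models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Artificial intelligence is reshaping how we work."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
# Show exactly how the text was sliced up.
print([enc.decode([t]) for t in token_ids])
```

Run the same sentence through a different tokenizer and you'll usually get a slightly different count, which is why the words-to-tokens ratio above is only a rule of thumb.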
How Big Is a Context Window?
Context window sizes have exploded. Early models like GPT-2 worked with a window of roughly 1,000 tokens. Modern heavyweights like Claude 3 and IBM Granite 3 push past 100,000 tokens.
Sounds massive, but it's not just your conversation filling the space. System prompts, documents, code snippets, instructions — they all count. It doesn’t take much to chew through 128,000 tokens if you’re attaching reports, API data, or long conversations.
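To make that concrete, here's a rough budget check. It's just a sketch: count_tokens stands in for whatever tokenizer your model uses (the tiktoken example above would do), and 128,000 is the limit from the example.

```python
def window_usage(system_prompt, documents, history, count_tokens, limit=128_000):
    # Everything the model sees counts against the same budget:
    # the system prompt, attached documents, and the conversation so far.
    used = count_tokens(system_prompt)
    used += sum(count_tokens(doc) for doc in documents)
    used += sum(count_tokens(turn) for turn in history)
    print(f"{used:,} of {limit:,} tokens used ({used / limit:.0%})")
    return used <= limit
```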
Why Attention Matters
Here’s where it gets interesting.
LLMs use something called a self-attention mechanism to process tokens. Rather than reading left to right like we do, self-attention allows the model to look across the whole input and figure out what parts are important to each other.
Imagine trying to follow a novel. You don’t just remember the last sentence; you tie together characters, earlier events, and clues from different chapters. That’s what self-attention lets LLMs do — at lightning speed.
The catch? The more tokens you feed it, the more relationships the model has to map. And mapping every token against every other token gets very expensive, very quickly.
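Strip away the learned projection matrices a real transformer uses, and the heart of self-attention is just this all-pairs comparison. A toy sketch in NumPy:

```python
import numpy as np

def toy_self_attention(X):
    # X is an (n_tokens, d) array of token vectors.
    d = X.shape[1]
    # Score every token against every other token: an (n, n) matrix.
    scores = (X @ X.T) / np.sqrt(d)
    # Softmax each row so the scores become attention weights.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each output is a weighted mix of every token vector in the input.
    return weights @ X
```

That (n, n) score matrix is the whole cost story: the work grows with the square of the number of tokens.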
The Hard Truth About Long Context Windows
Bigger context windows aren't always better.
Here’s why:
Quadratic Costs: Doubling the number of tokens doesn't just double the compute the attention step needs; it roughly quadruples it (see the quick arithmetic below). That means slower speeds, bigger server costs, and higher failure rates.
Performance Drops: A 2023 study ("Lost in the Middle", Liu et al.) showed models perform best when important info is at the start or end of the input. Bury key details halfway through a massive input, and the model struggles, just like you'd struggle to find one note buried on page 400 of a 1,000-page book.
Security Risks: Longer contexts mean more room for trouble. Malicious instructions could hide deep inside a long input, bypassing safety filters. The bigger the window, the harder it is to catch everything.
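The quadratic point is easy to check with a few numbers. Each doubling of the input quadruples the token pairs the attention step has to score:

```python
for n in (4_000, 8_000, 16_000, 32_000):
    print(f"{n:>6} tokens -> {n * n:>15,} token pairs")
# 4,000 tokens is 16 million pairs; 32,000 tokens is over a billion.
```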
The Balancing Act
There’s no perfect answer. It’s about finding a workable middle ground:
- Enough room to handle real conversations or documents.
- Small enough to stay fast, affordable, and secure.
- Structured well so key information isn’t buried.
Good design, not just size, makes the difference. One way to keep key information from getting buried is sketched below.
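This leans on the start-and-end effect from the study above; the function and prompt wording are hypothetical, just to show the shape:

```python
def build_prompt(question, documents):
    # State the task up front, put the bulky material in the middle,
    # and restate the actual question at the end, where attention is strongest.
    header = (
        "Answer the question using only the documents below.\n"
        f"Question: {question}\n"
    )
    body = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    footer = f"\n\nReminder: the question to answer is: {question}"
    return header + "\n" + body + footer
```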
Why This Matters to You
If you’re building tools with LLMs — chatbots, code assistants, document analyzers — you need to understand context windows.
Push past the limit, and the model starts forgetting. Stay inside it, and you get coherent, smart outputs. Knowing where that line is makes everything work better.
When you hear about "128k context" models, remember: it’s not a magic wand. It’s a bigger box. And how you use the space inside it is what really matters.
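In practice, knowing where that line is often means trimming old turns before every call. A minimal sketch, assuming an OpenAI-style message list and a count_tokens helper like the one earlier:

```python
def trim_history(messages, count_tokens, max_tokens=8_000):
    # Always keep the system prompt, then walk backwards through the
    # conversation, keeping the most recent turns that still fit.
    system, turns = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system["content"])
    kept = []
    for turn in reversed(turns):
        cost = count_tokens(turn["content"])
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return [system] + kept[::-1]
```

Anything that falls outside the trimmed list is exactly what the model 'forgets'.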
FAQs on LLM Context Windows
What is a context window?
It’s the model’s short-term memory. It defines how much conversation or input the model can ‘see’ at once. Overfill it, and earlier parts are lost.
What is a token?
A token is the basic chunk of text an LLM processes. Could be a character, a piece of a word, a whole word, or a short phrase.
How big are today’s context windows?
Leading models handle 128,000 tokens or more. But that space gets eaten up fast by system prompts, documents, and instructions.
Why not make context windows infinitely big?
Because the compute needed for attention scales quadratically with input length: double the tokens and you roughly quadruple the work. Bigger contexts also risk performance drops and give malicious instructions more places to hide.
How does attention fit in?
Self-attention is how models map relationships between all tokens. It's what lets them "understand" language rather than just regurgitating it.
Is bigger always better?
No. Good design, smart input management, and context awareness beat brute force every time.
Want to go deeper? Join our 5 Day AI Bootcamp, where we go through advanced prompting techniques like chain-of-thought prompting and context setting using your own uploaded assets.