This is Part 1 of a three-part series examining why long context is not equivalent to long-term memory in modern language models. Here, we focus on why attention-based systems forget even when context windows grow dramatically. The next parts will introduce a memory-first framework and analyse how the Titans architecture approaches long-term memory explicitly.
Long-context models are everywhere now. The marketing message is simple: if a model can read more tokens, it can “remember” more. That sounds reasonable, but it is the wrong mental model. A bigger context window mostly turns a model into a better reader, not a better rememberer. The distinction matters because many real-world tasks are not about reading everything; they are about keeping what matters and using it later without constantly re-scanning a massive history.
This post introduces the core problem behind the Titans line of work from Google Research: attention is an excellent short-term memory mechanism, but it is not a complete memory system. Titans starts from this premise and proposes a way to introduce long-term memory without relying on quadratic attention over the entire past.
The False Promise of “Just Increase the Context Window”
Transformers are built around attention. Attention works by comparing queries to keys across the tokens provided in the context window and retrieving values weighted by similarity. This mechanism can feel like memory because the model can “look back” and reuse earlier information. In reality, however, the model is only conditioning on what is currently visible; it repeatedly consults the context rather than storing information in a durable internal memory.
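To make this concrete, here is a minimal sketch of scaled dot-product attention in NumPy (masking and batching omitted; the variable names are illustrative, not taken from any particular implementation). The point to notice is that every output is a similarity-weighted mixture of the values currently in the window, recomputed from scratch on every call; nothing persists between calls.

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention over whatever is in the window.

    queries: (n, d), keys: (n, d), values: (n, d_v).
    The model "remembers" only by re-reading keys/values each time;
    nothing is stored outside this window.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the window
    return weights @ values                           # similarity-weighted values
```

The (n, n) score matrix is also where the quadratic cost comes from: doubling the context quadruples the work.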
As context length increases, the model can consult more text, but relevance becomes harder to isolate. The longer the history, the more distractors exist, and the easier it becomes for retrieval to miss the one detail that matters. This explains the common gap between “the model can technically see the information” and “the model reliably uses the information.”
In short, long context improves access, but it does not guarantee retention.

Illustration showing quadratic growth of attention cost as context length increases
Attention as Memory: Useful, but Incomplete
The Titans paper frames attention as an associative memory block: tokens are stored as key–value pairs and retrieved through similarity search. This explains why Transformers perform so well on many sequence tasks. However, it also clarifies the limitation: the model’s output is conditioned only on dependencies inside the current context window, which is fundamentally bounded.
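A rough way to see the “bounded” part: in a sliding-window setup, keys and values that fall outside the window are simply gone. The toy cache below is my own illustration, not an implementation detail from the paper.

```python
from collections import deque

class SlidingWindowKVCache:
    """Toy KV cache with a hard capacity: attention can only consult
    what is still inside the window; evicted entries are unrecoverable."""

    def __init__(self, capacity: int):
        self.keys = deque(maxlen=capacity)
        self.values = deque(maxlen=capacity)

    def write(self, k, v):
        # Appending past capacity silently evicts the oldest pair.
        self.keys.append(k)
        self.values.append(v)

    def readable(self):
        # Everything the model can still "remember" right now.
        return list(self.keys), list(self.values)
```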
Titans draws a clear conceptual line. Attention behaves like short-term memory: high-fidelity, flexible, and powerful, but constrained by window size and computational cost. Long-term memory requires different properties: persistence over time, selective storage, and the ability to retain useful abstractions without keeping every past token accessible through attention.
This distinction is not philosophical; it is architectural. Forcing attention to handle long-term storage as well increases compute cost without delivering reliable recall.

Illustration showing short-term attention combined with a persistent long-term neural memory
Why “Efficient” Transformers and Linear Models Still Struggle
One response to attention’s scaling limits is to replace softmax attention with linear or kernel-based alternatives. While these approaches reduce computational complexity, they often behave like linear recurrent models that compress history into a fixed-size state. This makes them efficient, but also introduces information loss.
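A hedged sketch of that fixed-size-state behaviour, using the common recurrent form of linear attention (the feature map and notation are generic, not the specifics of any one paper): all history is folded into a single matrix state, so adding tokens never grows the state, and older details survive only if the compression happens to preserve them.

```python
import numpy as np

def elu_plus_one(x):
    # A common positive feature map used in linear-attention variants.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_stream(queries, keys, values):
    """Process tokens one at a time with a fixed-size state.

    S is (d, d_v) and z is (d,) no matter how long the stream is:
    every past token is compressed into these two arrays.
    """
    d, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d, d_v))
    z = np.zeros(d)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = elu_plus_one(k)
        S += np.outer(phi_k, v)                    # "write": fold this token into the state
        z += phi_k
        phi_q = elu_plus_one(q)
        out = (phi_q @ S) / (phi_q @ z + 1e-6)     # "read": query the compressed state
        outputs.append(out)
    return np.array(outputs)
```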
The contradiction is clear: these models are most appealing when context is very long, yet very long histories are difficult to compress without losing important details. As a result, two imperfect strategies dominate today:
Full-attention Transformers retain rich access to recent context but are expensive and bounded. Linear or recurrent variants scale efficiently but risk forgetting critical information due to compression.
Titans is motivated by this tension. Efficiency and reliable recall clash when only a single memory mechanism is used.
Illustration contrasting “full attention: accurate but expensive” with “compressed recurrence: efficient but forgetful”
The Memory Perspective That Leads to Titans
A key contribution of the Titans work is its memory-centric view of sequence modelling. Models are described in terms of two operations: writing (or updating memory) and reading (or retrieving from memory). Recurrent models write by compressing history into a hidden state. Transformers write by appending keys and values to the context. Retrieval then happens either by reading the hidden state or by attending to the stored keys.
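As a rough illustration of this read/write lens (the interface below is my own sketch, not an API from the paper), both families fit the same two operations; they differ only in what “write” keeps around.

```python
import numpy as np

class RecurrentMemory:
    """Write = compress the new token into a fixed-size hidden state."""
    def __init__(self, d):
        self.h = np.zeros(d)
        self.W = np.random.randn(d, d) * 0.1
        self.U = np.random.randn(d, d) * 0.1

    def write(self, x):
        self.h = np.tanh(self.W @ self.h + self.U @ x)

    def read(self):
        return self.h                        # everything lives in h

class AttentionMemory:
    """Write = append the token's key/value; read = attend over all of them."""
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def read(self, q):
        K, V = np.stack(self.keys), np.stack(self.values)
        s = K @ q
        w = np.exp(s - s.max())
        w /= w.sum()
        return w @ V                          # cost grows with every write
```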
Seen through this lens, the important questions shift. Instead of asking which architecture is best, we ask how memory should be structured, how it should update, how it should retrieve information, how it should forget, and how multiple memory modules can be combined so each handles what it does best.
This framing naturally leads to Titans: an architecture in which attention remains the short-term memory, and a separate module is responsible for long-term memory.
What This Means for Real Systems
For systems that involve continuous reasoning, long-running agents, extended conversations, log analysis, long documents, or time-series data, the limitation appears quickly. Increasing context length helps, but retrieval becomes fragile as the haystack grows. This is why larger windows often improve demos without fully solving reliability.
Titans is compelling because it does not claim that attending to everything is sufficient. Instead, it argues for an architecture that explicitly incorporates long-term memory and manages retention over extended horizons while remaining computationally practical.
Unresolved Questions
Part 1 deliberately leaves one major question unanswered: what should a long-term memory module look like, and how should it decide what to store and forget? The Titans paper addresses this later using a neural long-term memory updated at test time via a “surprise” signal and adaptive forgetting.
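Without getting ahead of Part 3, the flavour of that mechanism can be sketched loosely: treat the long-term memory as a small model, measure “surprise” as the error it makes on the current key–value pair, nudge it toward correcting that error, and decay it over time so it can forget. The code below is only a conceptual sketch under those assumptions, not the paper’s exact update rule.

```python
import numpy as np

class ToyLongTermMemory:
    """Linear associative memory updated at test time.

    Surprise = how badly the memory currently reconstructs v from k;
    the update writes more when surprise is high, and a decay term
    lets old content fade so the memory can forget.
    """
    def __init__(self, d, d_v, lr=0.1, decay=0.01):
        self.M = np.zeros((d_v, d))
        self.lr, self.decay = lr, decay

    def update(self, k, v):
        error = self.M @ k - v                 # surprise signal
        grad = np.outer(error, k)              # gradient of 0.5 * ||M k - v||^2
        self.M = (1 - self.decay) * self.M - self.lr * grad

    def read(self, q):
        return self.M @ q
```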
These mechanisms will be explored in Part 3. Before that, Part 2 introduces a broader memory lens that explains why forgetting and retention emerge from design choices rather than bugs.
Conclusion: We Don’t Have a Context Problem, We Have a Memory Problem
Attention is an exceptional tool for short-range dependency modelling, but treating it as the entire memory system forces trade-offs that do not disappear with larger context windows. Truly scalable long-context systems require dedicated long-term memory mechanisms, not just longer scrollback.
In Part 2, we will make this memory-first framing explicit so that Titans appears as a logical architectural step rather than an isolated idea.
