The Impossibility of a Perfect Memory Token

To get a truly continuous personal agent, we’d love a model with infinite memory. A tempting idea is to fold long histories into a single perfect memory token without changing any future logits. Under softmax attention, however, this is mathematically impossible.

The Quest for Infinite Memory #

If an AI is going to act as a lifelong companion—remembering every preference, every inside joke, every past decision—it needs to carry a huge personal history through every step of inference. Context windows help, retrieval helps, summaries help—but they all forget. The real dream is infinite context.

Here’s the thought experiment that motivates this post:

  1. Take a long prefix of tokens (your history) and compress it into one surrogate token.
  2. Keep generating as usual. If the surrogate is perfect, every future output, hidden states and logits alike, is bit‑for‑bit the same as if we’d kept the full prefix.
  3. If that worked, we could repeat it recursively, folding yesterday into a token today, last week into a token this week—achieving infinite memory without blowing up compute.

Unfortunately, this is impossible for vanilla softmax attention. Here is why.

Lossless Compression #

For any given query $\mathbf{q}$ (representing what a new token is “looking for”), the attention output is a weighted average of all the value vectors $\mathbf{v}_i$ in the sequence. The weights come from a softmax over the dot products of the query with each key vector $\mathbf{k}_i$, divided by the square root of the key dimension, $\sqrt{d_k}$.

For a single head, the formula is:

$$\text{Attn}(\mathbf{q}) = \frac{\sum_{i=1}^T e^{(\mathbf{q}\cdot\mathbf{k}_i) / \sqrt{d_k}}\,\mathbf{v}_i}{\sum_{i=1}^T e^{(\mathbf{q}\cdot\mathbf{k}_i) / \sqrt{d_k}}}$$

For the remainder of the proof we drop the $1/\sqrt{d_k}$ scaling. It is a fixed positive constant that could just as well be absorbed into the keys, so it changes neither the logic nor the conclusion.
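
To make the object we’re reasoning about concrete, here is a minimal NumPy sketch of this formula for a single query. Everything in it (the `attn` helper, the sizes, the random data) is made up for illustration, and the $1/\sqrt{d_k}$ factor is dropped to match the convention above.

```python
import numpy as np

def attn(q, K, V):
    """Single-head attention output for one query q.
    K: (T, d_k) keys, V: (T, d_v) values.
    The 1/sqrt(d_k) factor is dropped, matching the convention above."""
    scores = K @ q                           # dot products q . k_i, shape (T,)
    weights = np.exp(scores - scores.max())  # max subtracted only for numerical stability
    weights /= weights.sum()                 # softmax over the sequence
    return weights @ V                       # weighted average of the values

# Toy example with made-up sizes and random data.
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
K, V = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_v))
q = rng.normal(size=d_k)
print(attn(q, K, V))
```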

Now, let’s say we want to compress the first $n$ tokens into a single surrogate token with key $\mathbf{k}_s$ and value $\mathbf{v}_s$. For this to be “lossless,” the attention output for any query $\mathbf{q}$ from a token after the compressed block must be identical.

This means the following equality must hold for every $\mathbf{q}$ coming from a token at a position after $n$:

$$ \frac{\left(\sum_{t=1}^n e^{\mathbf{q}\cdot\mathbf{k}_t}\,\mathbf{v}_t\right) + \left(\sum_{j=n+1}^T e^{\mathbf{q}\cdot\mathbf{k}_j}\,\mathbf{v}_j\right)}{\left(\sum_{t=1}^n e^{\mathbf{q}\cdot\mathbf{k}_t}\right) + \left(\sum_{j=n+1}^T e^{\mathbf{q}\cdot\mathbf{k}_j}\right)} = \frac{e^{\mathbf{q}\cdot\mathbf{k}_s}\,\mathbf{v}_s + \left(\sum_{j=n+1}^T e^{\mathbf{q}\cdot\mathbf{k}_j}\,\mathbf{v}_j\right)}{e^{\mathbf{q}\cdot\mathbf{k}_s} + \left(\sum_{j=n+1}^T e^{\mathbf{q}\cdot\mathbf{k}_j}\right)} $$

The left side is the attention over the original, full history; the right side is the attention with the first $n$ tokens replaced by the surrogate. The sums over $j = n+1, \dots, T$ are the uncompressed tokens, and they appear unchanged on both sides.

For the compressed history to be truly “lossless,” two conditions must be met for any query $\mathbf{q}$ that comes from a token after the compressed block.
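
Before proving anything, it helps to watch the equality fail for an obvious candidate. The sketch below is a made-up experiment: the surrogate is simply the mean of the compressed keys and values (my arbitrary choice, not a principled one), and the two sides of the equation are compared directly.

```python
import numpy as np

def attn(q, K, V):
    w = np.exp(K @ q)
    return (w @ V) / w.sum()

rng = np.random.default_rng(1)
T, n, d = 10, 6, 4
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
q = rng.normal(size=d)

# Naive surrogate: the mean key and mean value of the compressed block.
k_s, v_s = K[:n].mean(axis=0), V[:n].mean(axis=0)
K_c, V_c = np.vstack([k_s, K[n:]]), np.vstack([v_s, V[n:]])

full = attn(q, K, V)              # left-hand side: full history
compressed = attn(q, K_c, V_c)    # right-hand side: surrogate + uncompressed tail
print(np.max(np.abs(full - compressed)))   # clearly nonzero
```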

The Denominators Must Match #

For compression to be lossless, every future token must keep the same attention weight.

For a future token $j$,

$$\text{weight}_j = \frac{e^{\mathbf{q}\cdot\mathbf{k}_j}}{\text{sum of all exponentiated scores}}$$

The numerator is unchanged (since $\mathbf{q}$ and $\mathbf{k}_j$ aren’t part of the compressed block). So the denominator must also be identical.

  • Original Denominator: $(\sum_{t=1}^n e^{\mathbf{q}\cdot\mathbf{k}_t}) + (\sum_{j=n+1}^T e^{\mathbf{q}\cdot\mathbf{k}_j})$
  • Compressed Denominator: $e^{\mathbf{q}\cdot\mathbf{k}_s} + (\sum_{j=n+1}^T e^{\mathbf{q}\cdot\mathbf{k}_j})$

For the attention weights on the future tokens to be the same, these two denominators must be equal. Since the second part of the sum is identical in both cases, we can cancel it out, which leaves us with our first condition:

$$e^{\mathbf{q}\cdot\mathbf{k}_s} = \sum_{t=1}^{n} e^{\mathbf{q}\cdot\mathbf{k}_t}$$

The total “attentional pull” from the original history block must be perfectly captured by the single surrogate token.
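
Here is a tiny numeric illustration with made-up keys and queries: we can always choose a $\mathbf{k}_s$ that captures the block’s pull for one particular query, but the same $\mathbf{k}_s$ then misses for a different query.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 4
K = rng.normal(size=(n, d))               # keys of the history block
q1, q2 = rng.normal(size=d), rng.normal(size=d)

def lse(q):
    # log of the block's total exponentiated score: log(sum_t exp(q . k_t))
    return np.log(np.exp(K @ q).sum())

# Choose k_s so the condition holds exactly for q1:  q1 . k_s = lse(q1).
k_s = q1 * lse(q1) / (q1 @ q1)

print(q1 @ k_s - lse(q1))   # ~0: the condition holds for q1
print(q2 @ k_s - lse(q2))   # generally far from 0: it fails for q2
```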

The Numerators Must Match #

If the denominators are equal, then for the final output of the attention (the whole fraction) to be equal, the numerators must also be equal.

  • Original Numerator: $(\sum_{t=1}^n e^{\mathbf{q}\cdot\mathbf{k}_t}\,\mathbf{v}_t) + (\sum_{j=n+1}^T e^{\mathbf{q}\cdot\mathbf{k}_j}\,\mathbf{v}_j)$
  • Compressed Numerator: $e^{\mathbf{q}\cdot\mathbf{k}_s}\,\mathbf{v}_s + (\sum_{j=n+1}^T e^{\mathbf{q}\cdot\mathbf{k}_j}\,\mathbf{v}_j)$

Again, we can cancel the identical “future” part of the sum. This gives us our second condition:

$$e^{\mathbf{q}\cdot\mathbf{k}_s}\,\mathbf{v}_s = \sum_{t=1}^{n} e^{\mathbf{q}\cdot\mathbf{k}_t}\,\mathbf{v}_t$$

The total weighted value contribution from the history block must be perfectly captured by the surrogate.
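
Both conditions can in fact be met if we are allowed to tailor the surrogate to one known query, which is precisely what a static KV-cache entry cannot do. The sketch below (arbitrary sizes and data) builds such a query-specific surrogate and shows that it reproduces the full attention output exactly for that query and for no other.

```python
import numpy as np

def attn(q, K, V):
    w = np.exp(K @ q)
    return (w @ V) / w.sum()

rng = np.random.default_rng(3)
T, n, d = 10, 6, 4
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
q1, q2 = rng.normal(size=d), rng.normal(size=d)

# Tailor the surrogate so that it satisfies BOTH conditions, but only at q1.
e = np.exp(K[:n] @ q1)
k_s = q1 * np.log(e.sum()) / (q1 @ q1)   # condition 1: exp(q1 . k_s) = sum_t exp(q1 . k_t)
v_s = (e @ V[:n]) / e.sum()              # condition 2: the matching weighted value

K_c, V_c = np.vstack([k_s, K[n:]]), np.vstack([v_s, V[n:]])

print(np.max(np.abs(attn(q1, K, V) - attn(q1, K_c, V_c))))   # ~0: exact for q1
print(np.max(np.abs(attn(q2, K, V) - attn(q2, K_c, V_c))))   # nonzero: breaks for q2
```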

Why These Conditions Are Impossible to Satisfy #

Now, let’s see why a single, static surrogate token $(\mathbf{k}_s, \mathbf{v}_s)$ cannot satisfy these two conditions for all possible queries $\mathbf{q}$.

The Denominator Problem: Linear vs. Convex #

Let’s look at the first condition again:

$$e^{\mathbf{q}\cdot\mathbf{k}_s} = \sum_{t=1}^{n} e^{\mathbf{q}\cdot\mathbf{k}_t}$$

If we take the natural logarithm of both sides, we get:

$$\mathbf{q}\cdot\mathbf{k}_s = \log\left(\sum_{t=1}^{n} e^{\mathbf{q}\cdot\mathbf{k}_t}\right)$$

This equation must hold for any query $\mathbf{q}$. Here’s the problem:

  • The left side, $\mathbf{q}\cdot\mathbf{k}_s$, is a linear function of $\mathbf{q}$. Its graph is a flat plane.
  • The right side is the Log-Sum-Exp of the scores $\mathbf{q}\cdot\mathbf{k}_t$. It is convex in $\mathbf{q}$ and curves upward in every direction along which the keys differ. Its graph is a curved bowl.

A flat plane can at best be tangent to this bowl; it cannot coincide with it for all $\mathbf{q}$ unless the bowl is itself flat, which happens only when all the keys $\mathbf{k}_t$ are identical. Even in that degenerate case the right side equals $\mathbf{q}\cdot\mathbf{k} + \log n$, and no dot product $\mathbf{q}\cdot\mathbf{k}_s$ can reproduce the constant $\log n$ offset once $n > 1$. So there is no single key vector $\mathbf{k}_s$ that satisfies this condition.
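
Numerically, the best a plane can do is approximate the bowl. A small sketch (arbitrary dimensions, random probe queries) fits the best possible $\mathbf{k}_s$ in the least-squares sense and shows that the fit error never reaches zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 5, 4, 200
K = rng.normal(size=(n, d))                  # keys of the history block
Q = rng.normal(size=(m, d))                  # many probe queries

lse = np.log(np.exp(Q @ K.T).sum(axis=1))    # the bowl: log-sum-exp per query

# The best possible "plane": least-squares fit of q . k_s to the bowl.
k_s, *_ = np.linalg.lstsq(Q, lse, rcond=None)
print(np.abs(Q @ k_s - lse).max())           # clearly nonzero: no plane matches
```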

The Numerator Problem: A Query-Dependent Value #

Even if we could somehow solve the denominator problem, the second condition gives us another impossibility. If we substitute the first condition into the second one:

$$ (\sum_{t=1}^{n} e^{\mathbf{q}\cdot\mathbf{k}_t}) \mathbf{v}_s = \sum_{t=1}^{n} e^{\mathbf{q}\cdot\mathbf{k}_t}\,\mathbf{v}_t $$

Solving for our surrogate value $\mathbf{v}_s$, we get:

$$ \mathbf{v}_s = \frac{\sum_{t=1}^{n} e^{\mathbf{q}\cdot\mathbf{k}_t}\,\mathbf{v}_t}{\sum_{t=1}^{n} e^{\mathbf{q}\cdot\mathbf{k}_t}} $$

Look closely at the right side. It’s a weighted average of the original value vectors $\mathbf{v}_t$, where the weights depend on the query $\mathbf{q}$.

This means that $\mathbf{v}_s$ would have to change depending on which query $\mathbf{q}$ is looking at it. But the entries in the KV cache, including our surrogate’s $(\mathbf{k}_s, \mathbf{v}_s)$, are static: they are computed once and then reused for every subsequent query.

A single, fixed $\mathbf{v}_s$ cannot satisfy this equation for all possible queries.
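
A quick sketch of that query dependence, with made-up keys and values: evaluating the right-hand side for two different queries yields two different “required” surrogate values.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 4, 3
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))

def required_v_s(q):
    # The right-hand side above: a query-weighted average of the values.
    w = np.exp(K @ q)
    return (w @ V) / w.sum()

print(required_v_s(rng.normal(size=d)))   # one "required" surrogate value ...
print(required_v_s(rng.normal(size=d)))   # ... and a different one for another query
```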

Intuitive Picture #

Take just two tokens $(\mathbf{k}_1,\mathbf{v}_1)$ and $(\mathbf{k}_2,\mathbf{v}_2)$:

  • For queries near $\mathbf{k}_1$, attention should emphasize $\mathbf{v}_1$.
  • For queries near $\mathbf{k}_2$, it should emphasize $\mathbf{v}_2$.

A single surrogate would have to become $\mathbf{v}_1$ in one case and $\mathbf{v}_2$ in another. That is impossible unless $\mathbf{v}_1 = \mathbf{v}_2$, and even then the denominator condition still fails, because one key cannot carry the attention mass of two.
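
Plugging numbers into this two-token picture (orthogonal keys and values, chosen purely for clarity) shows how far the “required” surrogate value swings:

```python
import numpy as np

k1, v1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])
k2, v2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])
K, V = np.stack([k1, k2]), np.stack([v1, v2])

def required_v_s(q):
    w = np.exp(K @ q)
    return (w @ V) / w.sum()

print(required_v_s(5.0 * k1))   # ~[0.99, 0.01]: essentially v1
print(required_v_s(5.0 * k2))   # ~[0.01, 0.99]: essentially v2
```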

Conclusion #

  • We only asked for the final ratio, the attention output, to stay the same. But because the equality has to hold regardless of what the later, uncompressed tokens contribute, the numerator and the denominator are forced to match separately.
  • Each of those conditions is impossible to satisfy for every query once the history block contains more than one token.

Therefore, lossless token fusion is impossible under vanilla softmax attention. To move forward, we need new mechanisms: linear attention, query-adaptive surrogates, or external memory systems.
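
As a contrast, here is a minimal sketch of why the first of those directions escapes the proof: in linear (kernelized) attention the exponential is replaced by a feature-map dot product $\phi(\mathbf{q})\cdot\phi(\mathbf{k})$, and the whole prefix then folds exactly into a fixed-size running state. The feature map and sizes below are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def phi(x):
    # A simple positive feature map, elu(x) + 1, as used in kernelized attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attn(q, K, V):
    S = phi(K).T @ V            # running matrix: sum_i phi(k_i) v_i^T
    z = phi(K).sum(axis=0)      # running vector: sum_i phi(k_i)
    return (phi(q) @ S) / (phi(q) @ z)

rng = np.random.default_rng(6)
T, n, d = 12, 8, 4
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
q = rng.normal(size=d)

# Fold the first n tokens into a fixed-size state (S, z) ...
S, z = phi(K[:n]).T @ V[:n], phi(K[:n]).sum(axis=0)
# ... then add the uncompressed tail. The output matches the full computation exactly.
S_full, z_full = S + phi(K[n:]).T @ V[n:], z + phi(K[n:]).sum(axis=0)
folded = (phi(q) @ S_full) / (phi(q) @ z_full)

print(np.max(np.abs(folded - linear_attn(q, K, V))))   # ~0, up to float error
```

The catch, of course, is that this is no longer softmax attention: the impossibility above is a statement about the exponential kernel specifically.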