bycloud
Why can’t LLMs just LEARN the context window?
Channel: bycloud
Date: 2026-03-30
Duration: 12min
Views: 38,057
URL: https://www.youtube.com/watch?v=P9uNy71YukQ

Check out HubSpot's FREE 2026 Guide to AI Agents: https://clickhubspot.com/3972be

In this video, I'll be breaking down a new approach to long-context LLMs called test-time training (TTT-E2E), where models store past context directly in their weights instead of relying on attention or KV caches. Kind of like meta learning, but with gradient descent.
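The core mechanic here, pushing past context into the weights with gradient descent at inference time, can be sketched in a few lines. Below is a toy, attention-free model I made up purely for illustration (the sizes, data, and learning rate are all arbitrary, nothing from the paper): after each prediction it takes one SGD step on the next-token loss, so by the time a pattern repeats, its earlier occurrences are literally stored in the weights.

```python
import numpy as np

# Toy sketch of test-time training (TTT): instead of attending over past
# tokens, the model absorbs them into its weights with gradient steps.
rng = np.random.default_rng(0)
V, D = 8, 16                        # vocab and hidden sizes (arbitrary)
W_in = rng.normal(0, 1.0, (V, D))   # fixed token embeddings
W_out = rng.normal(0, 0.1, (D, V))  # the layer we adapt at "inference"

def ttt_step(prev_tok, next_tok, lr=0.5):
    """Predict next_tok from prev_tok, then take one SGD step so the
    transition is stored in the weights. Returns the pre-update loss."""
    global W_out
    h = W_in[prev_tok]
    logits = h @ W_out
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[next_tok] + 1e-12)     # cross-entropy before update
    dlogits = p.copy()
    dlogits[next_tok] -= 1.0                # softmax cross-entropy gradient
    W_out -= lr * np.outer(h, dlogits)      # this update IS the memory
    return loss

# A repeating token stream: a frozen bigram model would pay the same loss
# on every pass, but with TTT the later passes are already in the weights.
stream = [1, 3, 5, 2] * 3
losses = [ttt_step(a, b) for a, b in zip(stream, stream[1:])]
first_pass, last_pass = sum(losses[:3]), sum(losses[-3:])
print(first_pass > last_pass)
```

A frozen model would see every repetition as if for the first time; with the update step, the loss on the third pass drops well below the first. That is the spirit of the sanity check discussed later in the video, just at toy scale.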

my latest project: Intuitive AI Academy

We just wrote a new piece on MoE!

https://intuitiveai.academy/

limited time code "EARLY" for 40% off yearly

One of the big focuses for LLMs in 2026 is solving long context. And just in the past few months, I have already covered many cool new approaches different labs are coming up with to address this bottleneck, from linear attention and sparse attention to RNN-based methods. But they all share a major trade-off: the need to drop full attention over the context in some way. Linear attention compresses it, sparse attention ignores some parts, and RNN-based methods decay it all eventually, impacting the performance in the end. In contrast, true full attention is actually lossless; it can practically retrieve literally anything it has ever seen. But since full attention scales quadratically, making it pretty much unreasonable to use for long context, are we guaranteed some sort of trade-off for the rest of eternity? Well, instead of looking for a free lunch in attention that might not even exist, what if we look somewhere else? A very common theme in fictional AI is that these AIs learn on the fly: they actively adapt to their environment and remember things clearly from a long time ago. While we don't

have the exact technology capable of executing this right now, engineering efforts have definitely been able to make something that resembles this behavior. For example, having a database that stores a summary of all your previous conversations and having the LLM actively retrieve or reference it. Of course, this is definitely not a good way to store information, because a summary is itself lossy and sits on an even higher level. So maybe something on a lower level that directly remembers information would be much better than manipulating high-level information with duct tape. And you know what is commonly known for storing information in a language model? The MLP component within a transformer layer. So what if, instead of making attention focus on every single token in the context window, we have the model store information actively? Then there's no longer a need for a long context, and maybe not even for a KV cache. And before I dive into it: given how desperately the entire research field is trying to solve long context, it doesn't mean you're left with hopeless dreams when trying to get your AI agents to work with what you have right now. And

if that sounds like a struggle you have right now, then this free AI Agents Unleashed playbook from HubSpot is here to help you build AI agents in 2026. Understanding the complex agentic AI ecosystem right now is more complicated than it should be, especially when most things look smart in a demo but fall apart the moment they touch real workflows. And this playbook cuts through that. In it, they lay out what agents are, where they actually excel today, and how to roll them out without overengineering your first attempt. Their core rule is simple: use humans for judgment and agents for execution, so people stay accountable for strategy and quality while agents handle the time-consuming parts like research, drafting, formatting, and optimization. They also cover the most common pitfalls, like over-automating too fast or setting unrealistic expectations, so you actually get it right the first time. What I like the most is the step-by-step implementation roadmap. It takes you from identifying your first use case all the way to measuring the impact, so you're not just waiting around for the research to catch up. So if you want a clean roadmap to get started with AI agents for your business in 2026, check out this free playbook and HubSpot's AI-powered CRM using the link down in

description. And thank you HubSpot for sponsoring this video. Anyways, with the intense adoption of LLMs today, you can expect the model to carry a longer system prompt, tool-use references, a user's chat history, and other functional injections that an engineering team adds to steer the model towards specific behaviors. So, as the load on the context window rises, effective long-context capability has become more sought after than ever. But with efficient attention solutions hitting a performance bottleneck at extreme context lengths, with full attention scaling quadratically in both compute and memory, and in this economy, the field definitely needs a way out. So in this paper, called "End-to-End Test-Time Training for Long Context," they proposed a fascinating idea: if you could store all the information from the context window directly into the model's weights, wouldn't that solve the problem? So instead of paying attention to every token in the full context, you pay attention only to the most recent chunk of tokens and push the rest into the model's memory. While this high-level idea

has already been proposed in many different ways, it was also implemented in many different ways, with most of them compressing attention or modifying attention into a memory system. This paper, however, pushes the modification into the feed-forward networks inside transformer blocks. A key difference between compressing by updating the MLP versus compressing with attention-like mechanisms is that the shape of the memory is completely different. You no longer store an index table of associations; you compress whatever seems useful into the parameters of the forward pass through gradient updates. In other words, the memory is the adapted weights, not an external indexed KV cache. Well, the theoretical idea is cool, but practically there is some stuff we need to verify first in order to establish an equivalence, where storing the tokens in the model's memory is equivalent to paying attention to those tokens. So, the paper started with a sanity check. They basically made a toy experiment of a transformer without any attention blocks and only the feed-forward network layers, aka a model that is basically a bigram model. In that setting, the best the model can do is to condition on one token and then

output the next most likely token. And when you put in the next token, it doesn't have any memory of the last token, because in this case the model is frozen. But by applying test-time training, where after making a prediction the model makes a learning update and stores the new token into its weights, you can theoretically condition not just on one token, but on all previous tokens, as the weights have now learned all the tokens that were previously passed through. So in their earlier experiment, they explored the question: does TTT actually work as an attention replacement? And it turns out yes, it is almost as good as full attention. But here comes the first practical problem. Even if this idea is proven, doing gradient descent so the model can learn every token at runtime is going to be slow, expensive, and hard to parallelize. The only reasonable solution to this is to have the model learn the tokens in batches. So the learning is not done at every token, but after a set amount of tokens. However, that would actually create a problem where the model would revert back into a bigram within the batch.

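To make that failure mode concrete, here's a tiny sketch (the batch size is an arbitrary number I picked, not the paper's setting): with weight updates happening only at batch boundaries, an attention-free model predicting a token mid-batch simply cannot see the tokens right before it.

```python
BATCH = 4  # tokens per test-time weight update (illustrative)

def absorbed_context(t):
    """Positions already stored in the weights when predicting position t:
    only fully completed batches have been learned."""
    return list(range((t // BATCH) * BATCH))

# Predicting position 7: positions 0..3 are in the weights, but 4..6,
# the tokens immediately before it, are invisible to the model.
missing = [p for p in range(7) if p not in absorbed_context(7)]
print(absorbed_context(7))  # -> [0, 1, 2, 3]
print(missing)              # -> [4, 5, 6]
```

Within the current batch, the model is back to predicting from the lone current token, which is exactly the bigram collapse.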
For example, if you have a sentence like this and have the model predict the next token. Let's say the model learns every four tokens, aka a batch of four. Then it would just output something completely unrelated to the sentence, because the model has not even learned the immediate four tokens into its feed-forward network to have the context. So, batching is necessary for efficiency, but batching breaks the mechanism, and we are stuck between a rock and a hard place. That is, until you look outside the box. The researchers came up with a very fascinating idea: applying sliding-window attention over the batch that is not yet learned by the model, to cover the lack of attention. This is very clever, because the batch size will always stay constant, and the sliding window's compute and memory cost is also constant. And you can set the window size to the batch size, which is just perfect. So now the idea is consistent: local context is achieved with windowed attention, long-range context is compressed into updated MLP weights, and batching makes it computationally feasible. This is the key mechanism of TTT-E2E, short for test-time training end-to-end. But why is it called end to

end? Because unlike previous approaches that incorporate a separate state with a different objective, for example attention-replacement layers that learn to reconstruct values from keys, TTT-E2E uses the actual language modeling objective end to end, where next-token prediction directly drives the model's weight updates. So it's like a pipeline where the entire test-time training behavior can be pre-trained into the model at scale, where everything is learned implicitly, not separately, because we're also teaching the MLP how it should learn and be used as context. So pre-training is now kind of like meta-learning, and inference is now more like test-time training. But even if test-time training is good, is it actually that good? Because you're literally updating the entire model. And shouldn't you leave some knowledge untouched from the pre-training so it retains key behaviors? Or maybe training too much on the model's own generated data may lead to model collapse. So they experimented with different MLP update depths and ratios. For a 760-million-parameter, 24-layer model, they tried updating the

last half, 1/4, or 1/8 of the layers, and also the extreme of updating only the final layer. In their results, if you update one to three layers, the method doesn't compare with full attention. But when you update six or 12 layers, it does scale, and updating 12 is not meaningfully better than six. So they kind of settled on updating the last 1/4 of the blocks. As for the context length of the model: because the training data are typically 8,000 tokens long, they set the batch to 1,000 tokens. Any larger batch would directly hurt the performance, so it was a pretty easy default to choose. The sliding window, on the other hand, doesn't really matter, because it's full attention within that specific window. So no matter how big you set the window, it'll perform pretty well; it's mostly constrained by the token batch size. And now to the exciting part. On its own, the results for TTT-E2E are genuinely strong. In the long-context language modeling experiments, it tracks full attention's scaling surprisingly closely, and even holds a consistent loss advantage across context lengths while keeping decoding costs bounded by the fixed attention

window rather than the prefix. And the theory stays consistent when they pushed to extreme lengths, like a 128,000-token context window. The model didn't fall apart just because the context got huge; it behaved as the researchers hoped and just stretched out much better. And when you look at figure 6 at 32,000, the left panel breakdown makes the gap between TTT-E2E and full attention look small near the end of the window, which makes it tempting to conclude that maybe full attention catches up and eventually wins if you go longer. But the right panel at 128,000 tokens says otherwise. Instead of crossing, the plot mostly looks like a stretched version of the 32,000 breakdown. So it seems like TTT-E2E's advantage doesn't disappear; it actually extends. And that stretching lines up with figure 1, where the advantage is maintained across context lengths rather than converging away. While the main appeal of TTT-E2E is definitely the lower losses across the board, that may also be an illusion, because the model can be completely up to date on all the information, creating a loss that is as up to date as the present.

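As a quick aside, the consistency claim from earlier (weights cover the distant past, the sliding window covers the current batch, and nothing falls through the cracks) is easy to sanity-check in code. A sketch with arbitrary sizes, not the paper's hyperparameters:

```python
BATCH = 4        # tokens per test-time weight update (illustrative)
WINDOW = BATCH   # sliding-window span, set equal to the batch size

def coverage(t):
    """Where the context for position t comes from under this scheme."""
    in_weights = set(range((t // BATCH) * BATCH))   # completed batches
    in_window = set(range(max(0, t - WINDOW), t))   # local attention
    return in_weights, in_window

for t in range(64):
    w, a = coverage(t)
    assert w | a == set(range(t))   # every past position is covered
    assert len(a) <= WINDOW         # attention cost stays constant
print("no gaps, bounded attention")
```

The second assert is the whole economic argument: however long the context grows, attention only ever spans the fixed window, while everything older lives in the updated weights.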
And maybe that's why it's capable of getting lower loss. And if you look at table 2, the loss looks great, but downstream performance, especially the kind that depends on precise retrieval, can be disappointing compared to full attention. So that is the catch: lowering average next-token loss does not guarantee you reliably recover a rare needle token from far back, because full attention can explicitly look it up, while TTT-E2E is still technically compressing, just into the weights instead of into attention. And it might be even worse than compressing attention, because it is compressing without indexing, so the needle can get confused or discarded from its origin even though the model has seen the information. Also, practically, we wouldn't know how well this actually scales to larger parameter counts. But I'd say this method definitely sounds promising, and a rather fresh direction to explore long-context extension through continual learning. And at the current trend of context-hungry AI development, where pretty much everything from RL to tool use is about context window size, if test-time training does make it, it'll effectively make the context window infinite and completely change the landscape of long

horizon LLMs. And this is the huge upside potential of TTT-E2E, as any other method would still be stuck recalling things using attention. So yeah, that's it for this video. And if you liked how I explained the AI concepts today, you should definitely check out my latest project, Intuitive AI Academy, which contains intuitive explanations of all modern LLMs from the ground up, ranging from LLM architectures and LoRA to how MoEs work. A total of 24 chapters are currently available, and it will be updated monthly. This is the start of a series where I'll break down AI topics intuitively, because I genuinely think anyone can understand them, no matter how difficult they may seem. So, for those who want to get into AI or LLMs, this should be the perfect place for you to dive into the technical parts without being intimidated by crazy-looking maths. And right now, I am also putting out a new launch discount for 2026, so you can use the code "EARLY" for 40% off a yearly plan. And thank you guys for watching. A big shout out to Spam Match, Chris Leoo, Degan, Robert Zaviasa, Marcelo, Ferraria, Poof, and Enu DX Research Group, Alex Midwest Maker, and many

others that support me through Patreon or YouTube. Follow me on Twitter if you haven't, and I'll see you in the next one.