❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers
📝 The TurboQuant paper is available here:
https://arxiv.org/abs/2504.19874
Reproductions:
https://github.com/tonbistudio/turboquant-pytorch
https://www.reddit.com/r/LocalLLM/comments/1s6edoi/turboquant_implementation/
https://www.reddit.com/r/LocalLLaMA/comments/1s73yby/implemented_turboquant_in_python_over_weekend/
https://x.com/AlicanKiraz0/status/2038245538865275274
I'll note that I have found several reproductions so far.
Google made a huge announcement about their new
method that lets us run AI systems more cheaply. The news took the world by storm.
This came at the best possible time, because we have a worldwide memory
shortage. So the prices for capable laptops, GPUs, and anything that can run
these AI systems are up by…insane amounts. And this work would make them much
cheaper to run. They call it TurboQuant. Roughly speaking, they claim 4 to 6
times less memory, that is insane. And 8 times faster computation for a part
of the neural network called attention. No meaningful loss in output quality. And
it works on top of existing models as is. If true, that is a total game changer.
The news was so big it even moved the stock prices of major semiconductor companies.
Because of that, I did not want to publish an
early video on the huge sensation.
No. I really wanted to wait a bit, and find out whether it actually
works in practice. I’ll tell you about that. And I’ll also tell you
that not everyone is happy about it. So, three questions. What does it do? Does
it work? What is the controversy about? Dear Fellow Scholars, this is Two Minute
Papers with Dr. Károly Zsolnai-Fehér. It feels so good to do it like this. Well,
this compresses the KV cache of AI systems, like large language models. This is the
short-term memory of an AI assistant. If you looked inside it, you would see
tons and tons of numbers. These numbers relate to what you are currently talking about:
movies, a bunch of documents, or a huge codebase.
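To get a feel for how big this short-term memory can get, here is a quick back-of-the-envelope sketch in Python. The model shape below is a hypothetical Llama-7B-style configuration, purely for illustration:

```python
# Rough KV cache size: 2 tensors (keys and values) per layer, each of shape
# [context_length, num_heads * head_dim], stored as 16-bit floats.
# These model dimensions are illustrative, not taken from the TurboQuant paper.
layers, heads, head_dim = 32, 32, 128
context_length = 128_000       # a huge codebase or a pile of documents
bytes_per_number = 2           # float16

kv_bytes = 2 * layers * context_length * heads * head_dim * bytes_per_number
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # ~62 GiB before any compression
```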
Now, personally, as a research scientist, what caught my eye was not the media hype, but this.
Oh! A formal mathematical proof that it works. Now we're talking. Okay, question one: what does it do?
And these numbers have lots of digits. Scientists
propose that we chop off the end of the numbers to save memory. Is that a new idea? No.
Is that a good idea? No, unless you are very careful, because you can lose a lot of information
and your neural network might output nonsense.
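To make "chopping off the end of the numbers" concrete, here is a minimal sketch of generic round-to-nearest quantization. This is the textbook version, not TurboQuant's exact scheme:

```python
# A minimal sketch of "chopping off digits": each float is snapped to the
# nearest point on a coarse grid with 2**bits levels.
import numpy as np

def quantize(x, bits=4):
    """Map floats to small integer codes on a uniform grid."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate floats from the integer codes."""
    return codes * scale + lo

x = np.random.default_rng(0).standard_normal(8).astype(np.float32)
codes, lo, scale = quantize(x)
print("max rounding error:", np.abs(x - dequantize(codes, lo, scale)).max())
```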
So, how do you do that carefully? Well, imagine a vector. This is like an arrow pointing somewhere. Sometimes that arrow points mostly along one
axis. So most of its "energy" is in one direction, and a little in other directions.
When you chop that information off, it snaps onto the grid, and you basically
lose everything except that one direction. That is not useful. Now here’s a brilliant idea:
before chopping it off, rotate the arrow in a random direction. Now the energy spreads
more evenly across all directions. So when
you round off parts of it, you lose a little from
everywhere instead of everything from most places. The result? Much less information lost.
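Here is a tiny sketch of that trick. A random orthogonal matrix stands in for the rotation; fast implementations typically use structured transforms instead:

```python
# A vector whose "energy" sits in one dominant direction quantizes badly,
# because the outlier stretches the grid. Rotating first spreads the damage.
import numpy as np

rng = np.random.default_rng(0)

def quantize_roundtrip(x, bits=4):
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

d = 256
x = rng.standard_normal(d)
x[0] = 50.0  # one huge outlier direction

err_plain = np.linalg.norm(x - quantize_roundtrip(x))

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random rotation
err_rotated = np.linalg.norm(x - Q.T @ quantize_roundtrip(Q @ x))

print(f"error without rotation: {err_plain:.2f}")
print(f"error with rotation:    {err_rotated:.2f}")  # noticeably smaller
```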
Is this idea new? No. This is a very old idea. Now they do one more thing. They use a Johnson–Lindenstrauss transform to
compress the data. What is that? Remember, we have a bunch of numbers,
representing arrow directions. And we want fewer numbers to describe these
directions. But very carefully. You do this in a way that guarantees that the
distances between these arrows stay roughly the same after squishing. If you want to sound
really cool, just call it the JL transform.
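Here is what the squishing looks like in a minimal sketch. A plain Gaussian random projection stands in here; the paper's exact transform may differ:

```python
# Johnson–Lindenstrauss-style random projection: describe d-dimensional
# arrows with only k numbers, while pairwise distances stay roughly intact.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 1024, 256, 4            # original dim, squished dim, number of arrows

X = rng.standard_normal((n, d))               # a few random arrows
P = rng.standard_normal((k, d)) / np.sqrt(k)  # the JL projection matrix
Y = X @ P.T                                   # k numbers per arrow instead of d

before = np.linalg.norm(X[0] - X[1])
after = np.linalg.norm(Y[0] - Y[1])
print(f"distance before squishing: {before:.2f}, after: {after:.2f}")  # close
```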
Is that new? Not really. It is a 40-year-old technique. And I think that is the key. Everyone loves to invent shiny new
stuff. But here, quantization is not new,
rotating things around is not new. This
transform is not new. These are three age old ideas combined together to great
effect. Sometimes you don’t need to invent grand new theories. Sometimes you need
a smart combination of existing methods.
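Chained together, the recipe looks something like the sketch below. This mirrors the three ingredients as described above, under my own assumptions, not the paper's exact algorithm:

```python
# Rotate, squish with a JL projection, then quantize to a few bits.
import numpy as np

rng = np.random.default_rng(0)
d, k, bits = 256, 64, 4

R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random rotation
P = rng.standard_normal((k, d)) / np.sqrt(k)      # JL projection

def compress(x):
    z = P @ (R @ x)                # spread the energy, then squish dimensions
    levels = 2 ** bits - 1
    lo, hi = z.min(), z.max()
    scale = (hi - lo) / levels
    codes = np.round((z - lo) / scale).astype(np.uint8)
    return codes, lo, scale        # k tiny integers plus two floats

x = rng.standard_normal(d)
codes, lo, scale = compress(x)
print(f"{d} float32 values ({4 * d} bytes) -> {k} {bits}-bit codes "
      f"(~{k * bits // 8} bytes packed)")
```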
Okay, second big question: does it work in practice? Before concluding that it works, I wanted to see other scientists reproduce
the technique and benchmark it for themselves. This is why this video appears later than most
others, but I think it makes it more truthful. So were other scientists able to
reproduce this technique? Yes. Did they also benchmark it? Yes.
Does this technique help? Yes. But, not so fast! The first tests reveal that
it decreased the memory cost of the KV cache, the short-term memory, by 30-40%. That is fantastic.
I would have been very happy with this. But it
doesn’t end there. Typically you have a tradeoff
where you decrease memory usage at the cost of something. So something needs to slow down.
Now hold on to your papers, Fellow Scholars, because it also sped up processing the
prompts by about 40% as well. What? That is…my brain crashed. We get faster
AI assistants that need less memory at almost zero cost. That is insane. In a world
where it’s harder and harder to own things, this is a blessing. Thank you so much! It is also remarkable that the paper has
barely been out for a week and some of you Fellow Scholars already coded it up.
Nice work. Links are in the description. So, it's not quite like the media says.
Based on the results, we cannot conclude that every AI machine suddenly needs 6 times
less RAM. No. That is a bit idealistic and only
true for some corner cases. You know when
you see an official benchmark of a phone battery or electric car mileage with somewhat
idealized conditions? It is a bit like that. So careful with the media hype. Experienced
Fellow Scholars like you know that in your mind, you have to tone these numbers down a
little. This is why we wait for more data and analyze experiments here, to get
the highest quality information for you. But it’s still good. Really good!
It helps most people who run AI systems with very long contexts. When you
chuck in a huge PDF document, or a movie, or a huge codebase for an AI to analyze. Yes,
you will be able to do that cheaper, with meaningfully less memory. Often a few gigabytes
less. And I think that is absolutely amazing news. Third, I will note that other researchers
point out that the paper overlaps with previous
techniques. They felt that it has similarities
that should be discussed more thoroughly. There was more. Eventually, the paper was accepted
for publication, though not all researchers agree the concerns were fully addressed. I put the
links to all of these in the video description. But this shows that even in modern AI, there
are still simple ideas waiting to be combined to great effect. And that makes this a very exciting
area to be in. What a time to be alive! And if you agree that this is
the right way to talk about papers, please consider subscribing
and hitting the bell icon.