❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers
📝 The TurboQuant paper is available here:
https://arxiv.org/abs/2504.19874
Reproductions:
https://github.com/tonbistudio/turboquant-pytorch
https://www.reddit.com/r/LocalLLM/comments/1s6edoi/turboquant_implementation/
https://www.reddit.com/r/LocalLLaMA/comments/1s73yby/implemented_turboquant_in_python_over_weekend/
https://x.com/AlicanKiraz0/status/2038245538865275274
I'll note that I have found several reproductions so far.
Google made a huge announcement about their new
method that lets us run AI systems more cheaply. The news took the world by storm.
This came at the best possible time, because we have a worldwide memory
shortage. So the prices for capable laptops, GPUs, and anything that can run
these AI systems are up by…insane amounts. And this work would make them much
cheaper to run. They call it TurboQuant. Roughly speaking, they claim 4 to 6
times less memory, that is insane. And 8 times faster computation for a part
of the neural network called attention. No meaningful loss in output quality. And
it works on top of existing models as is. If true, that is a total game changer.
The news was so big it even moved the stock prices of major semiconductor companies.
Because of that, I did not want to publish an
early video on the huge sensation.
No. I really wanted to wait a bit, and find out whether it actually
works in practice. I’ll tell you about that. And I’ll also tell you
that not everyone is happy about it. So, three questions. What does it do? Does
it work? What is the controversy about? Dear Fellow Scholars, this is Two Minute
Papers with Dr. Károly Zsolnai-Fehér. It feels so good to do it like this. Well,
this compresses the KV cache of AI systems, like large language models. This is the
short-term memory of an AI assistant. If you looked inside it, you would see
tons and tons of numbers. These numbers relate to what you are currently talking about:
movies, a bunch of documents, or a huge codebase.
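To get a feel for how big this short-term memory can get, here is a quick back-of-the-envelope sketch in Python. The model shape below is a hypothetical Llama-7B-style configuration, purely for illustration:

```python
# Rough KV cache size: 2 tensors (keys and values) per layer, each of shape
# [context_length, num_heads * head_dim], stored as 16-bit floats.
# These model dimensions are illustrative, not taken from the TurboQuant paper.
layers, heads, head_dim = 32, 32, 128
context_length = 128_000       # a huge codebase or a pile of documents
bytes_per_number = 2           # float16

kv_bytes = 2 * layers * context_length * heads * head_dim * bytes_per_number
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # ~62 GiB before any compression
```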
Now, personally, as a research scientist, what caught my eye was not the media hype, but this.
Oh! A formal mathematical proof that it works. Now we're talking. Okay, question one: what does it do?
And these numbers have lots of digits. Scientists
propose that we chop off the end of the numbers to save memory. Is that a new idea? No.
Is that a good idea? No, unless you are very careful, because you can lose a lot of information
and your neural network might output nonsense.
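To make "chopping off the end of the numbers" concrete, here is a minimal sketch of generic round-to-nearest quantization. This is the textbook version, not TurboQuant's exact scheme:

```python
# A minimal sketch of "chopping off digits": each float is snapped to the
# nearest point on a coarse grid with 2**bits levels.
import numpy as np

def quantize(x, bits=4):
    """Map floats to small integer codes on a uniform grid."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate floats from the integer codes."""
    return codes * scale + lo

x = np.random.default_rng(0).standard_normal(8).astype(np.float32)
codes, lo, scale = quantize(x)
print("max rounding error:", np.abs(x - dequantize(codes, lo, scale)).max())
```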
So, how do you do that carefully? Well, imagine a vector. This is like an arrow pointing somewhere. Sometimes that arrow points mostly along one
axis. So most of its "energy" is in one direction, and a little in other directions.
When you chop that information off, it snaps onto the grid, and you basically
lose everything except that one direction. That is not useful. Now here’s a brilliant idea:
before chopping it off, rotate the arrow in a random direction. Now the energy spreads
more evenly across all directions. So when
you round off parts of it, you lose a little from
everywhere instead of everything from most places. The result? Much less information lost.
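Here is a tiny sketch of that trick. A random orthogonal matrix stands in for the rotation; fast implementations typically use structured transforms instead:

```python
# A vector whose "energy" sits in one dominant direction quantizes badly,
# because the outlier stretches the grid. Rotating first spreads the damage.
import numpy as np

rng = np.random.default_rng(0)

def quantize_roundtrip(x, bits=4):
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

d = 256
x = rng.standard_normal(d)
x[0] = 50.0  # one huge outlier direction

err_plain = np.linalg.norm(x - quantize_roundtrip(x))

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random rotation
err_rotated = np.linalg.norm(x - Q.T @ quantize_roundtrip(Q @ x))

print(f"error without rotation: {err_plain:.2f}")
print(f"error with rotation:    {err_rotated:.2f}")  # noticeably smaller
```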
Is this idea new? No. This is a very old idea. Now they do one more thing. They use a Johnson–Lindenstrauss transform to
compress the data. What is that? Remember, we have a bunch of numbers,
representing arrow directions. And we want fewer numbers to describe these
directions. But very carefully. You do this in a way that guarantees that the
distances between these arrows stay roughly the same after squishing. If you want to sound
really cool, just call it the JL transform.
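Here is what the squishing looks like in a minimal sketch. A plain Gaussian random projection stands in here; the paper's exact transform may differ:

```python
# Johnson–Lindenstrauss-style random projection: describe d-dimensional
# arrows with only k numbers, while pairwise distances stay roughly intact.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 1024, 256, 4            # original dim, squished dim, number of arrows

X = rng.standard_normal((n, d))               # a few random arrows
P = rng.standard_normal((k, d)) / np.sqrt(k)  # the JL projection matrix
Y = X @ P.T                                   # k numbers per arrow instead of d

before = np.linalg.norm(X[0] - X[1])
after = np.linalg.norm(Y[0] - Y[1])
print(f"distance before squishing: {before:.2f}, after: {after:.2f}")  # close
```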
Is that new? Not really. It is a 40-year-old technique. And I think that is the key. Everyone loves to invent shiny new
stuff. But here, quantization is not new,
rotating things around is not new. This
transform is not new. These are three age old ideas combined together to great
effect. Sometimes you don’t need to invent grand new theories. Sometimes you need
a smart combination of existing methods.
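Chained together, the recipe looks something like the sketch below. This mirrors the three ingredients as described above, under my own assumptions, not the paper's exact algorithm:

```python
# Rotate, squish with a JL projection, then quantize to a few bits.
import numpy as np

rng = np.random.default_rng(0)
d, k, bits = 256, 64, 4

R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random rotation
P = rng.standard_normal((k, d)) / np.sqrt(k)      # JL projection

def compress(x):
    z = P @ (R @ x)                # spread the energy, then squish dimensions
    levels = 2 ** bits - 1
    lo, hi = z.min(), z.max()
    scale = (hi - lo) / levels
    codes = np.round((z - lo) / scale).astype(np.uint8)
    return codes, lo, scale        # k tiny integers plus two floats

x = rng.standard_normal(d)
codes, lo, scale = compress(x)
print(f"{d} float32 values ({4 * d} bytes) -> {k} {bits}-bit codes "
      f"(~{k * bits // 8} bytes packed)")
```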
Okay, second big question: does it work in practice? Before concluding that it works, I wanted to see other scientists reproduce
the technique and benchmark it for themselves. This is why this video appears later than most
others, but I think it makes it more truthful. So were other scientists able to
reproduce this technique? Yes. Did they also benchmark it? Yes.
Does this technique help? Yes. But, not so fast! The first tests reveal that
it decreased the memory cost of the KV cache, the short-term memory, by 30-40%. That is fantastic.
I would have been very happy with this. But it
doesn’t end there. Typically you have a tradeoff
where you decrease memory usage at the cost of something. So something needs to slow down.
Now hold on to your papers, Fellow Scholars, because it also sped up processing the
prompts by about 40% as well. What? That is…my brain crashed. We get faster
AI assistants that need less memory at almost zero cost. That is insane. In a world
where it’s harder and harder to own things, this is a blessing. Thank you so much! It is also remarkable that the paper has
barely been out for a week and some of you Fellow Scholars already coded it up.
Nice work. Links are in the description. So, it's not quite like the media says.
Based on the results, we cannot conclude that every AI machine suddenly needs 6 times
less RAM. No. That is a bit idealistic and only
true for some corner cases. You know when
you see an official benchmark of a phone battery or electric car mileage with somewhat
idealized conditions? It is a bit like that. So careful with the media hype. Experienced
Fellow Scholars like you know that in your mind, you have to tone these numbers down a
little. This is why we wait for more data and analyze experiments here, to get
the highest quality information for you. But it’s still good. Really good!
It helps most people who run AI systems with very long contexts. When you
chuck in a huge PDF document, or a movie, or a huge codebase for an AI to analyze. Yes,
you will be able to do that cheaper, with meaningfully less memory. Often a few gigabytes
less. And I think that is absolutely amazing news. Third, I will note that other researchers
point out that the paper overlaps with previous
techniques. They felt that it has similarities
that should be discussed more thoroughly. There was more. Eventually, the paper was accepted
for publication, though not all researchers agree the concerns were fully addressed. I put the
links to all of these in the video description. But this shows that even in modern AI, there
are still simple ideas waiting to be combined to great effect. And that makes this a very exciting
area to be in. What a time to be alive! And if you agree that this is
the right way to talk about papers, please consider subscribing
and hitting the bell icon.