bycloud
Google's TurboQuant Is Way Too Overhyped
2026-04-10 · 14min · 21,367 views
URL: https://www.youtube.com/watch?v=haoAI2lIZ74

Check out Inngest and let your AI agents wear a harness now!

https://www.inngest.com/?utm_source=youtube&utm_medium=video&utm_campaign=yt-bycl-4

With how TurboQuant shook the general public with its insane 6x memory-reduction claim for LLMs, let's take a closer look at what actually happened underneath, and validate their claims by understanding how TurboQuant actually works.
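As a quick sanity check on where headline ratios like this come from: they are just bit-width quotients. A two-line sketch (the specific bit widths here are the ones the claims are built on, as discussed in the video):

```python
# Headline compression ratios are just bit-width quotients.
# The "8x speedup" compares 4-bit quantization against a 32-bit
# unquantized baseline; the ~6x memory figure compares a 16-bit
# KV cache against roughly 2.5 bits per value.
speedup_vs_fp32 = 32 / 4    # 8.0x, vs a baseline nobody serves with
memory_vs_fp16 = 16 / 2.5   # 6.4x, and KV cache only, not weights
print(speedup_vs_fp32, memory_vs_fp16)  # → 8.0 6.4
```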

my latest project: Intuitive AI Academy

We just wrote a new piece on Distillation & MoE!

https://intuitiveai.academy/


A few weeks ago, Google announced this new research called TurboQuant, claiming that their new compression algorithm is capable of reducing LLMs' KV cache memory usage by up to six times and achieving up to an eight-times speedup. This headline itself is huge, reaching up to 38,000 likes, because if TurboQuant is applied, at least for Google, they get to free up around 83% of the memory the KV cache was occupying (a 6x reduction leaves roughly one sixth of the original), which some would say led to the crash of AI chip stocks, with prices of 32 GB DDR5 down at least 30%. So, in a way, TurboQuant is to Micron stocks what DeepSeek was to Nvidia stocks last year, as Micron makes the memory that AI uses the most. But there is just a bit of a problem: TurboQuant is not being advertised in good faith, while DeepSeek V3 was a genuine breakthrough whose impact people simply misunderstood. The eight-times speedup claim alone is already pretty dramatic and even a bit problematic, because if you take a closer look, it is being compared to a baseline that no one uses in practice.

They are comparing 4-bit against a 32-bit unquantized baseline. Of course, on paper that gives you the cleanest headline number imaginable, because 32-bit down to 4-bit is exactly an 8:1 reduction in data movement. So if your bottleneck is just moving data through memory, then sure, you can say "up to eight times" in the best case. But that is a very different statement from saying the method makes real LLM inference eight times faster, because in practice modern LLM inference does not use 32 bits in the first place. So Google definitely made the improvement sound more groundbreaking than it actually is. The real question should be: how much better is TurboQuant than the baselines people already use? And they did not answer that anywhere in the blog or the paper. It's kind of like saying, "I run 100 times faster than a toddler." What's even the point? You should at least compare yourself to someone like Usain Bolt. But I am not saying TurboQuant is a bad paper. In fact, it has a lot of great contributions once you look past the marketing tactics. But before we dive into it, if you're building AI

agents, you have probably noticed a recurring problem: prototyping is really easy, but making agents reliable and consistent in production often requires painfully rewriting the project from scratch. Unexpected things that never happen locally, such as orchestration failures, tool call timeouts, or rate limits, just spawn. And that's exactly where Inngest comes in. Inngest is a durable execution platform. It shifts the fault-tolerant work from systems engineering into back-end code that you can actually ship, so your critical logic keeps running across failures or long time windows without turning your app into an infrastructure project. AI agents' early challenges, like hallucination and safety, were managed through workarounds such as orchestration patterns, human-in-the-loop tool calling, and reasoning, and these also introduce multiple points of failure that Inngest's durable execution is uniquely positioned to address. You get automatic state persistence, intelligent retries, and workflow resumption, so your agent doesn't just fail and restart; it continues from where it left off. Inngest basically acts as the AI agent harness for you, so you can make sure that your agents are always production ready.

For instance, they manage human-in-the-loop with durable suspension, where your agent can pause for approval for hours or days, then continue without losing context. In their latest release, called durable endpoints, you can use Inngest's durability API directly inside your existing API endpoints. So your prototype is production ready from day one, without having to set up workflows or go through that classic two-stage process of building something fast and brittle, then slowing down later to rebuild it for reliability. What's even better is that they have a pretty nice free plan too, with up to 50,000 executions per month. So, if you're building agents and want reliability and a good AI harness, definitely check them out using the link in the description. And thank you, Inngest, for sponsoring this video.

Anyways, you have probably seen a lot of glazing videos talking about how game-changing TurboQuant is, so this video will be more like a reality check to ground it as much as I can. This paper was actually first published nearly a year ago, so it is not as new as Google's announcement makes it sound. But to understand TurboQuant, you first need to understand what it is actually compressing. In a

transformer's attention mechanism, every token is converted into a key vector, which is stored in the KV cache along with a corresponding value vector that contains the actual information to be retrieved later. As more tokens are processed, the model keeps all these past key vectors in memory, and when a key is needed again, it doesn't need to be recomputed because it is saved. Then, when a new token (a query) comes in, the model compares it against all previous tokens by computing dot products between its query vector and every stored key vector. So the KV cache becomes a continuously growing table of vectors, where each stored vector represents a past token, and every new token is evaluated against all of them to determine how strongly they are related. This implies two things. First, these vectors take up a lot of memory: you store a new vector for every token, so the cache grows linearly, and they all have to be kept around to save compute. Second, the dot products between vectors must remain accurate, because they directly determine how much each past token influences the current one. And this is exactly where compression becomes difficult: if you compress these vectors too aggressively, you don't just lose information, you specifically distort the relationships between tokens, which is what the model actually relies on.

So TurboQuant is designed around a very specific goal. It is not just trying to compress vectors in general; it is trying to compress them in a way that keeps their dot products correct while still using as few bits as possible. A major part of this paper is finding that sweet spot, and they achieve it by breaking the problem into two major steps, where each step removes one specific difficulty. The first step deals with the fact that these vectors are highly structured: some dimensions carry much more signal than others, some are almost noise, and many of them are correlated, so the information is unevenly distributed. If you apply a simple compression scheme directly, like quantizing each coordinate the same way, you would waste bits on dimensions that

don't matter much while not allocating enough precision to the important ones, which is very inefficient. On top of that, the relationships between vectors get distorted in unpredictable ways: dot products depend on how all dimensions interact together, so even small errors in the wrong directions can change which tokens are considered more related. And since different dimensions contribute differently, a uniform compression scheme does not preserve these relationships well. But instead of trying to design a more complicated compression scheme to handle all this structure, TurboQuant does something much simpler: it applies a random rotation to the vector, which removes the structure. This idea was first introduced in PolarQuant, which shares an author with TurboQuant: if you first apply a random transformation to a vector, you can reshape its distribution into a much more regular form. This simply redistributes the information across all dimensions without removing information or changing the meaning of the vector. TurboQuant builds on this idea, which means a simple uniform compression scheme can now be used.

Once the vector has been rotated, every coordinate behaves similarly and carries roughly the same amount of information. This does not distort existing information, because the rotation is invertible and preserves dot products, so all relationships between tokens remain exactly the same, just expressed in a different coordinate system. Which means the problem of compressing a high-dimensional vector essentially reduces to compressing a bunch of independent one-dimensional values. So TurboQuant can just apply an optimal scalar quantization method that minimizes reconstruction error, and because the rotation made the data uniform, this simple approach is about the best compression you can achieve. Even though this step reconstructs the vector well overall, there is still a subtle issue: minimizing reconstruction error does not guarantee that dot products remain accurate. Even if the compressed vector looks very close to the original, the way it interacts with other vectors, which is what attention actually uses, can still be slightly off, and that is a huge problem. So this is where step two comes in. After quantization, we compute what was lost, which is the difference between the original vector and the compressed

version. Since this remaining error is much smaller than the original vector, it can be stored very efficiently. But instead of storing it precisely, TurboQuant applies another random projection and keeps only the sign of each component, which uses just one bit per dimension. This idea comes from one of their earlier papers, called QJL, where they found that even with just the sign information, you can recover an unbiased estimate of dot products. What this means is that even though each value is extremely compressed, the dot product will still be correct on average. A single computation might be slightly off, but across dimensions the errors cancel out, so the estimate stays centered around the true value, thanks to the rotation that spreads all dimensions evenly. That's why one bit per dimension, summed over many dimensions, still gives a correct estimate on average and makes the dot product calculation much more accurate. So this is the core idea of how TurboQuant can compress the KV cache from something like 16 bits per value down to around 2.5 to 4 bits while still keeping the model's behavior intact.

As for the accuracy trade-off, they found that at around 3.5 bits per value the compression is essentially quality neutral, meaning the model behaves almost identically to full precision. If you push it further down to around 2.5 bits, you start to see some degradation, but it is still relatively small compared to the amount of memory saved. And keep in mind this compression is mainly applied to the key vectors, where precision matters most for the query-key dot products. So, to fact-check the 6x memory savings claim: where does that number come from? Well, first of all, the savings apply only to the KV cache. The model weights still take up the same amount of space; it's just that the memory cost of a growing context window becomes cheaper. As for the actual numbers, the baseline is once again not something fitting for actual LLMs in production, but a theoretical baseline that lacks optimization. It likely comes from comparing the usual 16-bit KV cache against TurboQuant at around 2.5 bits per value. If you divide 16 by 2.5, you get around 6.4 times memory savings, or 16 divided by

3.5 bits, you get around 4.6 times memory savings. Not to mention, the graphs they use for comparison are against a method published back in 2024 or just the full, uncompressed KV cache. While it is not some sort of graph crime, it does feel heavily narrated in one direction. And again, this only applies to the KV cache, not the model weights. On top of that, in the other graph, where they compare TurboQuant against other research baselines, a lot of problems got pointed out by the community. So, here's the thing: when a very similar research paper was published much earlier than yours, what should a reasonable scientific study do? Discuss it, try to reproduce it, and maybe even make it a baseline. Well, here's the problem with the first version of TurboQuant. There is a very similar technique called RaBitQ, published one year earlier, that also incorporates the rotation method and openly shares its optimized C++ implementation. But for some reason, the TurboQuant authors ported it to Python, and that ported version does not support multithreading. So, they ran the baseline of

RaBitQ on a CPU and ran TurboQuant on a GPU. Of course the gap with the prior method looks huge, because not only is the baseline not compared fairly, it is also running on two completely different pieces of hardware, which makes the comparison basically worthless. What's a bit more absurd is that they dismissed RaBitQ in the paper without any real justification, calling it suboptimal without having fully read its appendix of ablation studies, as they admitted in their latest comment on OpenReview, which is pretty nuts. So, it just doesn't seem like the research was done in completely good faith, especially since the original RaBitQ author has already reached out to them multiple times to try to get them to clarify their findings. There are still more issues surrounding the paper, but I think you probably already get the point, and the TurboQuant authors did promise to improve the paper in their latest comments regarding these issues. So, the news of TurboQuant posted by Google could well be a media stunt or misdirection, unless a technique that was published a year ago still counts as something new to Google.
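The two-step pipeline described earlier (random rotation, then uniform scalar quantization, then a one-bit sign code for the residual) can be sketched as a toy NumPy example. To be clear, this is my own illustrative reconstruction under simplifying assumptions, not the authors' implementation: the function names are mine, and the single-scalar residual rescaling is a stand-in for the paper's actual unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix (QR of a Gaussian matrix).
    # Rotations are invertible and preserve dot products, so they
    # spread the signal evenly across dimensions without losing anything.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def quantize_uniform(x, bits):
    # Step 1: plain uniform scalar quantization, one grid for every
    # coordinate. After rotation, coordinates behave similarly, so a
    # uniform grid is no longer wasteful.
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2**bits - 1)
    return lo + step * np.round((x - lo) / step)

def compress_key(key, rot, bits=2):
    rk = rot @ key                  # rotate into the "structureless" basis
    q = quantize_uniform(rk, bits)  # coarse quantized key
    resid = rk - q                  # what quantization lost
    signs = np.sign(resid)          # Step 2: one bit per dimension
    scale = np.abs(resid).mean()    # one scalar to rescale the sign bits
    return q, signs, scale

def approx_dot(query, rot, q, signs, scale):
    # Attention scores are dot products; computing them in the rotated
    # space is fine because the rotation preserves them exactly.
    rq = rot @ query
    return rq @ q + scale * (rq @ signs)

d = 64
rot = random_rotation(d)
key, query = rng.normal(size=d), rng.normal(size=d)
q, signs, scale = compress_key(key, rot)
print(query @ key, approx_dot(query, rot, q, signs, scale))
```

The sign correction costs one bit per dimension plus one scalar per vector, yet on average it pulls the coarse 2-bit estimate back toward the true dot product, which is the whole point of step two.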

Or maybe they're just promoting it because it was only recently accepted into ICLR 2026. Then again, looking at the one peer review that gave this paper a 10 out of 10, while other reviewers gave it a four and a six, is definitely awkward. But anyways, I am just speculating at this point. This sort of optimization at the KV cache level is not something new, and every company that serves LLMs definitely uses some sort of quantization there. So no, nothing crazy or revolutionary about AI has been discovered, and everyone has already been maxing out compression efficiency in their own ways. Maybe Google was able to improve their memory efficiency a bit, but being able to free up at least 83% of memory on top of the current state-of-the-art methods, in this economy, is pretty much impossible, especially when they didn't even open source any code. So, your AI chip stocks are safe, and you can see how they bounced back a few days later already. And even if an 83% memory saving existed, it would just mean more tokens get consumed, because we are entering an economy of intelligence. And again, this compression technique wouldn't change the amount of memory you need to hold a model on the GPU.

So, yeah, that's it for this video. If you liked how I explained the AI concepts today, you should definitely check out my latest project, Intuitive AI Academy (intuitiveai.academy), which contains intuitive explanations of all modern LLMs from the ground up, ranging from LLM architectures and LoRA to how MoEs work. A total of 24 chapters are currently available, and it will be updated monthly. This is the start of a series where I'll break down AI topics intuitively, because I genuinely think anyone can understand them, no matter how difficult they may seem. So for those who want to get into AI or LLMs, this should be the perfect place to dive into the technical parts without being intimidated by crazy-looking maths. Right now I am also running a new launch discount for 2026, so you can use the code "early" for 40% off a yearly plan. And thank you guys for watching. A big shout out to Spam Match, Chris Ladoo, Degan, Robert Zaviasa, Marcelo Ferraria, Proof and Enu DX Research Group, Alex Midwest Maker, and many others that support me through Patreon or YouTube. Follow me on Twitter if you haven't, and I'll see you in the next one.