bycloud
The Most Clever Trick To Speedup LLMs
2026-04-01 · 12min · 38,278 views
URL: https://www.youtube.com/watch?v=4Ij9YOyrNdM

Try out and get your free credits now on GenSpark AI, as well as unlimited use of AI Chat and AI Image in 2026 for paid users using this link: https://www.genspark.ai/?utm_source=yt&utm_campaign=bycloudAI

In this video, let's demystify one of the coolest tricks researchers use to speed up LLMs, so that we can run them at 2-3x the speed with no drawbacks at all!

my latest project: Intuitive AI Academy

We just wrote a new piece on Distillation & MoE!

https://intuitiveai.academy/

limited t

If you ask any reasoning LLM the question "what's the most clever trick to speed up LLMs?", they would all mention a concept called speculative decoding. This trick definitely feels a bit magical. It's not doing any mathematical tricks or hardware optimization to make the LLM generate faster. Yet it is a pretty much lossless technique that guarantees a two to three times speedup. How is this legal? Well, just like every magic trick, once you understand what happens behind the scenes, you realize it's actually nothing that crazy.

Before I dive into it: while we are talking about theoretical efficiency, there is still so much practical efficiency that standard LLM chatbots leave on the table, which is why I want to put GenSpark on your radar. GenSpark is an all-in-one AI workspace that can research, create, and execute multi-step tasks across things like presentations, data analysis, brand assets, collaboration, meeting notes, and much more, all in one place. And unlike ChatGPT, which is mainly conversational, GenSpark orchestrates multiple top models and tools to deliver finished outputs, not just text. With their latest AI Workspace 3.0, instead of simply asking AI for help, you set it

up to get you moving faster on the busy work so you can stay focused on the decisions and the creative parts. On top of that, the big new thing in this launch, called Workflows, can put routine work on autopilot with pre-built templates, or build step-by-step custom workflows connecting up to around 20 apps. It also has AI slide generation, where you can dump in any messy doc, PDF report, or even piles of raw notes and turn it into a clean, client-ready deck fast. There's also their latest addition, GenSpark Claw, which is basically their version of OpenClaw and can take a task from a message, then go execute it on its own cloud computer. You pretty much don't have to set anything up manually and instantly get all the features connected to your apps. So if these functionalities fascinate you, they currently have a get-started bonus where new users can try premium features for free. And for paid users, they're offering unlimited AI Chat and AI Image usage for all of 2026, with a stack of top models inside those two features. So if you want to give it a shot, check them out using the link in the description. And thank you, GenSpark, for sponsoring this video. Anyway, to understand the black magic, we first need to understand the difference between LLM training and

inference. During inference, LLMs generate tokens one at a time. The model predicts the next token, that token gets appended to the context, and the entire sequence is fed through the model again to produce the next token. Then the loop repeats. Every time this happens, the model needs to perform another forward pass through billions of parameters, with the GPU repeatedly loading the same massive weights again and again just to produce one additional token, which is why generation is so expensive. During training, however, the model is not looping to generate tokens one at a time. Instead, it is simply learning from an entire sequence by performing matrix operations over it simultaneously. So from that single sequence, the model is being trained to solve many prediction problems in parallel, and all of these predictions can be computed in a single forward pass. Instead of processing it word by word, the transformer lays all the tokens out at once and asks every token to look backward at the tokens before it, with each position solving its own prediction problem. Each of these can be stacked together into a matrix where everything is computed in one pass. However, even
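To make that contrast concrete, here's a toy sketch. A hypothetical bigram lookup stands in for an expensive forward pass; the point is only that inference must call it once per generated token, while training can evaluate every position's prediction problem from the known sequence at once.

```python
# Toy illustration (hypothetical bigram "model"): the same prediction function
# applied autoregressively at inference vs. in parallel over a known sequence,
# as happens during training with teacher forcing.

def predict_next(context):
    """Stand-in for one expensive forward pass: predict the next token."""
    bigram = {"the": "cat", "cat": "sat", "sat": "down"}
    return bigram.get(context[-1], "<eos>")

# Inference: one forward pass per generated token (sequential, expensive).
tokens = ["the"]
for _ in range(3):
    tokens.append(predict_next(tokens))
print(tokens)  # ['the', 'cat', 'sat', 'down']

# Training: the full target sequence is already known, so every position's
# prediction can be evaluated "at once" (in a real transformer, one batched
# forward pass with a causal mask; here, a simple comprehension).
sequence = ["the", "cat", "sat", "down"]
predictions = [predict_next(sequence[: i + 1]) for i in range(len(sequence) - 1)]
print(predictions)  # ['cat', 'sat', 'down']
```

In a real transformer the parallel case is a single matrix computation, not a loop, but the information structure is the same: all targets are known up front.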

though the model could compute many positions, we cannot do this at inference because the future tokens are unknown. Then let's just make the future tokens known, right? It's a no-brainer. So researchers invented time traveling. I mean, speculative decoding. The intuition is simple. You have a big state-of-the-art model and a small version of it, which we will call the draft model. The big model is the one that generates the answers we actually care about. It is powerful and accurate, but it is also extremely expensive to run, as every token it generates requires another pass through billions of parameters. The draft model, on the other hand, is much smaller. It is not as smart and will make mistakes more often. But it is very fast, and most importantly, because it is trained on similar data with a similar architecture, it often guesses tokens that the big model would also generate. So instead of asking the big model to generate every token itself, we let the small model guess ahead, like a draft. The draft model might propose a sequence of tokens that it believes should come next. And because the draft model is small, producing these guesses is fast and cheap. Then the big model comes in. But instead of generating tokens one by one, it simply

checks the guesses. It takes the whole sequence and evaluates it in one forward pass, just like in training, except it does not learn from it. Using that single pass, the model can see how likely each of those tokens is given the context before it. So it starts checking from left to right. Does A make sense? Does B make sense after A? Does C make sense after A-B? And so on. If the big model agrees with the draft model, the tokens are accepted. Then you realize that we can produce several tokens for the cost of a single forward pass of the big model, as long as the draft model is correct. So for five tokens, instead of five forward passes on the big model, if the draft model generates all five tokens correctly, the big model only needs to run one forward pass for the check. But if the big model disagrees at some point, it simply rejects the token and substitutes the correct one itself. Then the drafting continues from there, which means we are guaranteed to get the expected results from the big model much faster, without giving up any performance, as long as the error rate of the draft model is low. Running the draft model does introduce extra computation, but the draft model is much smaller than the big model, so its cost is tiny compared to
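The draft-then-verify loop can be sketched as below. This is a simplified greedy (argmax) variant under stated assumptions: `draft` and `target` are hypothetical stand-ins that return each model's most likely next token, and the big model's "single verification pass" is written as one call per position for clarity, where a real system would batch them.

```python
# Minimal greedy sketch of one speculative decoding step.

def speculative_step(context, draft, target, k=5):
    """Draft k tokens cheaply, then verify them left to right with the big model."""
    # 1. Draft model guesses k tokens ahead (cheap, sequential).
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Big model checks each drafted token against its own prediction.
    #    (In practice all positions are scored in ONE batched forward pass.)
    accepted, ctx = [], list(context)
    for tok in drafted:
        expected = target(ctx)
        if tok == expected:          # big model agrees: token accepted for free
            accepted.append(tok)
            ctx.append(tok)
        else:                        # disagreement: reject, substitute the fix
            accepted.append(expected)
            break
    else:
        accepted.append(target(ctx))  # all accepted: bonus token from the same pass
    return accepted

# Example: the draft agrees on the first two tokens, then diverges at the third.
draft  = lambda ctx: "abcXY"[len(ctx)]   # toy "small" model
target = lambda ctx: "abcde"[len(ctx)]   # toy "big" model
print(speculative_step(["a"], draft, target, k=3))  # ['b', 'c', 'd']
```

Note the rejection case still yields a useful token: the big model's substitution means every verification pass makes progress, never zero tokens.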

running the large model. For example, one forward pass of the big model might cost 100, while a draft model that is 10 times smaller costs 10 per token, so 50 to generate five tokens. If you run the big model alone, five tokens cost 500. But if the draft model gets them all right in one go, it only costs 50 plus 100, which is 150. That is already about 3.3 times better. Now suppose the draft model makes mistakes early. Say the first token is wrong, so the big model rejects it immediately. Then the second guess is also wrong. Then the third. Even in that pessimistic case, the cost might look like 420, which is still cheaper than running the big model five times. So only when the draft model is wrong almost every time does the advantage disappear. But that would mean the draft model is disagreeing with the big model almost constantly, which would imply the draft model is extremely bad. That shouldn't be the case, because the draft model is trained on the same data as the big model. It's just smaller, so mistakes shouldn't be that common. But if the draft model is guessing tokens that might be wrong, how do we fix the mistakes? Because if we simply accept or reject tokens naively, we would distort the probabilities that the big model
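The best-case arithmetic works out as follows (the 100 and 10 are the video's illustrative cost units, not measured numbers):

```python
# Worked version of the cost example: one big-model forward pass costs 100,
# one draft-model pass costs 10, and we draft K = 5 tokens at a time.
BIG, DRAFT, K = 100, 10, 5

baseline  = K * BIG          # big model generates all five tokens itself
best_case = K * DRAFT + BIG  # five cheap drafts + one big-model verification pass

print(baseline, best_case, round(baseline / best_case, 2))  # 500 150 3.33
```

The pessimistic "420" case from the video depends on how many tokens survive each failed round, so it is not reproduced here; the point is simply that even several rejections stay under the 500 baseline.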

intended. But why would that even matter? Well, because we'll still be sampling from the distribution, not asking the model to directly output the most likely token. The model might think the next token should be "the". So if we generate many completions, roughly half the time we should see "the", about a third of the time "uh", and occasionally something else. That distribution is what defines the behavior of the model, and it controls diversity, creativity, and even factual behavior. Now imagine we simply trust the draft model whenever it proposes a token. If the draft model tends to prefer "uh" more often than the big model does, then "uh" would start appearing too frequently in the final outputs. And over many generations, the text distribution would slowly drift away from what the big model actually intended. That means we would no longer be running the big model; instead, we would be running some strange hybrid model whose probabilities are distorted by the draft model's biases. This might not seem like a big deal for a single sentence, but at scale it matters a lot, as small probability shifts compound over time. If the distribution is slightly wrong at every

token, the generated text can gradually diverge into completely different outputs, which means the system would no longer match the behavior of the big model you deployed. So suppose the draft model believes the next token should be "a" with around 50% probability, and therefore samples "a" more often. The problem is that the big model thinks it's only around 30%. The acceptance rate is then min(1, 30%/50%) = 60%, meaning there is a 60% chance we accept "a". If it gets accepted, we move on. But if it gets rejected, we do not sample from the original distribution again. Instead, we remove the probability mass the draft model already tried to claim: the new sampling distribution becomes the big model's distribution minus the draft's, with any negative probabilities set to zero, leaving a 50/50 distribution once you normalize it. So in the rejection case, we only sample between "the" and "an". And when you combine the acceptance case and the rejection case together, the final probabilities become exactly the big model's distribution. So even though the draft model is proposing tokens, and even though some tokens are being accepted without the big model generating them directly, the overall distribution of
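This accept/reject rule is small enough to write out in full. Below is a sketch for a single token position; the toy distributions `p` (big model) and `q` (draft model) echo the numbers above, and the empirical check at the end shows the combined procedure reproduces `p`, not `q`.

```python
import random

def speculative_sample(p, q, rng=random):
    """Accept the draft's proposal with prob min(1, p/q); on rejection,
    sample from the residual distribution max(0, p - q), renormalized."""
    tokens = list(q)
    # Draft proposes a token from its own distribution q.
    proposal = rng.choices(tokens, weights=[q[t] for t in tokens])[0]

    # Accept with probability min(1, p(token)/q(token)).
    if rng.random() < min(1.0, p[proposal] / q[proposal]):
        return proposal

    # Rejected: remove the mass q already "claimed", zero out negatives,
    # and renormalize what is left of the big model's distribution.
    residual = {t: max(0.0, p[t] - q[t]) for t in tokens}
    total = sum(residual.values())
    return rng.choices(tokens, weights=[residual[t] / total for t in tokens])[0]

# Toy distributions: the draft over-rates "a" (50% vs the big model's 30%).
p = {"a": 0.3, "the": 0.35, "an": 0.35}   # big model
q = {"a": 0.5, "the": 0.25, "an": 0.25}   # draft model

rng = random.Random(0)
counts = {t: 0 for t in p}
for _ in range(100_000):
    counts[speculative_sample(p, q, rng)] += 1
print({t: round(c / 100_000, 2) for t, c in counts.items()})  # ~ matches p
```

Note how the rejection branch matches the text: for "a" the residual is max(0, 0.3 - 0.5) = 0, so a rejected "a" is resampled only between "the" and "an".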

outputs remains identical to running the big model normally. From the outside, it is as if the large model generated the text itself, just much faster. And this is why speculative decoding is considered lossless: you gain speed, but the statistical behavior of the model stays exactly the same. What's even funnier is that the two papers that first proposed speculative decoding came out in parallel, each proposing similar methods without referencing the other. But the most interesting development for speculative decoding is that a few weeks ago, a new paper called SSD, short for speculative speculative decoding, came out. As its name suggests, it takes the drafting process to another level, because even with speculative decoding there is still a small inefficiency in the process: the draft model proposes several tokens, then the big model verifies them, and during that verification step the draft model is basically just waiting. So what SSD proposes is: what if the draft model does not wait? While the big model is verifying the drafted tokens, the draft model can continue to work ahead and prepare the next possible continuations, as the verification step

can only produce a few possible outcomes, and each of these outcomes leads to a slightly different next context. So the draft model prepares for several of these possibilities in advance and waits for the verifier to finish before drafting again, with the system already having several likely continuations ready. If the actual result matches one of the prepared outcomes, generation can continue immediately without running the draft model again. So if speculative decoding is about guessing future tokens, speculative speculative decoding is about drafting the potential outcomes of the verification itself. Let's take the previous example. Three things could happen, and those outcomes depend on the big model's probabilities, which the draft model does not see. So SSD basically just drafts every possible outcome at every token. Well, "every possible outcome" is not really realistic, because the vocabulary might have hundreds of thousands of tokens. So it does not draft every token; the algorithm chooses only the most likely outcomes according to the draft model, using the draft model's probabilities as a proxy for what the big model is likely to do. So if the algorithm saw the probability of an outcome being too low under the

draft model, it wouldn't make a branch. But if the verifier ends up producing that outcome anyway, the system simply falls back to normal speculative decoding. The SSD method does improve on the current best implementation of speculative decoding, which is in SGLang, providing up to a 50% speedup, and it can be roughly four times faster than standard next-token generation. And the speedup does not come from running the big model less. In fact, the number of times we run the big model stays roughly the same; it comes purely from reducing the idle time of the draft model. Of course, there is additional compute, as the draft model generates extra branches. But again, since the draft model is tiny compared to the large model, the overhead is still small. Another important design choice is that these components can run in parallel on separate hardware. This overlap is where the additional speedup comes from. So naturally, the compute cost is higher than typical speculative decoding when drafting this many outcomes. However, if the speedup matters more, then SSD seems pretty worth it. So yeah, that's it for this video. And if you like how I
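The branch-preparation idea can be sketched roughly as below. This is my own illustrative reconstruction, not the SSD paper's actual algorithm or API: `draft_topk`, `draft_continue`, and the branching numbers are all hypothetical, and real systems would run the pre-drafting concurrently with verification rather than sequentially as here.

```python
# Sketch of SSD-style pre-drafting: while the big model verifies the current
# draft, prepare continuations for the verification outcomes the draft model
# itself rates as most likely.

def pre_draft_branches(context, drafted, draft_topk, draft_continue,
                       n_branches=2, k=3):
    """Prepare continuation drafts for likely verification outcomes."""
    branches = {}
    # Outcome: every drafted token is accepted -> continue from the full draft.
    full = tuple(context + drafted)
    branches[full] = draft_continue(list(full), k)
    # Outcome: rejection at position i, replaced by a token the draft model
    # rates as likely (a proxy for what the verifier might substitute).
    for i in range(len(drafted)):
        for alt in draft_topk(context + drafted[:i], n_branches):
            if alt != drafted[i]:
                key = tuple(context + drafted[:i] + [alt])
                branches[key] = draft_continue(list(key), k)
    return branches

def continue_after_verify(verified_context, branches, draft_continue, k=3):
    """Reuse a prepared branch if the verified outcome matches; otherwise fall
    back to normal speculative decoding (draft from scratch)."""
    return branches.get(tuple(verified_context)) or draft_continue(verified_context, k)

# Toy draft model: fixed top-k alternatives and position-numbered continuations.
draft_topk = lambda ctx, n: ["x", "y"][:n]
draft_continue = lambda ctx, k: [f"t{len(ctx) + i}" for i in range(k)]

branches = pre_draft_branches(["a"], ["b", "c"], draft_topk, draft_continue)
print(continue_after_verify(["a", "b", "c"], branches, draft_continue))  # prepared, ready instantly
```

An outcome the draft model rated as unlikely (e.g. `["a", "z"]` here) misses the cache and triggers the fallback path, mirroring the video's description.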

explained the AI concepts today, you should definitely check out my latest project, IntuitiveAI.academy, which contains intuitive explanations of all modern LLMs from the ground up, ranging from LLM architectures to LoRA and beyond. A total of 24 chapters are currently available, and it will be updated monthly. This is the start of a series where I'll break down AI topics intuitively, because I genuinely think anyone can understand them, no matter how difficult they may seem. So for those who want to get into AI or LLMs, this should be the perfect place to dive into the technical parts without being intimidated by crazy-looking math. And right now I am also putting out a new launch discount for 2026, so you can use the code early for 40% off a yearly plan. And thank you guys for watching. A big shout out to Spam Match, Chris Leoo, Degan, Robert Zaviasa, Marcelo, Ferraria, Proof and Enu, DX Research Group, Alex Midwest Maker, and many others who support me through Patreon or YouTube. Follow me on Twitter if you haven't, and I'll see you in the next one.