Paper: https://arxiv.org/abs/2511.08923
Abstract:
Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for seq
Hello, today we're looking at TiDAR: Think in Diffusion, Talk in Autoregression, by researchers at NVIDIA. This is a really cool paper because it builds on the observation that we don't have full GPU utilization during autoregressive large language model inference, because inference is largely memory bound. So there are going to be times when the GPU isn't fully utilized, and the paper asks: how can we smartly use that extra GPU capacity without making the tradeoffs that other systems attempting similar things typically make? This is about as close to a free lunch as one can get. You're basically just investing extra electricity to do some computation, but you don't have to make the other tradeoffs that are usually involved, such as with speculative decoding or block diffusion. We're going into the method in a bit, but in summary, this paper proposes a hybrid autoregressive-and-diffusion language model architecture that nevertheless samples exactly as an autoregressive model would. So you get all the quality of an autoregressive model, but you're able to achieve a significant speedup by pre-computing things, a bit like speculative decoding, except it does so using diffusion, and it does so explicitly using the extra capacity that's available in the GPUs. The paper starts out by saying, "As we move towards artificial general intelligence." Pretty strong opening statement right there, but it's just an opener, a hook, so to say. After that, they go very quickly into the methods.

There are a couple of things we have to understand before we can understand the method this paper uses, and the two things the paper keeps coming back to are autoregressive models and diffusion language models. Autoregressive models are probably what you know if you've learned about large language models, especially the GPT-type models. You have a sequence, and you always condition on what's called the prefix, sometimes also called the prompt, to produce the next token. Once you've produced the next token, you condition on the new prefix and produce the token after that, and so on. So you're always considering the full prefix and producing the very next token in the sequence. This could be a sentence, this could be a visual language model, whatever; as long as it's tokenized, you can do autoregressive decoding. Now, one thing here is that this would be quite slow to train if you did it naively, because you'd have to process, in this example, five tokens only to compute the loss for this one token, and then process six tokens only to compute the loss for the next one. So people have sped this up by parallelizing it. During inference, we produce tokens one by one, but during training, the naive way would be to say, "Okay, here is a full sentence of tokens," where these circles represent tokens or words, "let's cut it off somewhere here." Then this is the prefix and this is the target token, and you have a classic machine learning problem: you have an X, you have a Y, and you're trying to predict Y from X. That's super inefficient. What you can actually do is construct (how many do we have? eight tokens) eight different losses from this one sequence in parallel, by saying: the empty prefix is my X and the first token is my Y; then this is my X and this is my target; then this is my prefix and this is my target; and so on. And you can do it all in parallel. The only constraint is that you have to create a triangular attention mask, meaning that during the attention computation, each token can only look back, only at the previous tokens. If you don't know what attention computation is, there are tons of videos available; I have one too, on Attention Is All You Need. But in what is called causal attention masking, you're only allowing the tokens to look back, so your attention mask is going to be triangular: you have to imagine the same tokens also aligned along this side, and then this one can look at all of the previous ones, this one can look at all of the previous ones except the last one, and so on. Now, you might think, "Oh, that's obvious; of course you can only look back." However, what you're forgetting is that in causal attention masking, this also applies at the intermediate stages. Imagine you're in this situation and you're computing this token right here. As your signal moves through the layers, all you need at the end is to produce a distribution over this one token, and as you compute the intermediate signals, basically anything should be allowed from a theoretical standpoint. Imagine you as a human looking at this piece of text and trying to infer the next word. You might very well read the last word first, because, you know, "the cat ate the dog" or "the cat chased the dog": you might read it and put some attention on the nouns first, the cat and the dog, and then from those look at the verb, chased. Your attention is going to jump around wildly as you analyze the prefix before you come up with the distribution of the next token. However, this is not allowed in causal attention: even in intermediate processing, you are strictly only allowed to look back. It's a bit of a technical argument, but keep in mind that you are in fact making a tradeoff to train these things more efficiently, because that's what allows the same computation you're doing for this prefix to also be valid for this prefix, and for this prefix right here.
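To make the triangular mask concrete, here's a minimal sketch (my own illustration, not code from the paper):

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular boolean mask: row i marks the positions token i may
    # attend to, i.e. positions 0..i (itself and everything before it).
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

In the attention computation, the disallowed positions (the zeros) are set to minus infinity before the softmax, so each token's output depends only on its prefix.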
I keep repeating this in videos because I feel it shouldn't drop out of the collective consciousness that we're making a pretty serious tradeoff when we apply causal attention: it seriously restricts what would theoretically be possible in terms of attention patterns. Nevertheless, causal attention reduces, at inference time, to doing one token at a time. Then there's another thing called a diffusion language model. In a diffusion language model, you have a prefix, and again you process the signal through layer after layer, but you're generating all of the future at once. There are also different strategies with different masking patterns and so on, but you're basically generating all of the future at once. At the end of this process, for every future token you're interested in, you come up with a distribution over the vocabulary, jointly. In this case, let's say we predict four tokens. You can hopefully observe that you're only getting marginals, a marginal distribution for each of the four tokens. Meaning that if you now want to produce text, you sample from this distribution for the fifth token, from this one for the sixth, from this one for the seventh, and from this one for the eighth. And that's bad, because it completely disregards any interaction these tokens might have: whether you sample a verb or an adverb here (both would be possible) is going to determine what should come after it. However, if you simply sample from the marginals, that influence doesn't happen. There is definitely influence as you compute these distributions, but the sampling processes are distinct from one another: you can't sample one token and then compute the next distribution conditioned on it, because that would be autoregressive, and you'd be back to the situation from before. By the way, diffusion language models are trained by simply taking a full sentence, masking some of it out, and then trying to predict the masked tokens. If you've ever seen something like BERT, that's what this is.
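To see the marginal-sampling problem concretely, here's a toy sketch of my own (not from the paper): two positions whose joint distribution only allows consistent pairs, but where sampling each marginal independently breaks the consistency half the time:

```python
import random

random.seed(0)

# Toy joint: the "right" continuations are ("cat", "meows") or ("dog", "barks"),
# each with probability 0.5. The marginals are uniform over {cat, dog} and
# {meows, barks}, so each marginal alone looks perfectly fine.
nouns, verbs = ["cat", "dog"], ["meows", "barks"]

# Sampling each marginal independently yields inconsistent pairs like
# ("cat", "barks") about half the time.
samples = [(random.choice(nouns), random.choice(verbs)) for _ in range(10000)]
consistent = sum(1 for n, v in samples
                 if (n, v) in [("cat", "meows"), ("dog", "barks")])
print(consistent / len(samples))  # close to 0.5, not 1.0
```

A model sampling from the true joint would score 1.0 here; independent marginal sampling cannot, no matter how accurate the marginals are.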
It's interesting that we're back to BERT-style masked language modeling for generative language modeling, because I remember when BERT came out, the whole world was trying to generate with it, and it just kind of didn't work. We've gotten better at a lot of things since then, and it's interesting that a method that was basically discarded ("oh no, generation doesn't work with masked language modeling") is now making a comeback. So which one's better? The autoregressive model gives you better quality because it's principled: it says, "I'm going to produce one token based on all of the tokens I know of, and then I'm going to sample it. Once I've determined that token, I'm going to produce the next token from all the ones I know." The diffusion language model, on the other hand, is faster because it can generate many tokens at once. However, because you're only sampling from the marginals, there is no interaction, and typically your performance degrades. By the way, even if this distribution right here is the perfect marginal, it doesn't know what you're actually going to end up sampling over there; it can only be based on the marginal, not on the realized sample. Okay. So autoregressive is better: if possible, we'd want to produce the sequence autoregressively. But diffusion is faster. How can we get both? The last thing you need to know is speculative decoding, which shows up down here in the related work. Speculative decoding is a really interesting technique, and it comes back to this fact about parallelizing training in autoregressive models. Why can't we just parallelize inference the same way we parallelize training? Because during training, we know the whole future already: we know all the tokens, so we can compute everything in parallel. During inference, we legitimately don't know what's going to come next, so how could we compute anything for this token right here, when we don't know what comes before it? Okay, but there is a solution, and it's the following. Imagine for a moment that we're doing greedy sampling: we always take the token with the highest probability.
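For reference, greedy autoregressive decoding is just this loop (a sketch; `toy_model` is a made-up stand-in, not a real language model):

```python
def greedy_decode(model, prefix, n_new):
    """model(tokens) -> distribution over the next token (dict: token -> prob).
    Appends n_new tokens, always picking the argmax of the distribution."""
    tokens = list(prefix)
    for _ in range(n_new):
        dist = model(tokens)
        tokens.append(max(dist, key=dist.get))  # greedy: take the most likely
    return tokens

# Toy stand-in model: deterministically continues the alphabet.
def toy_model(tokens):
    nxt = chr(ord(tokens[-1]) + 1)
    return {nxt: 0.9, "?": 0.1}

print(greedy_decode(toy_model, ["A", "B", "C"], 3))
# ['A', 'B', 'C', 'D', 'E', 'F']
```

Note the sequential dependency: each call to `model` needs the token appended by the previous call, which is exactly why this loop cannot be trivially parallelized.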
Now suppose I had an oracle, and I need to produce, say, three more tokens right here. The oracle tells me it's going to be B, D, A. Okay, if it's a true oracle, I can just take that. But let's assume the oracle sometimes lies to me; say, when it lies, it lies about all the tokens, and when it's truthful, it gives me the real truth. Well, if I have this suggestion, I can just check it. Because I now "know" the future, I can check all of these positions in parallel: given this prefix, what's the likelihood of the B token? Given this prefix plus B, what's the likelihood of the D token? And given the prefix plus B and D, what's the likelihood of the A token? I can do all of this in parallel thanks to causal attention, just like during training. And if it turns out that B is indeed the most likely token given the prefix, D is the most likely given the prefix plus B, and A is the most likely given the prefix plus B and D, then I know that if I had sampled my sequence autoregressively, one step at a time ("What's the most likely token? It's B. Now, from this prefix, what's the most likely token? It's D. From this prefix? It's A."), I would have gotten the exact same result. So if I have something that gives me a suggestion, I can check that suggestion in parallel, and if it turns out that, had I done the computation the regular way, I would have arrived at the same result, then you can hopefully see how this is faster, because I check all the positions at once. Now, why don't we always do this? Because we don't have an oracle. You might say, "Well, we could just guess; we could just put any tokens there." That's fair, but vocabularies are on the order of 32,000 tokens, so the likelihood of even getting the first token correct is tiny.
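The parallel check can be sketched like this (my own illustration; the `model` here is a toy stand-in that returns the argmax prediction for every position at once, mimicking a single causally masked forward pass):

```python
def verify_greedy(model, prefix, draft):
    """Check a drafted continuation against the model's greedy choices.

    model(tokens) -> list of argmax next-token predictions, one per position
    (stands in for one parallel forward pass under causal masking).
    Returns the longest accepted prefix of the draft.
    """
    preds = model(prefix + draft)          # ONE pass scores every position
    accepted = []
    for i, tok in enumerate(draft):
        # The prediction made at the position just before draft slot i is the
        # model's greedy choice for that slot.
        if preds[len(prefix) + i - 1] == tok:
            accepted.append(tok)
        else:
            break                          # first mismatch invalidates the rest
    return accepted

# Toy stand-in model that always continues the alphabet.
def toy_model(tokens):
    return [chr(ord(t) + 1) for t in tokens]

print(verify_greedy(toy_model, ["A", "B", "C"], ["D", "E", "X"]))  # ['D', 'E']
```

One forward pass verifies the whole draft; the loop over the draft is just bookkeeping over already-computed predictions.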
And if the first token is incorrect, all the computations for the rest don't matter: who cares whether D is the most likely token given a prefix that includes B, if B wasn't the token that was actually sampled? That's useless computation. You can still do it in parallel, so you didn't really pay anything for it, but it's useless, and that's why we don't usually do it. So the whole game here is: can we come up with a technique that gives us a suggestion for the future that's accurate enough that, in a lot of cases, it's actually correct, so that we can quickly check whether it would give us the same result as the autoregressive model? That's speculative decoding. What speculative decoding does is basically say: here you have this prefix; let me use a small, very fast language model, kind of a distilled version, an approximation of the big one, to quickly produce some tokens, and then let's use the large model to check them. That's how we speed things up. If the time it takes to run the small model for, say, N tokens, plus the time it takes to run the big model once to check those N tokens, is smaller than the time it takes to run the big model N times, then I win. But that's not all, because I have to multiply by some factor alpha, where alpha is the likelihood that the small model is actually correct. If the small model isn't correct, there's a term like 1 minus alpha times the big-model time again (or, more accurately, something depending on how many tokens the small model got wrong), but I hope you get the point. It's all a question of how good the small model's guesses are and how fast it is. If it's slow and its guesses aren't that accurate, then running the small model nullifies all the gains from speculative decoding; it could even hurt. If we always run the small model and it's always wrong, so that we need to use the big model anyway, then we've gained nothing; we've actually lost time, because we also had to run the small model.

And that's where this paper comes in. This paper finds a way to give us these suggestions basically for free, because it notices that we have unused GPU capacity during autoregressive decoding, and that capacity is what the model uses to compute the suggestions, what they call a draft, for the next step. That's what this paper does. It says: we have an architecture that enables parallel token computation from the marginal distribution via diffusion, and high-quality sampling from the chain-factorized joint distribution via autoregression. At each generation step, they partition the tokens into three sections: prefix tokens, tokens proposed in the previous step, and tokens pre-drafted for the next step. They reuse the KV cache of prefix tokens from the last step. Tokens proposed in the last step are autoregressively sampled via rejection sampling, guided by the autoregressive likelihood computed at the current step. And they also pre-draft proposals conditioned on all possible prefix outcomes of the rejection sampling. This is going to be the topic of what we look into next, but the last thing I have to explain is what they call rejection sampling. So far, in speculative decoding, we've assumed there is a "right" token to sample, because we've assumed you always take the most likely one. In practice, obviously, you sample from the distribution. So you have to imagine that the situation isn't simply that the small model tells you it's B, the big model tells you no, it's A, and therefore it's wrong. Instead, the small model gives you some sort of distribution (sorry, this is a histogram; it reads in this direction, I hope that's clear: the vocabulary is along this axis, and this is the likelihood, or density, or, for a discrete distribution, the probability), and the big model also gives you a distribution. These distributions might be similar: if the small model is really accurate, they'll be close, though obviously never exactly the same; if the small model is really bad, they might be quite dissimilar. So the question is: how can we use these proposals from the small model's distributions and still end up in a situation that is mathematically exactly the same as if we had used the large model to do autoregressive sampling? The answer is rejection sampling. There is a technique called rejection sampling by which you can sample from the proposal distribution and then use the target distribution to compute an acceptance probability by considering the ratio between the two. I'm not an expert on rejection sampling, but basically, you have a procedure to accept or reject a proposed sample from the proposal distribution, given that the other distribution is actually the one you wanted to sample from in the first place. That's why you can generate tokens with diffusion, where the diffusion tokens come with their own distributions, and then use the big model's distributions to decide whether to accept them or not. It's just a more involved version of the greedy case, where it degrades to: if the big model says a different token than the small model, reject. But in principle it's exactly the same thing: you use the large model to check the small model, and you can accept or reject tokens, and you always do that in order. So if you accept B and then reject D, you must also reject A; you've only accepted B, and from there you continue. That's the gist.

Now we dive into the technique. They have a nice diagram here that explains things; it's a bit involved, but I think it actually explains things quite well. In this situation, imagine we already have tokens A, B, and C produced; that's our prefix. We have a KV cache, which contains the keys and values from the attention computation. Because the keys and values are always the same for the same token at the same position, you can store them during autoregressive decoding, which means you don't need to compute them again during the attention computation; you only need to compute the queries for the next token you consider.
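A minimal sketch of the KV-cache idea (my own illustration, not the paper's code): keys and values for past positions are computed once and cached, so each decoding step only projects the new token and attends over the cache:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wk, Wv, Wq = (rng.standard_normal((d, d)) for _ in range(3))  # toy projections

cache_k, cache_v = [], []

def step(x_new):
    """Process one new token embedding; reuse cached K/V of all past tokens."""
    cache_k.append(Wk @ x_new)          # only the NEW token's key and value
    cache_v.append(Wv @ x_new)          # are computed; the rest come from cache
    q = Wq @ x_new                      # query is needed only for the new token
    K, V = np.stack(cache_k), np.stack(cache_v)
    att = np.exp(K @ q)                 # (unscaled) attention over cached slots
    att /= att.sum()
    return att @ V

for _ in range(3):                      # three decoding steps
    out = step(rng.standard_normal(d))
print(len(cache_k))  # 3 cached keys, one per generated position
```

This is why decoding is memory bound: each step moves the whole cache through memory while doing relatively little arithmetic, leaving compute units idle.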
So now, in this situation, everything happens at once. Say we have A B C, and we need to produce the next token. If we just autoregressively sampled here, we'd get D. And that's fine; that's what we'd normally do anyway. All the work done in addition to this (they call these "free token slots") happens in parallel to the forward pass that samples D, and is therefore kind of free, so to say: we don't waste any time, we're simply using extra capacity in the GPUs. Now let's go back a bit. Say we have a proposal, D E F, and we'll see in a moment where this proposal comes from. In the same forward pass in which we would have simply sampled D autoregressively, we can use that forward pass to check the sequence D E F following A B C under the autoregressive model; a single forward pass is enough to check all three. Say we accept the first two, but the proposal got the third one wrong. We've still made a gain, because we've produced two tokens in one forward pass; in this particular step, we're already twice as fast. So the question is: where does the proposal come from? That's exactly where the diffusion comes in. Actually, let's assume for a moment we accept all of them: three tokens for the price of one, super good. In the same forward pass, we can do yet another thing: we can already produce the draft, the proposal, for the next step. In this case, we assume that the prefix we have plus the proposal is the new prefix, and we use a diffusion language model to produce three more tokens: G, H, and I. I'm not even sure if that's the order of the alphabet.
Yeah, it is. Okay. So these are diffused; let's mark them with a small d. They are diffusion tokens, so they're not as good as autoregressive tokens, but we can produce all of them at the same time, and in the same forward pass. So in one forward pass, we use the autoregressive model to check the current draft (that's the "AR check", the fat marker), and we use diffusion to produce the next draft. We're gambling a bit here. We're gambling that if the autoregressive model accepts all three tokens, then we already have the draft for the next step: we'd have A B C D E F, and I'd already have a proposal, G H I, because I computed that draft on the assumption that this is the prefix, so the proposal is valid. However, say the autoregressive model rejects the F token. Then I have to discard my draft, because it's no longer valid: the draft's assumed prefix, A B C D E F, isn't the true prefix of the next step. The prefix of the next step is only A B C D E, because I rejected F, because F was a bad choice. The rejection sampling can only tell you whether F is consistent with what the autoregressive model would have done or not; as described, it simply rejects. Though, at least in the standard speculative sampling recipe, a rejected position is then resampled from an adjusted residual distribution, so you do always get at least one new token per step: it can't reject D and then not tell you anything. In any case, I hope you can see that if the autoregressive model accepts the entire draft, then we can, in the same forward pass, already compute the draft for the next step, and go on and on. The only problem is when the autoregressive model doesn't accept the whole draft.
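For completeness, the accept/reject rule of standard speculative sampling looks like this (my sketch of the generic recipe, not necessarily the paper's exact variant): a token drawn from the drafter's distribution q is kept with probability min(1, p(x)/q(x)), where p is the target model's distribution.

```python
import random

random.seed(1)

def accept_token(token, p, q):
    """Keep `token` (drawn from drafter distribution q) with probability
    min(1, p[token] / q[token]). Combined with residual resampling on
    rejection, accepted tokens are distributed exactly as samples from p."""
    return random.random() < min(1.0, p[token] / q[token])

p = {"F": 0.7, "G": 0.3}   # target (big, autoregressive) model
q = {"F": 0.2, "G": 0.8}   # drafter
# A token the target likes at least as much as the drafter is always kept:
print(accept_token("F", p, q))  # True (ratio 3.5, clipped to 1.0)
# A token the drafter over-proposed survives only a fraction p/q of the time:
rate = sum(accept_token("G", p, q) for _ in range(10000)) / 10000
print(round(rate, 2))  # close to 0.3 / 0.8 = 0.375
```

This is the sense in which the output is "mathematically exactly the same" as sampling from the big model, even though the drafts came from elsewhere.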
Well, what can we do? Again, we have free capacity, so couldn't we simply compute drafts for all possible outcomes? We have A B C; that's the prefix. Option one: we accept D, E, and F. Option two: we accept D and E, but not F. Option three: we only accept D (or rather, we produce D; we can always get at least one token out of this procedure). Couldn't we simply compute the draft G H I for option one, F′ G H for option two, and E″ F″ G″ for option three? We can use the diffusion language model to compute all of these drafts at the same time, so that no matter where we land in the rejection sampling, we always have a valid draft for the next step available. Say we're in the situation where the next prefix is A B C D. Well, look at that: I have a draft ready that actually used exactly that as its prefix, and which is therefore valid and can be checked against in the next step. That's what TiDAR ultimately ends up doing, and that's what this graphic shows. You can see here that we're doing autoregressive decoding to check the current draft: for D, the target is E; for E, the target is F′, or rather the autoregressive decoding decoded F′ right here. So D and E are accepted, but since F is wrong (the autoregressive model wanted F′, or at least something different), we reject F while accepting D and E. At the same time, this part over here doesn't know yet that the autoregressive part decoded something different; you have to imagine the step boundary here, with time flowing this way. So, at the same time as we're checking against the drafts from the last step, we're also computing all the possible futures: if we accept D, let's compute a proposal for the next step; if we accept D and E, let's compute another proposal; and if we accept D, E, and F, let's compute yet another, all at the same time, in a single forward pass. We compute all the possible drafts for all the possible futures, and then, once we know how many tokens we've accepted (say, D and E), we simply select the proposal that matches and go into the next step with it. That's it. You can do all of this with smart attention masking. You can see here, this is inference: our prefix tokens are actually placed at the end right here so that we can reuse the same mask and simply shift this border depending on the length of the prefix; everything else stays the same. Then you can see that we have a fixed length of diffusion
tokens here that we're going to um to produce. But you can see and this is three in this case. So, we have the prefix A B C tokens drafted from the last step, causal attention for those tokens, and then for producing the masks uh for producing the next draft you can see here this is the part where we assume only D is accepted. So, you have attention to D and you have full bidirectional attention like in is like diffusion language models have bidirectional attention. They are not subject to causal masking, which should be obvious. Um this part here is the part that assumes uh D and E are going to be accepted in the future. Uh and again, full bidirectional attention within the masked um within the draft. And then this part here assumes that D E and F
are going to be accepted. Um you can see here this attention and and this attention They always everything can always attend to the prefix tokens, right? That's why they're over here. And so, if you structure your mask like this, then you can hopefully see that um you can compute everything at the same time. Right? All of this happens at the same time. Now, you might ask, "Wait a minute. Before I basically my my causal masked look, you know, I had this. Uh let's see. Different color. I had this here and this. That was That was before. Uh now I have like all of this and all of this. Won't this just blow up my you know, my memory basically? Like am I not just quadratically increasing something right here? And the answer is kind of
no. Well, first of all, um you can see with more prefix tokens, it simply grows in this direction. So, all you're doing is you're kind of multiplying by a constant factor in in this size here, like three four. So, you just doing times four here. And the second thing is with things like flash attention, you can actually trade off um memory for compute. And that's exactly what we're what we need, right? We All of this is based on the fact that you we are currently not using all the compute available in the GPUs because we're memory bound. And so this is exactly a way to use that compute. I might be wrong in this actually. I'm just kind of assuming this. I haven't looked into the code. So I might be wrong on this one. But in any case Yeah, the whole point here is that we're using extra compute to compute all these other things, right? Which is basically
free: we don't pay extra for all of these drafts for the future, because that extra compute is just sitting unused currently. During training it's pretty simple: they use block diffusion. So we assume full attention over the prefix, and, as you can hopefully see in the position IDs, it's 0 1 2 0 1 2, then 3 4 5 3 4 5. So we're using the same tokens twice, once for the autoregressive loss function and once for the masked language modeling loss function, with block attention. So it's not full diffusion masked language modeling where everything can attend to everything; it's block-wise. Within the block it's diffusion, and the block is exactly the
same size that we want our future drafts to be. So you can use the same forward pass not just during inference to compute all this stuff, but also during training: the same forward pass computes the loss function for the autoregressive model and the loss function for the diffusion model. In fact, they can be the same model, right? And that's maybe the trade-off we're making here: you now have to train one model to do two different things, diffusion language modeling and autoregressive language modeling. I'm not sure how hurtful that is, because ultimately they're doing the same task and might even benefit from each other's computation. But the same would be true if you imagined two parallel models here, each trained on its own task.
One is for autoregressive modeling and one is for diffusion modeling; that would just be some more parameters. So you have the freedom to shift these things around. That's basically it. They add these two losses together with a factor of alpha, and alpha is just one, so they simply add the losses from diffusion language modeling and autoregressive language modeling. And they say it has no hyperparameters to tune during inference, mainly because it doesn't do any fancy masking strategies. It just always masks everything, because when it produces a draft it wants to produce all the tokens of the draft at the same time. Since it's just a draft and you have the autoregressive model anyway, you don't need the fancy masking or unmasking strategies that typically control the trade-off in
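The training setup I just described can be sketched in a few lines. Again, this is my own illustration, not the paper's code: the function name `combined_loss`, the tensor shapes, and the way I build the duplicated position IDs are assumptions; only the "sum the two losses with alpha = 1" part is stated in the paper.

```python
import torch
import torch.nn.functional as F

def combined_loss(ar_logits, diff_logits, targets, alpha=1.0):
    """Sum of the two training objectives on the SAME targets (alpha = 1).

    ar_logits:   (batch, block, vocab) logits from the causal positions
    diff_logits: (batch, block, vocab) logits from the duplicated masked positions
    targets:     (batch, block) ground-truth token ids
    """
    ar_loss = F.cross_entropy(ar_logits.flatten(0, 1), targets.flatten())
    diff_loss = F.cross_entropy(diff_logits.flatten(0, 1), targets.flatten())
    return ar_loss + alpha * diff_loss

# The duplicated position IDs from the example: each block of length 3
# appears twice, once for the causal copy and once for the masked copy.
block, n_blocks = 3, 2
position_ids = [p for b in range(n_blocks)
                  for p in list(range(b * block, (b + 1) * block)) * 2]
# → [0, 1, 2, 0, 1, 2, 3, 4, 5, 3, 4, 5]
```

So one forward pass over the duplicated sequence yields both sets of logits, and the two cross-entropies are just added together.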
diffusion models. However, it's not excluded. You could definitely imagine a world where you use some fancier unmasking strategy and invest a bit more time into the diffusion draft. Say you do a couple of forward passes, and in each forward pass you uncover, I don't know, five tokens with some fancy unmasking strategy. You might still be better off doing that and then using the whole thing as a draft for the autoregressive model, rather than spending those forward passes on autoregressive decoding. It all depends on how good the diffusion model is at giving you accurate suggestions. In this particular case it doesn't really matter, because the compute is quote-unquote free: since we produce everything at the same time, it can be achieved in a single forward pass, and that can be the same forward pass you're using anyway to autoregressively decode the next token.
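To make the accept/reject step concrete, here's a toy sketch of greedy speculative verification, which is how I understand the drafts get checked against the autoregressive model. The exact acceptance rule in the paper may differ; `verify_drafts` and its "emit the AR token on the first mismatch" behavior are my assumptions.

```python
def verify_drafts(drafts, ar_preds):
    """Greedy speculative verification (my sketch, not the paper's exact rule).

    drafts:   the k tokens proposed by the diffusion head last step
    ar_preds: ar_preds[i] is the AR model's greedy token at position i,
              computed in the same forward pass assuming drafts[:i] held.
    Returns the tokens actually emitted this step (always at least one).
    """
    out = []
    for d, a in zip(drafts, ar_preds):
        if d == a:
            out.append(d)   # draft agrees with the AR model: keep it, continue
        else:
            out.append(a)   # mismatch: emit the AR token instead, discard rest
            break
    return out
```

Even in the worst case (every draft rejected) you still emit one AR token per step, so you never decode slower than plain autoregressive decoding; a good diffusion draft just lets you commit several tokens at once.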
All right. Experiments. You can see I've left this section relatively blank, and that's because I feel the introduction and the abstract already summarize the main experimental results quite well: you get performance similar to autoregressive models, but a massive speed-up, because you are now simply checking drafts, which is a lot faster. So that's this section right here. They're running this at the 1.5-billion-parameter scale. Thanks to parallel drafting and sampling, as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and LLaDA in both efficiency and quality. Most notably, TiDAR is the first
architecture to close the quality gap with autoregressive models while delivering four to five, or even six, times more tokens per second. That's pretty insane. And again, they close the quality gap with autoregressive models, meaning that, since the sampling is mathematically equivalent to autoregressive sampling, the only real downside is that you're training your model with this auxiliary diffusion loss alongside the pure next-token prediction loss. And since it's for the same task, it might not be much of a hindrance. That's essentially the only trade-off versus pure autoregressive models, so it's not hard to believe that this model actually closes the gap, i.e. is as good as autoregressive models while being a lot faster. And against other diffusion models, it's just as fast but much
higher in quality, because in the end it samples like an autoregressive model. Cool, that was it for the paper. I enjoyed this one; it's pretty cool. It observes an opportunity and uses it smartly. That's all I have to say about the paper. Thanks so much for listening, and I'll see you around. Bye-bye.