bycloud
DeepSeek V3.2 Just Broke SoTA Again… But How?
Channel: bycloud
Date: 2025-12-08
Duration: 12min
Views: 192,561
URL: https://www.youtube.com/watch?v=pljoUcBniPQ

Check out HubSpot's FREE AI Decoded Pocket Guide: https://clickhubspot.com/d21e13

In today's video, I'll be breaking down what made DeepSeek V3.2 such an important paper, how DeepSeek-V3.2-Speciale got so good, how DeepSeek created this model, and DeepSeek's new secret weapon: DeepSeek Sparse Attention (DSA).

my latest project: Intuitive AI Academy, learn modern AI/LLMs Intuitively

https://intuitiveai.academy/

code "NYNM" for 50% off forever (limited to 50)

My Newsletter

https

I think no words can properly express how excited I am about this DeepSeek V3.2 release. So, let me list out the things that got me this excited. DeepSeek has pretty much beaten OpenAI's flagship model, GPT-5 High, and is comparable to Gemini 3 Pro. It is the cheapest frontier AI model you could ever expect, 10 times cheaper than any state-of-the-art. At this pace, intelligence will actually be too cheap to meter. It's pretty much open source: a state-of-the-art model that is a git command away. Can you believe it? And there are so many new technical insights being shared, notably three really interesting ones that I'll cover later. V3.2 is essentially the completion of their full RL experiments, and from the look of their conclusion, their future DeepSeek V4 will probably be insane. There seems to be no wall in sight, with them ending their technical report on the note that compute is the actual differentiator going forward, and I think they make that point in a very elegant way. So, for each of the points I just mentioned, I will be

explaining them in more detail. And before I dive in: with DeepSeek bringing amazing agentic functions to the table and new AI models launching every week, figuring out which ones are actually useful is becoming a full-time job. Knowing exactly which model to use and how to prompt it is also what differentiates typical users from seasoned users. So if you want to skip the trial and error and go straight to the tools that matter, there's a faster way to do it. This free AI Decoded pocket guide from HubSpot cuts right through the noise for you. They are the ones digging through the tunnels of AI to mine out the gems and deliver them directly to you. Inside, you'll find a breakdown of exactly which model to use for what, like why you should use Claude for nuanced writing but switch to ChatGPT for STEM tasks. It also covers the must-try GPTs for business, from specialized image generators to tools like Consensus for research and Taplio for professional growth. My favorite section is definitely the GRWC prompting framework, which stands for goal, return format, warnings, and context. The guide explains how using this specific structure can get you 55% better results, effectively transforming

mediocre outputs into exceptional ones just by tweaking how you ask the question. So, if you're ready to upgrade your toolkit and master the prompts that seasoned prompters use, check out this free HubSpot resource using the link down in the description. And thank you, HubSpot, for sponsoring this video. Anyway, seeing DeepSeek being this brutally honest about their model performance is so entertaining to read, but I guess they were never pretentious in their claims anyway. Usually in their model releases they'll just say something like "improved benchmark performance," but this time around they casually drop the bomb that V3.2 is better than GPT-5 High, which is what I like to see. But it's not actually the base 3.2 that beat the model. It's DeepSeek-V3.2-Speciale. Ooh, I'm definitely not trying to pronounce that. Let's just call it the 3.2 Special model. This time they actually published two models: the 3.2 model is a well-rounded model, and the 3.2 Special is a model trained specifically for extended reasoning. They made this version so that they could kind of test the ceiling of their

3.2 model, but getting a pretty much state-of-the-art model was probably still a surprise to them. Without the Special version, the model's performance is slightly behind GPT-5 High, probably a tiny bit better than Kimi K2, and around MiniMax M2 level. But with Special, it outperformed all of those models on the current hardest public math benchmarks, and this trend also shows on private math evals. The only subpar aspect is its coding, which is a tiny bit behind Gemini 3 Pro, but it still beats GPT-5 High across the board. And that is not the only good part. This is a benchmark created by a Chinese blogger and posted on Zhihu. The benchmarking cost here is calculated by how many tokens each model uses to complete all the benchmarks, with 3.2 Special using double the amount of tokens but still being at least 10 times cheaper overall. Can you believe that? And they are not making losses on this; 3.2 is just that cheap for inference. This blog also contained more

detailed end-user performance analysis, describing it as an extremely strong mathematical and scientific reasoning model with limitations in spatial reasoning, coding, and long-context hallucination prevention. And this cost reduction is also something new for DeepSeek. In their official DeepSeek V3.2 technical report, they compared the inference cost between 3.1 Terminus and 3.2. And look at this: DeepSeek 3.1 Terminus is already a really efficient model to begin with, but DeepSeek 3.2's cost efficiency has gotten so good that its curve practically became the x-axis. Can you believe that? And you know what's the most absurd part? This model is literally one git command away from you. It's free. You can download it, though not necessarily run it, because it is still a 685-billion-parameter model. But the point is, this is mind-blowing. It's like experiencing DeepSeek R1 all over again. And you know how Sam Altman responded to this DeepSeek release? A code red, where they're delaying their products like

integrating ads into ChatGPT, and going back to research. Well, they still pushed out the ads, but yeah, it just looks really rushed. I guess that's one less reason to use ChatGPT. But anyway, the amazing thing about what DeepSeek has achieved here is not simple scaling, but a lot of clever research combined with scaling. So, how have they done it? Well, a great thing about open-source research is that it's all online for us to read ourselves. But going through everything might take too much time for you, so I'll be highlighting three of my favorite points. The first is DSA, short for DeepSeek Sparse Attention. This attention mechanism in 3.2 is not completely new. It was released in an experimental model earlier, in September 2025, under the name DeepSeek-V3.2-Exp, which was made with the sole purpose of testing DSA at scale. So they already have a paper from that release that explains DSA in depth too. In short, 3.2-Exp uses the exact same architecture as DeepSeek 3.1 Terminus, but they swapped out the MLA from 3.1 Terminus for DSA and continued

to train on. I put it into a table here so it's easier to visualize the difference and to explain what makes DSA so good without going too deep into the technicals. Instead of doing the full attention calculation immediately, DSA first runs a lightning indexer. This is a lightweight module that quickly scans all previous tokens and computes a rough relevance score to determine which previous tokens are potentially important for the current step. Then, based on those rough scores, it selects the top-k most relevant tokens and discards the rest, so the actual attention is only run on this top-k subset. That lowers the compute from O(L²) to O(L·k), where k is the size of the selected subset. This not only makes the compute incredibly cheap, but the cost also doesn't explode the longer you use the context window. If you remember, some state-of-the-art models charge different prices once you go past a 100k or 200k context window; that's because the quadratic attention cost starts to slow the model down and increase the computational cost. So, you would assume this skipping by DSA would make the model hallucinate or forget things in longer contexts. While their
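own report says it holds up, more on that in a second. First, the mechanism itself fits in a few lines. Here's a toy, single-query sketch of the two-stage idea as I just described it; this is not DeepSeek's actual implementation, and all shapes and names are illustrative stand-ins:

```python
import math

def sparse_attention_step(q, keys, values, idx_q, idx_keys, k_top):
    """Toy DSA-style step for ONE query token (illustrative only)."""
    # 1. Lightning indexer: a cheap rough relevance score for every previous
    #    token, computed with small indexer projections rather than the
    #    full-width keys.
    rough = [sum(a * b for a, b in zip(row, idx_q)) for row in idx_keys]
    # 2. Keep only the top-k scoring token positions; discard the rest.
    top_idx = sorted(range(len(rough)), key=rough.__getitem__)[-k_top:]
    # 3. Run real softmax attention only over that subset, so the whole
    #    sequence costs O(L*k) instead of O(L^2).
    d = len(q)
    logits = [sum(a * b for a, b in zip(keys[i], q)) / math.sqrt(d)
              for i in top_idx]
    m = max(logits)
    w = [math.exp(x - m) for x in logits]
    s = sum(w)
    w = [x / s for x in w]
    return [sum(w[j] * values[top_idx[j]][t] for j in range(len(w)))
            for t in range(d)]
```

Note that with k_top equal to the full context length this reduces exactly to ordinary dense attention, which is why quality can stay close to the baseline whenever the indexer picks well. Now, back to the long-context question. While their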

technical report shows that it doesn't, on third-party benchmarks like Context Arena, which is a much harder benchmark, it still kind of suffered. Even so, they were able to keep accuracy roughly on par with standard dense models, so I'd guess it would only become obvious in really intensive needle-in-a-haystack use cases. The second part I find fascinating is the specialist distillation training. Usually, it's hard to train one general model to be a genius at everything simultaneously, especially using RL. So, DeepSeek took a divide-and-conquer approach. They trained six distinct specialist models on top of 3.2-Exp, and let each of these specialists run wild with massive RL budgets. Once they became top tier in their respective specialties, they distilled them, which is the process of having each of them generate thousands of high-quality reasoning traces and answers. Then they took this extremely expensive synthetic data and used it to train the main 3.2 model. So you skip the downsides of RL, yet you obtain the quality of RL, so you
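sidestep RL's instability while keeping its quality. A minimal sketch of that collect-and-filter distillation loop might look like this; the specialist models, the verifier, and every name here are hypothetical stand-ins, since the report doesn't publish pipeline code:

```python
def distill_specialists(specialists, prompts_by_domain, verify):
    """Collect verified reasoning traces from RL-trained domain specialists
    into one supervised fine-tuning dataset for the generalist model."""
    sft_data = []
    for domain, specialist in specialists.items():
        for prompt in prompts_by_domain[domain]:
            # The expensive part: each specialist generates a full
            # reasoning trace plus a final answer.
            trace, answer = specialist(prompt)
            # Keep only traces whose final answer checks out, so the
            # generalist trains on RL-quality data without running RL.
            if verify(domain, prompt, answer):
                sft_data.append({"prompt": prompt, "target": trace + answer})
    return sft_data
```

And because the distilled generalist starts from a much stronger base, you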

can apply RL even more effectively later on when it's needed. For the RL algorithm, they are still using GRPO, by the way, but an updated GRPO with a bias fix, so most of the technical flaws you may have read about earlier this year have already been addressed. However, the RL post-training this time around is huge: it took up 10% of the total compute, compared to DeepSeek R1, where it was only around 1%. But where did all this RL data come from? That brings us to the third part: DeepSeek limit-testing how well agentic tasks can be synthesized. As practical as AI agents are, there's a huge bottleneck: it's pretty much impossible to obtain large amounts of high-quality data of AI agents using tools correctly. This kind of precise tool-use data, showing how an AI would successfully navigate a complex software environment, does not really exist. Someone, or at least something, has to create it out of thin air. So they limit-tested this by creating a large-scale pipeline that generates this data for a general tool-use environment. They actually built an agent whose sole
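purpose is building training environments; I'll get to it in a moment. First, since GRPO came up, its core trick fits in a few lines. This is a toy sketch of the group-relative advantage; the `normalize_std` flag marks the division step that published analyses (for example, Dr. GRPO) identify as a bias source, which may be the kind of fix the updated GRPO refers to, though that connection is my assumption:

```python
def group_relative_advantages(rewards, normalize_std=False):
    """Group-relative advantages: each sampled response's reward minus
    the mean reward of its group (all responses to the same prompt)."""
    n = len(rewards)
    mean = sum(rewards) / n
    adv = [r - mean for r in rewards]
    if normalize_std:
        # Original GRPO also divides by the group's reward std; dropping
        # this division is one published bias fix (see Dr. GRPO).
        std = (sum(a * a for a in adv) / n) ** 0.5 or 1.0
        adv = [a / std for a in adv]
    return adv
```

Anyway, back to that builder agent, whose sole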

job is to build environments for other agents to train in. It goes from creating a database sandbox and writing unique Python tools that can interact with it, to inventing a difficult task that requires those tools to solve, all while keeping the task hard to solve but easy to verify. For coding environments, they mined millions of GitHub issues. And instead of just reading static code like most models, they built a setup agent that creates executable RL environments: it installs the packages, resolves dependencies, and actually runs the unit tests. This pretty much gives them an infinite supply of executable training scenarios without needing a single human to label any RL environments. It turns out the synthesized environments are actually hard enough for the models, and when trained in these environments, the model sees a huge jump in performance. They were able to scale this method to create tens of thousands of unique tasks. So, what's special about the 3.2 Special is that it didn't use all of this high-quality training data, but only the reasoning
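subset of it. Before moving on, here's roughly what that issue-to-environment setup step could look like; `clone_repo`, `install_deps`, and `run_tests` are hypothetical stand-ins I made up for illustration, not DeepSeek's actual pipeline:

```python
def build_rl_environment(issue, clone_repo, install_deps, run_tests):
    """Turn one mined GitHub issue into an executable, verifiable RL task."""
    repo = clone_repo(issue["repo"], issue["base_commit"])
    if not install_deps(repo):
        return None                      # unresolvable dependencies: discard
    failing = run_tests(repo)            # tests that fail before the fix...
    if not failing:
        return None                      # ...otherwise there's nothing to verify
    return {"repo": repo,
            "task": issue["description"],  # hard to solve
            "verifier": failing}           # easy to verify: do these pass now?
```

The key property is the same one the video highlights: the task is hard to solve (fix the real issue) but trivially easy to verify (do the previously failing tests pass?). So again, the Special model trains on only the reasoning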

data on top of a more relaxed length penalty to enable longer reasoning. So it was trained more like a specialist, unlike the standard 3.2, which is more of a generalist. With this many cool new proven techniques now shared with everyone, what's exciting to look at next is definitely their conclusion, limitations, and future work section. The first thing that caught my eye is that they kind of hinted the moat might still be compute, admitting that V3.2 lags behind Gemini 3 Pro in world knowledge simply because they didn't burn enough FLOPs during pre-training to inject more knowledge. This definitely implies that they are going to ball very hard for DeepSeek V4. So V3.2 is pretty much them wrapping up what they can do with RL experimentation and scaling; otherwise, they would have called this release V4. Another point about the compute moat is that most of the techniques in this paper ultimately point toward data synthesis: the more compute you have, the more you can synthesize, and the same goes for RL training. But bycloud, there are so many other companies with so much more compute that seemingly hit the wall. How was DeepSeek still able

to stand out on top? Well, after my careful diagnostics, I think this is a case of skill issue, as they would say in Chinese. From the way this blue whale is tackling all these problems, there does not seem to be a wall in sight, just a bunch of S-curves waiting to be stacked up. DeepSeek kind of shows that if the model isn't getting smarter, it's probably not because the architecture is dead: garbage in, garbage out. If such a small-scale company can squeeze out this much performance gain, it really casts the "AI is hitting a wall" narrative in a different light. While compute is still the actual differentiator going forward, they highlighted an important idea: intelligence density. If you remember the table from earlier, in order for V3.2 Special to match the performance of Gemini 3 Pro, it pretty much needs to generate twice as many tokens. Even though 3.2 Special still ends up 10 times cheaper, it has to think longer and harder to arrive at the same conclusion that Gemini reaches with far fewer tokens. So the next hurdle that DeepSeek will work on, as they state, is figuring out how to improve
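token efficiency. As a quick back-of-envelope on those two numbers (twice the tokens, ten times cheaper overall), the implied per-token price gap is roughly 20x; that 20x figure is my own arithmetic, not something stated in the report:

```python
# Back-of-envelope from the video's rough numbers; the ~20x per-token
# figure is inferred, not taken from DeepSeek's report.
gemini_tokens = 1.0        # normalize Gemini 3 Pro's token usage to 1
speciale_tokens = 2.0      # "twice as many tokens"
total_cost_ratio = 1 / 10  # "10 times cheaper overall"

# price-per-token ratio = (total cost ratio) / (token count ratio)
per_token_ratio = total_cost_ratio / (speciale_tokens / gemini_tokens)
print(per_token_ratio)     # 0.05, i.e. ~20x cheaper per token
```

So even at a fraction of the per-token price, burning double the tokens on every hard problem is the inefficiency they want gone. Again, the stated next hurdle is to improve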

their token efficiency. Anyway, it still amazes me that DeepSeek is so capable that they were able to deliver two revolutionary releases within just one year. Like, can you even believe we got this and DeepSeek R1 both in 2025? And yeah, if you liked today's video, definitely go check out my newsletter, where I cover the latest research papers weekly. Thank you guys for watching. A big shout out to Spam Match, Chris Leoo, Degan, Robert, Zaviasa, Marcelo, Ferraria, Proof Any New DX Research Group, Alex, and many others who support me through Patreon or YouTube. Follow me on Twitter if you haven't, and I'll see you in the next one.