xCreate
Let's Run DeepSeek V3.2 - LOCAL AI "More Genius" than GPT-5 & Gemini 3
Channel: xCreate
Date: 2025-12-02
Duration: 16min
Views: 14,088
URL: https://www.youtube.com/watch?v=b6RgBIROK5o

DeepSeek is back with two chart-topping models: V3.2 and their deep-thinking Speciale variant. Join us as we dig deep into the quants, clustering, tool calling, and deep-thinking capability.

TEST SYSTEM

Inferencer App v1.7.3: https://inferencer.com

https://huggingface.co/inferencerlabs/DeepSeek-V3.2-MLX-5.5bit

https://huggingface.co/inferencerlabs/DeepSeek-V3.2-Speciale-MLX-5.5bit

BUY NOW

Mac Studio: https://vtudio.com/a/?a=mac+studio

MacBook Pro: https://vtudio.com/a/?a=macbook+pro

LG C2 42"

Today we're checking out DeepSeek version 3.2, and apparently it's a super genius. When it comes to mathematics, it's meant to be: look at these scores, 96 on AIME and 99.2 on HMMT. Those are the maths ones. So they've got two flavours of DeepSeek version 3.2: they've got 3.2 and they've got Speciale. I love how they come up with these names, and all Speciale really means is deep think. So it's got a deep think mode, and it's saying it's better than GPT-5. It is super smart, but it's a bit verbose in its reasoning. So we're going to be testing it out locally and against the cloud, and we're going to be doing some comparisons here. We're going to be checking out Speciale, and we're going to be doing a 6-bit version and a 5-bit version. Now, the reason we've got a 5-bit version is that it can run on a Mac Studio 512GB completely on its own; for the 6-bit one, you have to use distributed compute.

So I'm going to be connecting my MacBook Pro up with my Mac Studio, and I'll be running it that way. That way, I'll show you what kind of edge you get by going up the extra bit layer, and it's going to be fun. And something to note about this one: they have done away with Jinja templates. They've just given you a bit of Python code, and they say, just use our Python code to evaluate the text and tool calls and all that stuff. That's going to be a fun, interesting experiment; it's been fun getting that working, and there's a rough sketch of the idea below. Also, they say that tool calls don't work with Speciale, because that one's a deep, deep thinker, but you can actually get it to work with tool calls too. So we'll be showing you how it all works.
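On the template point: where most models ship a Jinja chat template, DeepSeek V3.2 reportedly ships Python code that does the prompt encoding and tool-call parsing itself. The snippet below is only a hypothetical illustration of that idea; the function name and the placeholder special tokens are made up and are not DeepSeek's actual encoder.

```python
# Hypothetical sketch only: what "a bit of Python code instead of a Jinja
# template" can look like. <BOS>, <USER>, <ASSISTANT>, <EOS> are placeholder
# special tokens, not DeepSeek's real ones.

def render_prompt(messages: list[dict], system: str = "") -> str:
    """Flatten a chat history into one prompt string for the model."""
    parts = ["<BOS>"]
    if system:
        parts.append(system)
    for m in messages:
        if m["role"] == "user":
            parts.append("<USER>" + m["content"])
        elif m["role"] == "assistant":
            parts.append("<ASSISTANT>" + m["content"] + "<EOS>")
    parts.append("<ASSISTANT>")  # cue the model to generate its reply next
    return "".join(parts)

print(render_prompt([{"role": "user", "content": "hello"}]))
```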

First up, let's launch it. I'm using Inferencer; this is version 1.7.3, and that has DeepSeek version 3.2 and all this kind of cool stuff integrated. My first test is just the baseline: we're launching the Q5 version, which works on a complete Mac Studio all on its own. You can bump up the wired memory limit to get a bigger context window, but I'm just going to use the standard; it's enough for this. Oh, I'm using Speciale, so it's going to take a long time. But just for a preview, we're getting 17.5 tokens a second with Speciale, running on one complete Mac Studio. It's being very, very verbose. Now, in case you are wondering, the prompt is: the surgeon is the boy's father. "I cannot operate on the boy, he's my son." Who is the surgeon to the boy? ChatGPT, as of today, the 2nd of December, says it is the boy's mother. On the cloud, DeepSeek version 3.2 also thinks it's the boy's mother. But if you run the deep thinking version of DeepSeek, it thinks for 63 seconds, and that one actually comes up with the right answer: he is my son. So let's just see what the open-source variant comes up with. And look at it, right there: "The surgeon is the boy's father. Therefore, the surgeon is the boy's father." Look at that.

That's really interesting. It took us 64 seconds to make this generation, but the online version took 63 seconds of thinking and maybe 1 second after that. So the Mac Studio at a Q5 quant produces the exact same speed as the online version at deepseek.com. That's pretty amazing; that's really good. And actually, I've had it slowed down a little bit because I'm using something called ignored tokens: I'm ignoring these two tokens to get it to speak in English. So let's just start this up again. I'm going to go into DeepSeek version 3.2 on a Mac Studio, get rid of tool calls, get rid of ignored tokens, and show you what's going on. I'll just type in hello and get it all loaded. At the moment, it responds back in Chinese. Now, it's probably a bug in my implementation, so I'll fix that very soon if it is. But there are a couple of things you can do. One, you can look at the tokens and see that there was actually a possibility it was going to respond back in English and just say hello. So you can select hello instead, and it will go ahead and regenerate, using hello as the seed token, and it will answer correctly.

Another thing you can do instead, when you ask it, is frame the response: just write hello, and it will use whatever you write there as the first token. Or finally, you can use the token inspector, and you can see that it's that token, 30594. It's Mandarin, and well, I'm not a Mandarin speaker, so we're never going to use that. I'll click minus there, adding it to the ignore list, and regenerate with it on the list. Now it's never going to use the opening Chinese words, which means it's going to answer in English. The good thing about this is you can snipe out any tokens you don't like, so if it goes a bit haywire, you can decide where it's going to go. The bad thing about it is it does run a little bit slower. For example, here it got 20.3 tokens a second, whereas when we ran it without, it was 20.4 tokens a second. So there's a little bit of a hit, and as you fill up the token window, it just compounds. So that's just something to be aware of.
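Mechanically, an "ignore list" like this amounts to masking the banned token IDs out of the next-token distribution before sampling. A minimal sketch of the idea, with the ID 30594 taken from the video and everything else (the logits, the sampler) invented for illustration:

```python
import numpy as np

# Minimal sketch of an "ignore list": ban specific token IDs (e.g. the
# Mandarin token 30594 mentioned above) by forcing their probability to
# zero before sampling. The logits array here is a stand-in.

IGNORED_TOKEN_IDS = {30594}

def sample_next_token(logits: np.ndarray, rng: np.random.Generator) -> int:
    logits = logits.copy()
    for tok in IGNORED_TOKEN_IDS:
        logits[tok] = -np.inf        # exp(-inf) = 0: token can never win
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

The small per-step masking cost is also consistent with the slight tokens-per-second hit mentioned above, which compounds as the context fills.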

But let's just start off with that question we want answered: the surgeon is the boy's father, "I cannot operate on the boy, he's my son." And it thinks it is the boy's mother; it's 100% certain it's the boy's mother. So what I'm going to do now is switch it over to Q6 and use distributed compute. That means it's going to run on my MacBook Pro and Mac Studio at the same time to get the Q6 version, and we're going to see if the Q6 version is any smarter, or what the difference is in the actual tokens. We're getting around 15 tokens a second, and again, it also thinks it's the boy's mother. So let's compare that here: this time it was 99.95% certain, so maybe it's 0.05% smarter going from Q5 to Q6 in this example. Tokens-per-second wise, we're getting around 15 tokens a second, whereas previously we got 18 tokens a second, although we did produce more tokens (76). So they're probably just slightly closer together, even though it's distributing the compute in this situation. Let's see the difference in the actual words. So, "this is a classic lateral thinking puzzle."

So it deviated here. With Q6, it says "the surgeon", and it had a 28% chance of saying "answer", whereas the Q5 version went and said "the answer". So it kind of flipped those over, and this is what happens when you play around with quantizations: the direction just changes ever so slightly, due to a rounding error. On Q6, the highest-probability token was "surgeon", whereas on Q5 the highest-probability token is "answer". However, the second-highest probability on Q6 is "answer", and the second-highest on Q5 is "surgeon", and they've actually got the same sort of percentages: 36% for number one and 28% for number two, on both. So they just flipped over. And if you look at the rest of the order, it's "answer", "most", "key", "straightforward", "boy" on one, and "answer", "most", "key", "straightforward", "boy" on the other. Generally, it's within the same region of words. So Q5 and Q6 are very, very close, but obviously when it comes to more technical aspects, maybe coding or a mathematical equation,

which it's apparently going to be excellent at, maybe that's where the difference will show up: between a compile error and that bit of creativity. That is also what the temperature value is all about: when you use a higher temperature, it's going to be a bit more creative with the words.
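To make the probability talk concrete: the 36%/28% flip and the temperature knob both come down to a softmax over the model's logits. The numbers below are invented for illustration, and a real run has thousands of candidate tokens rather than two.

```python
import numpy as np

# Illustrative numbers only: two candidate first tokens, "answer" and
# "surgeon", whose logits differ by a rounding-error-sized amount, the way
# a Q5-vs-Q6 quantization shift can flip which one comes out on top.

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

q5 = [2.10, 1.85]   # "answer" narrowly ahead
q6 = [1.85, 2.10]   # tiny shift: "surgeon" narrowly ahead

print(softmax(q5))                   # "answer" wins at Q5
print(softmax(q6))                   # "surgeon" wins at Q6
print(softmax(q5, temperature=2.0))  # higher temperature flattens the gap,
                                     # so sampling gets more "creative"
```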

So that is a side-by-side comparison there. We could see that the one that got the answer correct is the deep think version, Speciale, and that was also 16 tokens a second, so that was pretty good. Let's just continue on Q6, since we've got that loaded, and we can always revert back to Q5 and see what kind of differences we get. Next up, I'm going to ask it the trolley question. These are logical puzzles, just to test it, and I've been asking these questions for a while. And the good thing about DeepSeek, I think, is they're honest. They don't seem to benchmark-chase; they don't seem to just pick up the answers. They're more about going with the actual reasoning flow, and they want to get it correct. So you saw that Speciale got it right, but the non-think versions didn't. Now, the traditional puzzle is that they're all alive on both sides of the track. This one has a little twist: the five people have already passed away, so diverting the trolley is going to end a life. It knows it's a variation: pulling the lever in this version, most people would likely not do it. Most people! What would you do? I just want to know. [laughter] Hey, Mr. AI, when you take over the world, what are you going to be doing? "I would not pull the lever." Right, that is a good answer. That's a relief. But let's just see how Q5 performs. Q5 is coming in at 18-plus tokens a second: do not pull the lever. It's only fair we launch up Speciale as well and see what that does. So, starting off, it's coming out at over 18 tokens a second. Comparing to the online version: that one thought for 28 seconds, so let's see if we can do it faster than that. Last time we ran it, it was neck and neck, but it did have the ignored-tokens overhead, so I'm hoping this time round it'll be faster. It's still going at 17.8 tokens a second. "No, I will not pull the lever." And the final answer is: no, do not pull the lever.

That was very, very clever. 17.5 tokens a second, 699 tokens. It did, however, take us 39 seconds; actually very, very close. And of course, we are running the Q5 quant, which gave the right answer in this case. Maybe the online version is more censored, maybe it has a different system prompt, maybe a little less control, but locally you're getting very, very fast verifications. Now, the next thing I want to test is tool calls. Apparently they've overhauled their tool calling system, so let's just see how it does. First, I'm going to enable tool calls: allow tool calls on the server. And I've got a tool system here; the one I've enabled is called get webpage content. You give it a web page, and you can also give it a maximum length as well as a starting position, so it doesn't need to get the entire webpage content; it can just get the first bit, then the next bit, as it pleases. A sketch of what a tool like that might look like is below.
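For reference, here is a hypothetical sketch of such a tool. The parameter names (url, max_length, start_position) are assumptions based on the description above, not Inferencer's actual tool schema.

```python
import urllib.request

# Hypothetical sketch of the "get webpage content" tool described above.
# Parameter names are assumptions, not Inferencer's actual schema. Returning
# a slice lets the model page through a long document over several calls.

def get_webpage_content(url: str, max_length: int = 2000,
                        start_position: int = 0) -> str:
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return text[start_position:start_position + max_length]

# First call grabs the opening chunk; a follow-up with start_position set to
# wherever the last slice ended fetches the next block.
print(get_webpage_content("https://inferencer.com", max_length=500))
```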

So I'm going to ask it, what is inferencer.com? And it's going ahead, looking it up online. It set up a tool call, got the results from the web page, and it's processing them. It got 839 tokens back, and it's also gone ahead to a second page on the website, inferencer.com/about, to get a bit more information, which is good: it's doing a more thorough investigation. So, based on that information, it's telling you exactly what the site is. Now, let's just see how it did. It could have used start position and maximum length; it didn't do that. One interesting thing about the way it generates tool calls, and a challenge to the programmers out there: it has a variable in a sort of XML situation called string, which it sets to either true or false. So you don't know if a value is, for example, a float or an int. You just know that it's a string, meaning it's text, or it's not text, in which case it could be a JSON object, it could be anything, and you're just going to have to figure that out. So when you are defining your tool call functions, just be a bit flexible in the kinds of inputs your LLM can give you.

Why didn't you use max length or start position? "Would you like me to demonstrate how to use those parameters?" So let's see if it knows how to call multiple parameters in a function. It's going ahead this time round, going back to inferencer.com, and I guess it's setting max length to 500 this time. This time around, it only got the first 500 characters. Now, this is what I was saying before: string equals false, or string equals true. They could have put integer, they could have put a type, they could have put something to make it a bit more dynamic for you, but you're going to have to figure out what that 500 means yourself. You know it's not a string, but it could be JSON, it could be a float, it could be a double. It's using a start position of 1,000, and it got the next block. So now it's going ahead calling that one basic tool, and it's showing that it's a master of that tool in this situation. So tool calling seems to work here.
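Given that the arguments only carry a string=true/false flag, the receiving side has to guess concrete types itself. Here is a defensive sketch of that coercion; the wire format is paraphrased from the description above, so treat the details as assumptions.

```python
import json

# When an argument arrives marked string="false", it could be an int, a
# float, a bool, or a whole JSON object; only string="true" is unambiguous.
# json.loads happens to parse all of the former, so try it and fall back.

def coerce_argument(raw: str, is_string: bool):
    if is_string:
        return raw                    # declared text: take it verbatim
    try:
        return json.loads(raw)        # "500" -> 500, "1.5" -> 1.5, etc.
    except json.JSONDecodeError:
        return raw                    # unparseable: hand back the raw text

print(coerce_argument("500", is_string=False))       # int 500
print(coerce_argument('{"a": 1}', is_string=False))  # dict {'a': 1}
print(coerce_argument("500", is_string=True))        # str "500"
```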

Next, we're going to jump over, give tool calling to Speciale, and see how that handles it. Now, officially Speciale doesn't support tool calls, but what's fun is you're going to see it work it out when we launch it. It's going to go ahead and think about making a tool call, and you're going to see all the reasoning behind it. It's going to be fun. So it's going ahead and reasoning; we've got 17.435 tokens a second. It wants to know: "we need to answer what is inferencer.com. We need to output a function call... need to ensure it's the correct format. We'll call get webpage content." So it's thinking about calling that function: first the plan is to invoke it, and it can call it using this syntax. So it knows how to call it; it's grasping the concept of tool calls for the first time. And what's interesting is it says "typically the OpenAI API doesn't allow that", so you can see where it took its inspiration from. It's going to go ahead and make the call, and boom, it's made the tool call all on its own. Very clever. It figured it out.

So now that it's got the function result, it's going to go ahead and answer. If you notice here, the tokens a second has dipped down to 11. This is because of the way it's being calculated: it's including the tool call and the pause in between, not just the actual turn that's been generated. That will probably get fixed very, very soon; the real tokens per second is higher.
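The dip is just an accounting effect, as this toy sketch shows; the sleeps stand in for decoding and for the tool round-trip, and all of the numbers are made up.

```python
import time

# Toy model of the tokens/sec dip around a tool call: dividing by total
# wall time counts the tool round-trip, decode-only timing does not.

def run_tool():
    time.sleep(0.5)                   # pretend web fetch (no tokens)

def decode_stretch():
    time.sleep(0.2)                   # pretend decoding
    return 40                         # tokens produced in this stretch

tokens, decode_s = 0, 0.0
wall_start = time.perf_counter()
for phase in ("think", "tool", "answer"):
    if phase == "tool":
        run_tool()                    # the pause the UI is counting
        continue
    t0 = time.perf_counter()
    tokens += decode_stretch()
    decode_s += time.perf_counter() - t0

wall_s = time.perf_counter() - wall_start
print(tokens / wall_s)    # includes the pause: the dipped, UI-style number
print(tokens / decode_s)  # decode-only: the model's actual speed
```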

So it's come up with an answer. It kind of struggled and thought about it several times before presenting a little short answer here. But that's not what Speciale is made for: Speciale is made for advanced mathematics. So I'm going to give it an impossible puzzle and see if it can figure it out. This one is to do with trying to reverse a public key back into a private key. Apparently, that would take a very, very long time, like lots and lots of years. Let's just see if this deep-thinking, chart-topping, Olympiad-gold-level mathematician can figure it out. "It's believed to be computationally infeasible with current technology." Can you solve this? It said it's infeasible. So I said, no, can you try? I want you to generate a function that will reverse it in the currently most efficient way. It's told me a million times now that it can't do it, that it's impossible; I just wanted to see it try. Come on, Mr. Maths. So it's given me a bunch of disclaimers, and it says this code will not finish. And people out there are actually using code like this to try to guess private keys. Naughty people. Stop that. But apparently it will take, like, forever before even one collision will actually happen.
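A back-of-envelope check on why everyone keeps saying infeasible, assuming a 256-bit private key (the video doesn't state the key size, so that is an assumption): even at a wildly generous guess rate, the expected brute-force time dwarfs the age of the universe.

```python
# Back-of-envelope: brute-forcing a 256-bit private key (key size assumed;
# the video doesn't say). Even at an absurdly generous 10**18 guesses per
# second, the expected search time is astronomically long.

KEYSPACE = 2 ** 256                    # ~1.16e77 possible keys
GUESSES_PER_SECOND = 10 ** 18          # wildly generous assumption
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

expected_years = (KEYSPACE / 2) / GUESSES_PER_SECOND / SECONDS_PER_YEAR
print(f"{expected_years:.2e} years")   # on the order of 1e51 years
```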

So it's giving me some sort of pseudocode. I don't know; I haven't tested it to the best of my mathematical ability. I just wanted to ask it a really hard, impossible question and see how it would handle it. And the good thing is it's very, very coherent running locally. Again, if you've got super hard problems, you can throw them at the Speciale version; it really, really thinks hard about them. It takes a bit longer, but we're still going at 12 tokens a second, so it's very, very smart there. It's coming up with some really advanced code, and it doesn't look like just normal Python code: it's got comments, it's got proper defs for functions. And with tool calls, it was able to make a tool call. The only thing it failed at is that it thought the parent of the son of a father was a mother, unless you put on deep think, which is Speciale. So, DeepSeek have released it, and you've got a bunch of stuff. I'm going to be uploading these models, all three of them actually, because they're very, very good. For the Q6 you're going to need more than a Mac Studio; I've paired it up with a MacBook Pro 128GB using distributed compute. But the Q5s of Speciale and normal DeepSeek 3.2? They're fine. So what do you guys think? I think it's really cool. Things are getting smarter, and who knows what's going to happen in the future. Hope you guys found this video useful and enjoyed the show. Now solve it.
