Theo's Video: https://www.youtube.com/watch?v=bAYZjVAodoo
Cloudflare article: https://blog.cloudflare.com/code-mode/
Links:
Homepage: https://ykilcher.com
Hello. This is in many ways a response video to Theo's (t3.gg) video called "MCP is the wrong abstraction", and it is also a video about the Cloudflare article that video covers, called "Code Mode: the better way to use MCP". I want to show you this article, but also point out something that is potentially missed by both the article and the commentary. So here's the core idea of the article, voiced by Theo: And that seems to be the direction Cloudflare is leaning into here. As they said, they're trying something different: converting the MCP tools into TypeScript APIs, then asking the LLM to write code to call those APIs instead. The results are striking. We found agents are able to handle many more tools, and more complex tools, when the tools are presented as a TypeScript API rather than directly. They don't know why this is, but their theories are that this is perhaps because LLMs have an enormous amount of real-world TypeScript in their training data, but only a small set of contrived examples of tool calls.
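To make the contrast concrete, here is a minimal sketch of the two styles. It is not Cloudflare's generated API: the tool name `get_weather`, the `getWeather` function, its fields, and the stubbed return value are all hypothetical, and the stub is synchronous only to keep the sketch short (a real tool call would be asynchronous).

```typescript
// Hypothetical tool for illustration only; none of these names come from the article.

// Shape of a classic tool call: the model emits JSON, the harness parses it.
interface ToolCall {
  name: string;
  arguments: { [key: string]: unknown };
}

interface WeatherReport {
  city: string;
  tempC: number;
  condition: string;
}

// Code mode: the same tool presented to the model as a typed function.
// Stubbed with fixed data; a real version would call the MCP server.
function getWeather(input: { city: string }): WeatherReport {
  return { city: input.city, tempC: 14, condition: "rain" };
}

// Style 1: what the model emits in classic tool calling.
const emittedByModel = '{"name": "get_weather", "arguments": {"city": "London"}}';

// The harness has to parse the JSON and dispatch it to the right tool.
function dispatch(raw: string): WeatherReport {
  const call = JSON.parse(raw) as ToolCall;
  if (call.name !== "get_weather") {
    throw new Error("unknown tool: " + call.name);
  }
  return getWeather({ city: String(call.arguments["city"]) });
}

// Style 2: what the model writes in code mode. It is ordinary TypeScript
// against a typed API, the kind of code that is common in training data.
function modelWrittenCode(): string {
  const report = getWeather({ city: "London" });
  return report.city + ": " + report.tempC + " C, " + report.condition;
}
```

Both paths end up in the same function. The difference is only in what the model has to produce: a JSON payload in a special format, or a few lines of regular code.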
Very likely. And then point two. So what's the idea here? You just heard it. Let's not do tool calling in the way where we expose the tools to the LLM and the LLM has to say, in a sort of conversational mode, "I will use the tool", then emit some token, then emit some JSON for the tool call itself, and then we give it back the JSON of the response. Rather than that, let's render the tools as an API, then let the LLM write code and execute that code, and part of that code is the actual tool call. This is what's called code mode, and Cloudflare is advocating for it.

Now, in a sense this is really great, because, as Cloudflare hypothesizes, in the training data of LLMs there are very probably tons and tons of code. So, tons of examples of someone
wanting to call some API, stringing calls together, and doing that correctly. Fitting the types and so on, that's probably super common. Whereas the LLM tool calling that we're doing nowadays is very probably a result of post-training. So you do your pre-training, and then on top of that you do the fine-tuning from human feedback, reinforcement learning, and so on, from the specific examples that Mechanical Turk workers create for you. There is a very big difference between the two. Whatever is in pre-training is kind of the ingrained knowledge of the LLM. Whatever comes on top is just sort of tacked on. The article compares that to this: making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It's just not going to be his best work. Here, Shakespeare
writing plays would be the pre-training, the ingrained knowledge, and then you tack on a very brief Mandarin course and ask Shakespeare to write a play in that. He could still write the play, but his understanding of the language is just very rudimentary and clunky, and a lot of things are going to go wrong. I agree with this stance. I agree that, yes, framing tool calls as APIs is probably a lot better, and as Theo also points out, a lot of the providers are at least internally already doing this.

Where I have a problem is when it comes to the second point that the article makes. Let's listen to it: Two. The approach really shines when an agent needs to string together multiple calls. With the traditional approach, the output of each tool call must feed into the LLM's neural network just to be copied over to
the inputs of the next call, wasting time, energy, and tokens. When the LLM can write code, it can skip all that and only read back the final results that it needs. In short, LLMs are better at writing code to call MCP than they are at calling MCP directly. I totally... Okay, this point is a bit more contentious in my opinion.

So the idea is: let's say you want to know what you should wear today. Now the LLM needs to find out two things: where you are, and what the weather is today. So the first tool call is for the location, or memory, or whatnot, just to figure out where you are. It gets back the location, and then it needs to provide that to the weather MCP server, get back the weather, and then reason about what you should wear today. This is a stringing together of tool calls. And the way this works right now is that the results of the first tool
call go back into the context of the LLM. You call it again with that new context, and then it can decide to do the second tool call. The article's point is that this is clunky, because it goes back to the LLM, it wastes tokens, and so on. Wouldn't it be easier if all of this was an API? If you had the API of the location service and the API of the weather service, you could just write getWeather(getLocation()), with the location call as the argument. So you directly say: I want to get the location and then feed that into the weather server, and that's a single execution. You don't have to go back to the LLM in between. So the LLM could tell you a priori: I have these five tools, here is how I want to string them together. You execute all of that, and you only get back the final result.

Now this sounds appealing in
theory. However, I think it is missing a crucial point. In some cases this is going to work, but only if you have extremely deterministic tools, extremely deterministic types, and so on. For example, if the location service always gives back one well-defined type, then you can feed that straight into the weather tool. However, it starts breaking down in most real-world scenarios, because the real world is messy, and that's where I think the problems come in. What I mean by that is: let's say your location service doesn't always give you back a GPS location. Sometimes it gives you back an address. Sometimes it gives you back "oh, behind the house". Sometimes it's an "I don't know". Sometimes it's a
question back to the user, like: "Hey, give me your location. I know yesterday you were in London. I don't have any location data for today. Did you take a train or something?" I'm contriving this, but I hope you can see that the real world is messy and the output of a lot of tool calls is not so well defined. I know people are trying to get around that with JSON mode and all of that. But even if you get back JSON, it is very likely that the next action will depend on the outputs of the last action in a non-deterministic way. What I mean by that is: in a way where you have to reason. What humans do when they string together different tools is exactly that. They call the first tool, then they look at its outputs, and then they decide how
to call the second tool. I think a lot of situations are like this, and therefore it is not really going to be an advantage to string together tool calls by writing the code a priori. Let me ask you this: if you have a very complex task and you make a detailed plan at the beginning, how well does that plan usually turn out? I predict it doesn't turn out well. I predict that somewhere in the middle you'll have to adjust your plan as you go along. And that only works if you actually look at each of the outputs and decide again how you want to react to that output: whether your plan is still valid, and whether and how to pass that output on to the next tools.

That's where I have trouble, because if we simply compose all these API calls together into one block of code and then say, oh, now we only need
to read the output, that is effectively the same as saying: yes, my plans always work out 100% of the time, and nothing is ever actually dependent on the intermediate states in any meaningful, non-deterministic way. So that's a word of caution. Now, I think code mode is still fantastic, and I think it's going to give us a lot of benefits. But it might not give us the benefits in this more meaningful way when the tasks are complex and the tools are doing somewhat more sophisticated things than, you know, looking up the weather.

What I do think is that there is actually an opportunity to get a lot of the benefits, and that has to do with speculative decoding. So what we can actually do is, let's say we compose
all of this code together, and we do multiple tool calls. It's code, there's a loop and so on, but we record all the outputs from the intermediate states. Then we just execute this all at once. And at the end, we don't just give the final output to the LLM, but we give all of the tool calls and all of the intermediate outputs to the LLM and effectively ask it whether any of these look sus. Like: hey, look, these were the intermediate outputs. Does any of this look suspicious? Does any of this look wrong to you? If it doesn't, then you're perfectly fine taking the final output, because the LLM is effectively saying: no, all of this looks perfectly fine; if you had given this to me during the execution, I would have told you to move along with the plan. And if we can
somehow get that, or let the LLM pinpoint where the deviation happens, like "oh, no, this output here" or "I would not have called the tool in this way had I known the outputs", we skip a lot of these intermediate steps. So let's call it speculative tool calling or something like this: you just call a bunch of tools ahead of time, because you estimate that that's what will happen in a lot of cases. Tool calling isn't so expensive. You might use a few extra tokens by doing all of this intermediate validation, but maybe we'll get a lot of speed out of it.

All right. I still invite you to read the article about code mode, because it also talks about loading code into running agents using isolates and all of that. Very, very cool. And I also invite you to watch Theo's video on the topic, because there are a lot of good
comments about it. There's just this one thing that I think is missed by both: in the real world, with complex tools and messy data, it will often be the case that you need to actually look at the intermediate output and decide what you want to do in a way that you couldn't have known at the beginning.

General reminder that MCP isn't magic. MCP doesn't add any capabilities; MCP is simply a standard way of exposing APIs. I think that goes without saying if you've watched this channel for a while or are in the field, but for anyone else who's super hyped about MCP: it's not that big of a deal. It's cool, but it doesn't add anything new. Cool. That's it. Thank you very much, and I'll see you around. Bye-bye.
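The "speculative tool calling" idea from the video could be sketched roughly as follows. This is only one possible reading of it, under loud assumptions: the tool names, the `speculate` function, and the `Reviewer` type are hypothetical, the tools are synchronous stubs, and the reviewer is a hard-coded rule standing in for the LLM that would actually inspect the trace.

```typescript
// Hypothetical sketch: run the whole composed plan at once, record every
// intermediate result, then let a reviewer accept or reject the trace.

interface TraceEntry {
  step: number;
  tool: string;
  input: unknown;
  output: unknown;
}

type ToolSet = { [name: string]: (input: unknown) => unknown };
type Call = (tool: string, input: unknown) => unknown;
// Returns the index of the first suspicious step, or null if all looks fine.
type Reviewer = (trace: TraceEntry[]) => number | null;

interface Outcome {
  accepted: boolean;
  result: unknown;
  redoFrom: number | null;
  trace: TraceEntry[];
}

// Wraps the tools so that every call is appended to the trace.
function makeRecorder(tools: ToolSet, trace: TraceEntry[]): Call {
  return (tool, input) => {
    const fn = tools[tool];
    if (!fn) throw new Error("unknown tool: " + tool);
    const output = fn(input);
    trace.push({ step: trace.length, tool, input, output });
    return output;
  };
}

// The plan written up front: getWeather(getLocation()).
function plan(call: Call): unknown {
  const location = call("getLocation", null);
  return call("getWeather", location);
}

// Execute speculatively, then ask the reviewer about the recorded trace.
function speculate(tools: ToolSet, review: Reviewer): Outcome {
  const trace: TraceEntry[] = [];
  const result = plan(makeRecorder(tools, trace));
  const redoFrom = review(trace);
  if (redoFrom === null) {
    return { accepted: true, result, redoFrom, trace };
  }
  // Rejected: discard the final result and report where to resume from.
  return { accepted: false, result: undefined, redoFrom, trace };
}

// Stand-in for the LLM: flags a location that is not a plain answer.
const stubReviewer: Reviewer = (trace) => {
  for (const entry of trace) {
    const out = entry.output;
    if (entry.tool === "getLocation" &&
        (typeof out !== "string" || out.indexOf("?") !== -1)) {
      return entry.step;
    }
  }
  return null;
};

function getWeatherStub(city: unknown): unknown {
  return { city, tempC: 14, condition: "rain" };
}

// A well-behaved location tool and a messy one that asks a question back.
const cleanTools: ToolSet = {
  getLocation: () => "London",
  getWeather: getWeatherStub,
};
const messyTools: ToolSet = {
  getLocation: () => "Yesterday you were in London. Did you take a train?",
  getWeather: getWeatherStub,
};
```

With `cleanTools`, `speculate` accepts the trace and the final result can be used directly. With `messyTools`, the reviewer points at step 0, and a harness would presumably fall back to step-by-step tool calling from that point, which is the case the video warns about.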