Robert Lange, founding researcher at Sakana AI, joins Tim to discuss *Shinka Evolve* — a framework that combines LLMs with evolutionary algorithms to do open-ended program search. The core claim: systems like AlphaEvolve can optimize solutions to fixed problems, but real scientific progress requires co-evolving the problems themselves.
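The search loop Lange describes in the episode (keep an archive of programs, sample a parent plus a few "inspiration" programs, ask an LLM for a mutated program, evaluate it, add it back to the archive) can be sketched in miniature. This is a hedged toy, not the actual ShinkaEvolve API: `mutate` stands in for the LLM call, and the one-dimensional problem is a stand-in for a real evaluator such as circle packing.

```python
import random

def evolve(evaluate, init_program, mutate, generations=200, seed=0):
    """Toy archive-based evolutionary search in the style discussed here.

    `mutate(parent, inspirations, rng)` stands in for the LLM query that
    proposes a modified program; `evaluate` is the problem's scorer.
    All names are illustrative assumptions, not the ShinkaEvolve API.
    """
    rng = random.Random(seed)
    # Archive entries: (program, fitness, parent_index) -- implicitly a tree,
    # since each child records which node it branched off from.
    archive = [(init_program, evaluate(init_program), None)]
    for _ in range(generations):
        # Fitness-biased parent selection plus a couple of inspirations.
        weights = [max(f, 1e-9) for _, f, _ in archive]
        idx = rng.choices(range(len(archive)), weights=weights)[0]
        parent = archive[idx][0]
        inspirations = [p for p, _, _ in rng.sample(archive, k=min(2, len(archive)))]
        child = mutate(parent, inspirations, rng)  # the LLM call in the real system
        archive.append((child, evaluate(child), idx))
    return max(archive, key=lambda e: e[1])

# Toy stand-in problem: maximize 1 / (1 + (x - 3)^2) via random perturbations.
best_prog, best_fit, _ = evolve(
    evaluate=lambda x: 1.0 / (1.0 + (x - 3.0) ** 2),
    init_program=0.0,
    mutate=lambda p, _insp, rng: p + rng.uniform(-1.0, 1.0),
)
```

The real system replaces the random perturbation with LLM-proposed code edits, runs candidates in parallel, and spreads evaluator feedback back through the archive.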
GTC is coming up, the premier AI conference and a great opportunity to learn about AI. NVIDIA and partners will showcase breakthroughs in physical AI, AI factories, and agentic AI.
So I think a lot of analogies from evolution transfer to scientific research, right? In the sense that we traverse a tree of different ideas or different experiments, and then in a paper we report one path through that tree. >> When we run LLMs autonomously, >> yeah, >> they tend to just... nothing interesting happens. >> But oftentimes innovation for a specific problem might require first inventing a different problem, right? Automatically coming up with this reduction, or this, let's say, recursive nature of problem solving, is something these systems right now don't necessarily have built in intrinsically. Oftentimes it's easier to generate a lot of solutions than to actually hard-verify them, right? >> The reason why I'm not that worried yet about labor market disruption is that I still believe deeply that humans are the source of deep understanding and creativity in the world. If I didn't believe that, I would be very worried. I think it's going to be an amplifier of these latent dimensions humans are great at, right?
>> And I think one of the Rubicon moments is when the new Transformer architecture, or something massive, is discovered by AI and we're all using it. NVIDIA GTC starts Monday in San Jose and it's free to attend virtually online. There's already been a leak this week of something called Nemo Claw, which is an open-source agent platform, and if it's real, it could be one of the bigger announcements this year. So it's definitely worth watching Jensen's keynote for that alone. I'm giving away a DGX Spark. NVIDIA just hiked the price $700. You've probably heard about these memory shortages, right? So, yeah, it's now $4,700, which is very, very expensive. And MV from Hugging Face, by the way, she got one for her birthday and she said she literally cried. So it's a really cool bit of kit. If you register through my link in the description and you attend at least one session, then you're in the draw. This is a massive conference. Physical AI and robotics are going to be the breakout theme, and Jensen does the keynote Monday at 11:00 a.m. Pacific. The link is in the description. Don't miss it.

Robert Lange, it's amazing to have you on MLST. >> Thank you, Tim. It's a pleasure to be back. >> So you're working for Sakana. Tell us about that. >> Sakana AI is a Japanese AI startup working mostly on AI for Japan, and at the same time exploring, let's say, novel or ambitious ideas on the research side. >> It's been around for over a year now. You're one of the founding researchers, right? >> Exactly. So Sakana has been around for almost two years now, one and three quarters I would say, and it's pretty fascinating to look back at the early days and how much the company has changed organizationally. But in spirit we're trying to embrace Ken Stanley's open-endedness idea and explore many different ideas which might not get the resources right now in the ML community more generally. >> And we've got a few interviews coming out with Sakana that we filmed here in Japan. I won't spoil the surprise, but the CEO is David Ha, and you know, there are these epic giants out there, like Clune and Stanley, and David Ha is one of these people. >> David's work has had a lot of influence on my personal PhD, right? He did a lot of fascinating work on hypernetworks and modulation in neural networks, but also on evolutionary computation and evolutionary optimization, and that also shaped my path during the PhD. >> You've released a paper, ShinkaEvolve, and we were just saying that that kind of means "evolve evolve", because in Japanese shinka is evolve. But that's quite common; it's a common thing in Japanese to have these multilingual double namings. Just before we get there: we interviewed the AlphaEvolve team, and I also interviewed Jeremy Berman a few weeks ago, and your paper is very much a more sophisticated version of those, in the sense that it's using language models to generate programs and it's doing an evolutionary approach where we generate the program, we refine the generated program, we have an evaluator, and we do this over several steps, and
your approach does many things that the other ones don't do. Tell me about the paper. >> First off, of course, this was partially inspired by AlphaEvolve. I think it's great work; I know Alex and Matej and I think they're doing incredible science. One thing that's important about all of these evolutionary LLM-driven methods is sample efficiency, right? Many of these systems sample, let's say, a thousand programs for a given task. What we tried to do with ShinkaEvolve was to cut down costs, as well as computation and evaluation time, by introducing a set of technical innovations to this evolutionary search. And we showed that it's possible, with very few program evaluations, to improve upon, for example, the canonical circle packing result that they showed in their paper. More generally speaking, I think we're right now at an inflection point where these evolutionary LLM-driven systems can really revolutionize scientific discovery, and we hope to have made a step towards making this more democratically accessible. The code is open source, and by its sample-efficient nature we hope that many people can interact with the system and make their own scientific discoveries as well. >> Yeah, that's actually a really important point, because I suppose we can use these foundation models. And first of all, isn't it just fascinating to reflect that we have these amazing models out there that we can access? Like GPT-5 and Grok 4. And they are so much better when you get them to refine their solution in several steps. Why is that? I mean, a naive question would be: why aren't they just good out of the box? >> Potentially, with enough random samples, right, it's this monkey typing on the keyboard, they would be able to get there. But in principle, it comes back to the principles of evolution, in the sense that you need to collect a bunch of stepping stones first and then build on top of them to really find innovations down the line. And I think language models with the right evolutionary harness are extremely powerful in terms of scaling up to make discoveries. I think Jeremy's work, as well as the AlphaEvolve paper, as well as work we've done on the Darwin Gödel Machine, for example, shows that this stepping-stone accumulation, plus iterative verification and collecting information and evidence from a real or synthetic evaluator, is really important for that. >> Very cool. And stepping stone collection: this came from Kenneth Stanley. There's a wonderful book, Why Greatness Cannot Be Planned, and he said that it's better to have systems that don't converge. In natural evolution we are just trying all of these different things, and greatness quite often follows a diverse path, which means you have to do things which initially seem quite stupid and then
later on they turn out to be incredibly useful. >> Yeah, we're trying to design algorithms that can allow for a population of slightly weird things, and then we lock in and converge a little bit. So we're still converging; we're still building systems that don't diverge forever. What are we losing? >> One thing I find extremely important, after having done ShinkaEvolve, is this "problem problem", right? With all of these systems so far, maybe except for the AI Scientist, which we can also talk about, the problem is given. You have an evaluator, you have a correctness checker, and you sample programs only on that single problem. But oftentimes innovation for a specific problem might require first inventing a different problem. So for example, in the matrix multiplication result that the AlphaEvolve people show, you can recursively apply the algorithm to larger matrices, so it's actually an important result, right? But automatically coming up with this reduction, or this, let's say, recursive nature of problem solving, is something these systems right now don't necessarily have built in intrinsically. So I think going forward it's going to be really important to not only do open-ended optimization of solutions, but to co-evolve problem and solution together, in order to collect even more diverse stepping stones and to really kick off this open-ended process. Because to me, one of the big life goals or achievements I would want to see is having a process that can run not just for a week or many weeks, but for years even, potentially, collecting ever more diverse, interesting stepping stones. >> Yeah, I spoke to Joel Lehman and he was talking about Knightian uncertainty, which is that machine learning algorithms aren't very good with unknown unknowns. And in a sense the unknown unknowns are these stepping stones that might be useful later. And when we run these algorithms at the moment, it's the same with LLMs and reasoning systems: they're very good when we give them a specific thing. And what you're pointing to is that we might need to invent new, unrelated problems and find solutions which might then be related to what we're trying to do. So that feels like a bit of a catch-22 situation, right? We're saying, you know, circle packing: here's my evaluation function, and I want you to diversify and then converge towards this solution. >> I had the same thought with Genie, by the way, that it gives you exactly what you ask for. So you put a prompt in, like a Swiss lake with boats on the water and mountains on the side, and I was thinking, where are the birds? Oh, I forgot to put birds in the prompt. Right. So how can we meaningfully build systems that actually bring in other unknown things that might be useful? >> I think one inspiration, or thing I would personally want to research, are systems like those outlined in PowerPlay or POET, by Schmidhuber, by Jeff Clune and others, right? So
where there is essentially a set of tasks and a solution generator, and both of them co-evolve in this auto-curriculum, play-like style, right? In POET the natural first application was reinforcement learning, but I think this can now be broadened to science more generally, at least when there's a simulator available for running these evaluations. And by doing such a co-evolution, you always try to max out the capabilities of that generator while expanding this convex hull, reaching potentially even more diverse problems while doing so. >> I know that there's always this lingering thought that even with POET, which was this thing where you had a load of environments and agents, and the environments were complexified so the agents would have a kind of effective curriculum to learn things with increasing complexity... even then, isn't there a kind of design bias in the system? There's some code somewhere which complexifies the environment step by step, and wouldn't that also be designed by the humans, and also just give you exactly what you ask for? >> Ultimately this comes down to the hypothesis that language models can do extrapolation, or interpolation, right? In the sense that even though these things might in the end be designed by humans, there are many unknown unknowns that we humans didn't think of while designing them. So potentially it is possible for an LLM to find a novel discovery simply by us not having thought about it before. >> When we run LLMs autonomously, >> yeah, >> they tend to just... nothing interesting happens. >> Depending on the prompt you give them, they'll go a few steps in that direction and then no new interesting novelty emerges. And I think even if you
wire them with environmental feedback, >> they still seem quite parasitic on their starting conditions. With an LLM, could we build a system which actually adapted to novelty, that could actually discover new things? >> I think it really depends on what you give the LLM as a starting point. For example, in ShinkaEvolve, we saw from time to time that if you give an initial solution program which is already pretty optimized for the problem at hand, you still get stuck in local optima, where not a lot of novelty is introduced. While if you start off from an impoverished solution, there's much more room for diversity. And I think this comes back to what I did before in my research, namely meta-learning. It's this classical trade-off where you can either start out from something very unconstrained, from a very simple solution, and give much more room to the optimization, but this might actually require open-endedness and a long time to find a good solution. Or you start out from something that is already very constrained by inductive biases, let's say, and then you might be much more efficient in terms of convergence, but you don't have this open-ended novelty benefit from it. >> Yes. I suppose where we want to get to is building systems which are not designed by humans. For example, I'm leveraging my deep understanding; LLMs are really good if you understand something deeply. And similarly, we could kick off a ShinkaEvolve run and put a starting solution in there which leverages my understanding. But we want to have AI systems >> that anyone could use. >> So a non-expert could say, I want to solve this problem, and it will solve the problem. >> We should talk about the evolutionary approach. To maintain diversity you had a population of programs, and they were separated into islands. Tell me about that.
>> The way ShinkaEvolve works, similar to AlphaEvolve, is you keep an archive, a database of programs, and then you sample a parent program together with a set of inspiration programs, and you ask an LLM to make an improvement to that program. So to provide code edits, or rewrite an entire program, or potentially even cross over different programs. You query the LLM, you get a program out, and you evaluate it on the problem at hand; for example, increasing the sum of the radii of a bunch of circles in a square. You run this, each time collecting evidence from the evaluator, adding it to the database, and then repeating the process. And you don't do this sequentially; you do this in parallel for many different programs, and each time a program is added, you try to diffuse the knowledge collected by that program across the entire database. So one way to think about this is you have a tree, where each node in the tree represents a program, and then you branch off of it based on the parent nodes. And interestingly, these approaches do tend to scale, but ideally we can make the scaling happen at a faster rate, and this is something we tried in ShinkaEvolve by introducing a bunch of innovations, including model ensembling. So we're not using just Gemini; we're using basically all the frontier model providers, and figuring out a smart way to use each model for a given parent. If you have a certain program, in some situations it might be better to use a GPT model; in other settings it might be better to use a Gemini model. And we introduce an adaptive prioritization scheme that can adapt the evolutionary algorithm on the fly while running it. This also comes back to the naming: ShinkaEvolve, "evolve evolve", kind of means that
this evolutionary algorithm that we apply using LLMs also co-evolves at the same time while we optimize the programs. >> While we're on this circle packing problem: you had this plot showing how it converged, and it seemed to converge quite quickly. We'll show the plot on the screen now. Very quickly the performance jumped up, then it slowly converged, and you said in the paper that it was using, I think, three core innovations. And my thinking was: if you ran this 50 times, would it be the same every single time? And to what extent is it thinking outside the box? You know, Sébastien Bubeck is always posting on Twitter about how GPT-5 has just discovered new things, and there's always the question of, well, is it just searching the internet? Is it just finding things that have been found before and combining them together in a new way? Or could it really think outside the box? >> Yeah, I think this is almost a subjective question. So first off, I don't know all the problems on the internet that try doing circle packing, right? But what I can see in the tree, which we also depict, is that there's, for example, a crossover operation between two programs happening, where different concepts are combined. One important part is, for example, the initialization of the circles. Another is the optimization: basically, a constrained optimization program is executed. And the final part is basically a reheating stage, where noise is added and more performance is squeezed out. And to me, this propagation of information through the tree is really fascinating, where in some sense these stepping stones are actually used in a complementary fashion. And with regards to re-running the program multiple times: of course there's some inherent noise, right? We're using language models, and due to the queueing and device scheduling on the server side, we can't get rid of all the noise. We've
seen that, at least for the general quality of the solution, what is arrived at afterwards, it is possible to re-obtain it, but sometimes with a different program, or most of the time just with some stochasticity. For many problems there's not one single solution that achieves that score; there is a spectrum, or a region, let's say, in program space that performs the same. I think one thing that was very interesting about the circle packing problem, also coming back to the problem problem I discussed initially, was that originally we used a formulation where correctness is checked with a very tiny amount of slack. So the circles could overlap a tiny little bit, and afterwards we reduced the radii so the solution was exact. This didn't change the score by much, so it's still state of the art, but it was essentially a proxy problem. We then re-ran ShinkaEvolve on the exact setting, and we found that it took a little bit longer to obtain the same quality of solution. So I think this already points in the direction of what I discussed in the beginning: sometimes surrogate problems might actually be extremely valuable in making such discoveries, and having an automated, efficient way of designing these surrogate problems might be something really important going forward. >> Yeah, that's absolutely fascinating. It reminds me of support vector machines, where we make the optimization tractable by introducing slack variables, and you can think of that as a kind of surrogate problem. But then I'm thinking, well, would ShinkaEvolve or AlphaEvolve know to introduce a surrogate problem? Because as designers who understand the problem, we can think outside the box and do stuff like that. Presumably, if the fitness function had the constraint that there were no circle intersections, then it wouldn't occur to the algorithm to come up with a surrogate problem. >> Exactly. Yeah. This is a big limitation right now. At this current point in time, we take the problem to be
fixed, and we optimize for that problem. But when you think about humans, we're really good at inventing our own problems, right? Or reformulating the problem so that we can actually work with it. I think a lot of the innovations in, let's say, mathematics come from taking a very different perspective on a problem: taking number theory and applying it to linear algebra, or the other way around. And I think right now these systems are not yet at the point of achieving that level of transfer. >> Yes. And it reminds me, I spoke to Llion about this. You've got this Sudoku-Bench, and a lot of folks watch the Cracking the Cryptic YouTube channel, and that's exactly what they do. They invent new problems based on abstractions that capture the essence, or aspects, of the problem you're solving, and then they do something similar to ShinkaEvolve. They do this kind of evolution where they take different solutions, combine the best aspects of both, and forge a divergent path to a new solution. And that seems to be the essence of what we need to do. >> Yeah, for sure. There is some work by Jeff Clune, Shengran Hu, and Cong Lu on Automated Capability Discovery. There they look at language models that generate tasks, but in a, let's say, unstructured way, in the sense that it's not done in order to enable the solution to one target problem. And I think making these connections is going to be very fruitful down the line. >> Very cool. Now, the other thing, and we'll show the graph on the screen, the evolutionary graph for the circle packing problem: I was looking at that, and first of all it looked incredibly parsimonious, which is good. It looked like it had found an optimal path to the solution very quickly. And I was thinking, well, maybe there's some natural pattern there, something we could use in the abstract to guide the evolution in the future. But the other thing I'm thinking about is that right now the problem with machine learning is that we don't really have semantics baked in. What we're doing is we have a verifier, we're
looking at the rewards, and we're doing pattern exploration, taking steps towards the target. And I love mechanistic forms of reasoning, where we actually know something about what the program components mean. And the reason this is important is when we're merging together the best-performing programs from two different islands: that's a kind of first-order interaction, and it might not make sense to merge them. It's wonderful that you can give LLMs any pair of programs and they will find a way to merge them. But wouldn't a more principled way be: there are some kind of semantic primitives here, and we know they fit together? There's this Lego analogy, where we're building up based on principles rather than forging a path based on performance. >> Yeah, that's a good point. One thing we do in ShinkaEvolve as well is we keep essentially a scratchpad. Each program is summarized, and from the program summaries we keep a set of global insights, let's say, that were shared or extracted from these programs. And based off of this scratchpad we construct meta-recommendations that then become part of the system prompt. That way you can try to semantically grasp some of the discoveries. But a general problem, which is again task dependent, is that thereby you diffuse that knowledge across the tree, and sometimes you want things to be much more isolated. It's always a trade-off, where you somehow have to find, for your problem, the right position on the spectrum: how much knowledge diffusion do you want, and how many hard islands of programs do you want? We're trying to make steps towards automatically adjusting this in an optimal way, but again, it's very problem sensitive. And then another point you're already going towards is Jeremy's solution to ARC-AGI: doing solution evolution in the instruction space instead of the program space. I do think that this is something important, and like I said, with the construction of this meta scratchpad we're trying to do both at the same time. Again, it's problem dependent. I played around a little bit with ARC-AGI-1 and ARC-AGI-2, and I think on ARC-AGI-1 the transform-program direction is actually quite effective. Like Jeremy said, it's deterministic, and it's easier to get a clear signal to improve on during your evolution process. While on others, like ARC-AGI-2, this whole semantic evolution seems to be more efficient. So ideally we can get a system that can automatically decide whether it wants to take a programmatic approach, in settings where that's feasible and easier to bootstrap off, or the semantic approach of evolving instructions, or LLM-driven input-output mappings. >> Yeah, it's so interesting, because
a symbolic AI person would say, "Oh, I don't like connectionism, because the only semantics in connectionism is this notion of similarity; it doesn't really understand things." So they would say, "Well, just start with an entity relationship graph and then build up using composition and first principles." That doesn't work, right? We're using neural networks because they're incredibly flexible and they understand a lot of things about the world, but they don't have the kind of constraints that we want. So what we do is we use these tricks. Jeremy evolved program descriptions. On your program selection, you had a semantic novelty detection >> using embedding-based similarity, yeah. >> Yes, you had a kind of self-similarity matrix based on the cosine similarity, and indeed you've got this meta scratchpad. So what we're seeing is this fascinating spectrum of possibilities where, still using neural networks, you can imbue semantics using all of these different tricks, but they all come with trade-offs. >> Yeah, for sure. I think it's kind of interesting. We've had a long period of computer science where algorithms were designed by humans, right? Then we had this Andrej Karpathy Software 2.0 paradigm, where we trained neural networks that then performed a certain function. And now we're at this point where we're using LLMs to design algorithms, or solutions more generally. And even though large frontier language models are, let's say, extreme black boxes, where it's very hard to get a full mechanistic understanding of them, the outputs can be, right? The programs, the instructions, and so on. So I think it opens up a very new paradigm of doing research, or basically doing anything, if you think about it. But I think we're just at the starting point of figuring out the user interface for that. >> So the other innovation in the paper was using UCB, the upper confidence bound. It comes from the multi-armed bandit literature, which is this problem where you can pull these levers, and at the beginning you
don't know which levers to pull, and over time you reduce your uncertainty and can pull the ones that work, but there's this exploration-exploitation dilemma. >> And you've implemented that for figuring out which LLM, so it could be Gemini, it could be Grok or something, to figure out which one to use. >> We're using a model ensemble, right, to propose program mutations. And intuitively one could say the best frontier model on SWE-bench is always the best mutation proposal model, but in practice that's actually not always the case. And in general it's extremely hard in this evolutionary setting to assign clear credit to a single model. For example, one improvement is implemented by GPT-5 and then the next one is implemented by Sonnet 4.5, and it's unclear whether the performance gain you get from the second mutation actually originated from GPT-5 collecting the first stepping stone, or from Sonnet 4.5. So instead of uniformly sampling models, what we do is implement this bandit-based approach where each model is basically one arm of a bandit, and then we look at how often that model improved the performance of a parent node by creating a mutation. We then adjust this posterior probability so as to first explore all arms once, and then change over the course of time to prefer models that yielded improvements before for similar nodes. >> The great thing about using a UCB-like algorithm is that it actually has a theoretical regret bound, which means it's only logarithmically worse than the optimal switching path, if that makes sense. But if I understand correctly, UCB is based on a sort of global rating, a mean score for every single LLM. And I think what we want is more of a contextual switching decision, which means we know that for this particular program, Gemini is better. And do I understand correctly that at the moment it might converge to a single frontier model, and then in a nuanced situation we might still get the wrong model? >> So in general there is some amount of probability allocated to all models, right? It's not like it can just peak on one model and then you stop using the others; there's still a chance for open-endedness and serendipity, if you will. And in general, for the problems we consider, we haven't seen that one model clearly dominates all the others. We've seen that it really depends on the course of this evolutionary process which model is better, and UCB, or the bandit approach that we take, dynamically adjusts this in an efficient way. >> And would it be possible in the future to use an LLM to make this judgment? >> Potentially. In some sense, in that case, you again think of the LLM as a surrogate model, right? In some sense
you can think of a Gaussian process as a surrogate regression model, and there has been some work showing that language models can act as surrogate models. And the real question to me is how you represent the information to the LLM, in the sense that if you use the raw programs and their fitness evaluations, you quickly run out of context. So you need some amount of compression in order to present the information the right way to the LLM to do this prioritization of the models. >> I hadn't appreciated how long the context is. I mean, I was thinking, you know, could we use an 8-billion-parameter Llama model and do active fine-tuning? So we're saying, I just ran this program on Grok and it got this score. And then over time, for the given run of this evolution, it will kind of know that Grok is good at these problems. >> Yeah, potentially. I'm not sure how efficient this fine-tuning is if we're only evaluating like 150 programs. Um
but in principle one could imagine it. I think on the engineering side it's not necessarily the prettiest thing to do, but yeah, it could in fact happen. But I think for all of these things, we started out with, let's say, the most intuitive algorithmic component that we had, and UCB was one that really did the job here. And yeah, much credit to Cetin, who introduced this to Shinka. >> So let's talk about the diffs and the mutations. So we generate programs, and I think you folks were inspired a bit by AlphaEvolve. It actually had this gating where you kind of gate the part of the code which is mutable. Tell me about all of that. >> A program is just, let's say, a long string, right? And in order to make sure that certain parts which are essential to the evaluation, for example the imports and so on, are not deleted by the LLM mutations, there are so-called markers which basically state which parts of the code are mutable and evolvable. And it's easy to
programmatically enforce which parts are actually mutable when you get a diff proposal. The protected parts will not be changed; only the rest of the code snippet will be. We implement a type of rejection sampling with a reflection approach, where if an LLM by chance tries to mutate a protected part, the proposal is rejected and you resample a new one. And thereby you can somewhat mitigate certain security or safety problems and get robust mutations. One of the bigger questions, I think, is how you can turn this from a single-file mutation setup into a multi-file mutation setup, so working on entire code bases. In principle you can represent many code bases in a single file, but the hierarchical structure might actually be useful, and there are some ideas from, let's say, Aider, this coding tool, where you construct a repository map and have some
level of abstraction, but they also come with positive and negative trade-offs, basically. >> I love Aider, by the way. It feels like in the future, the code generation systems will actually resemble Shinka Evolve. If you think about it, it'll be using some kind of git repo. Maybe Cursor already does this, because in Cursor you can restore previous checkpoints, but it could be exploring different branches and merging checkpoints together, and, you know, obviously you just say in natural language what you want to do. But we didn't talk about mutation, by the way. We just spoke about diffs, and there's also an option to do full file rewrites, but there's also this notion of crossover. How does that work? >> A small innovation on top of AlphaEvolve, where I believe they only use diff-based mutations, is that here we wanted more flexibility to entirely rewrite the program, right, to come up with a completely different stepping stone, if you will. So again, there you can make
parts of the code mutable, but instead of proposing, let's say, a patch to change certain parts of it, we essentially rewrite the entire program. And this sometimes is helpful; it's not always a clear benefit, but it allows you to get more diversity into the search. So this is one type of mutation next to the diff patch-based approach, and the other one is a crossover mutation, where we sample not only a single parent program but two different ones, and we ask the system to make a complementary improvement. Again, on some problems this is really helpful and on others it's not, but in general we found that having diversity in terms of operators is also helpful in discovering new things. And I wanted to follow up on the point you made before about this being a new paradigm. I think so too; I'm really convinced. I think right now we're at the beginning, where we still think a lot about this chat assistant interface as the way how we
interact with LLMs, but it's most of the time inherently single-threaded, right? We're sitting in front of the computer, interacting with the chat, seeing changes as they occur in the editor, accepting them, and so on. But I think this is also just a stepping stone towards a more, let's say, distributed way of thinking about research, optimization, and so on. So on one hand I like to think of vibe coding and vibe chatting, and on the other hand we have vibe optimization and vibe researching, where my ideal future scenario is one in which you as a researcher co-work during the day with a system like Shinka or the AI Scientist. You steer the ship like a shepherd in some sense, and then during the night you press play and go to bed, and in the background you have multiple experiments running, new ones automatically being proposed by LLMs, evidence being accumulated. And then in the morning you come back and you have a multi-threaded sort of
system running in parallel, and you're more like the shepherd of the ship than the person actually executing experiments and analyzing. Or you're still analyzing, but you're not executing; that's happening by the system itself. >> Yes. And increasingly this might be semi-supervised or even proactive. I mean, you know, there's that new product from OpenAI where it knows what you're interested in, and while you sleep it goes off and, you know, finds your Pulse. >> That's right. >> And, you know, we're in the situation now where we're reasonably technical people. So, you know, MATLAB and Mathematica are supremely powerful, but you need to know how to express problems precisely. Whereas I can imagine a future where we express problems just in natural language, or maybe, just based on our interactions with language models, the platform knows what we're interested in and can go find things on our behalf, because this is about democratizing this technology for people who perhaps don't know exactly what they're looking for. >> I think one of the bigger problems there is this verification aspect to
it, right? In the sense that oftentimes it's easier to generate a lot of solutions than to actually hard-verify them. Language models are capable of doing soft verification, looking at code and latently running, like, a stack trace of execution, right? But it's not exact. And I think these notions of reward hacking, of not making real discoveries but shortcutting them, are ones where we need to put more time and effort in to figure out how to make sure this actually moves in the right direction. And I would hope that language models at some point can do this efficiently themselves, either implementing it in code or doing it latently. But this is also part of the problem, right? It's not only coming up with the problem but also with the automatic verification at the same time. >> Yeah. Isn't it a tantalizing idea that there are natural patterns in the world, and the building blocks to construct novel solutions are already there,
right? And maybe they're there for a reason. Maybe they just reflect natural regularities in the universe. Because there's always this question of, you know, intelligence is about adapting to novelty. The world is always changing, and the world tomorrow will have things that we can't explain with our knowledge today, but we do have abstract knowledge that could be easily recombined to explain the future, and LLMs might already have those building blocks. >> Yeah, for sure. I think in some sense, the more you think about Occam's razor applying to everything in our world, be it language or be it science, it's pretty interesting, because these artifacts now go into our language models of today, and potentially some amount of this is being captured. I think, though, it might also be an inductive bias that leads to a local optimum at some point, right? And you need more complexity. But I do think that with systems that do this evolutionary-mutation-style approach, you might still push the system out of these local optima eventually.
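As an aside on the "hard verification" point from a moment ago: one concrete form is a programmatic checker that exactly validates a candidate rather than judging it latently. A minimal sketch for the circle-packing task discussed elsewhere in the conversation, assuming candidates are (x, y, r) circles to be packed in a unit square with the sum of radii as the score (the exact benchmark setup may differ):

```python
from math import hypot

def verify_packing(circles, tol=1e-9):
    """Hard-verify a circle packing: every circle inside the unit square,
    no two circles overlapping. Returns (is_valid, sum_of_radii)."""
    for (x, y, r) in circles:
        if r <= 0:
            return False, 0.0
        # containment check against the unit square [0, 1] x [0, 1]
        if x - r < -tol or y - r < -tol or x + r > 1 + tol or y + r > 1 + tol:
            return False, 0.0
    for i, (x1, y1, r1) in enumerate(circles):
        for (x2, y2, r2) in circles[i + 1:]:
            # pairwise non-overlap: center distance must be >= sum of radii
            if hypot(x1 - x2, y1 - y2) < r1 + r2 - tol:
                return False, 0.0
    return True, sum(r for (_, _, r) in circles)

# Example: four tangent circles of radius 0.25 centered in the quadrants
quad = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25),
        (0.25, 0.75, 0.25), (0.75, 0.75, 0.25)]
ok, score = verify_packing(quad)  # valid, sum of radii = 1.0
```

A checker like this is cheap to run on every proposed mutation, which is exactly why easily verifiable domains are where these evolutionary systems currently shine.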
>> Yes. And then there's also the notion of the importance of adaptivity. This is what Chollet says intelligence is. And since we've had models that actually do adaptivity at inference time, things like test-time active fine-tuning and the reasoning models and so on, they started getting non-trivial performance on ARC. Now, it's very, very expensive to have huge foundation models adapting; it's just a practical concern why we haven't done that yet. But what we can do is build systems like Shinka Evolve that leverage the best of both worlds. They leverage frozen foundation models, but they give you adaptivity. And the purpose of adaptivity is to respond to novelty, to create new building blocks, to synthesize new building blocks in this principled tree-like structure that allows us to adapt to novelty. So we are having our cake and eating it. >> I have to say I found it very interesting that Jeremy, basically, in your podcast, when you asked him about Shinka, was saying that he doesn't believe there are a lot of percentage points to be gained
by using a system like Shinka, but you can make it much more efficient. That was sort of the gist of his answer. And to me, once you have made it much more efficient, you can scale it up again, right? So if you essentially have a cheaper system that can generate many more solutions, I would expect that by the nature of open-endedness you might get some amount of improvement out of it. Right now I don't have any evidence for it; I would love to collect that evidence. It's again the magic of open-endedness that comes into play: as long as these training examples of ARC-AGI give you a good signal for a final test submission, you should be able to progress. >> Yes. And that is a great segue, because certainly on the circle packing problem, it was so sample-efficient that in less than 200 interactions with an LLM, you converge on the solution. But I was thinking, great, but it's still quite dependent on the starting conditions. You know, we talk about this design bias and so on, so what we put in is very important. But now what we could do is scale out. So we
could run this a thousand times, and we could have another process which prompts, generates, breeds the starting conditions, because every time we run Shinka Evolve, what it's doing is searching parts of the epistemic tree. And what would happen if we just scaled that out massively? >> We haven't tried, but you could even start with an empty program, right, which would be basically the same, and then you would branch off of that empty program, I would expect. Yeah, we haven't done this simply for cost and time reasons. But I do think in many ways this is the question that will push us towards this true open-ended vision of running a system for a month or so, really trying to squeeze this out. Yeah, I'm not sure if we're entirely there yet, but I will do my best that we will get there. >> And the reason this is interesting is we know as a practical matter that we can't start with nothing. >> Mhm. >> If we were just starting from the most primitive building blocks, the search space would be huge and there'd be no learning signal. So we know we need to start a little way up the stack. But we can massively parallelize that. So yeah,
let's say we have a thousand different instantiations of Shinka Evolve. It doesn't have to be embarrassingly parallel; we could still have some sharing. During their execution, we could still have a little bit of crossover, and maybe then we could run all the Shinka Evolve instantiations in a similar kind of meta-evolution loop. And my suspicion is, contra Jeremy, I agree with you: we know there are diverse stepping stones out there that could dramatically improve many of these solutions. We simply haven't scaled it up yet. >> Yeah. I also believe that using a system like Shinka Evolve could automatically detect whether an induction-based optimization approach or a transduction-based approach is actually the right thing to do for a given problem, and sometimes potentially it's even a mixture, right? There are some things you can probably articulate more easily in Python than you can in language, so I would be really interested in exploring that.
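The massively parallel setup Tim describes, many evolution runs with occasional sharing between them, echoes the classic island model from evolutionary computation: independent populations that periodically migrate their best members. A toy sketch, using a number as a stand-in for a program and random jitter as a stand-in for LLM-proposed mutations (all names here are illustrative, not Shinka Evolve's actual API):

```python
import random

def mutate(program):
    # stand-in for an LLM-proposed mutation: jitter a numeric "program"
    return program + random.gauss(0, 0.1)

def fitness(program):
    # toy objective: maximize closeness to an unknown optimum at 1.0
    return -abs(program - 1.0)

def evolve_islands(n_islands=4, pop_size=8, generations=50, migrate_every=10, seed=0):
    random.seed(seed)
    islands = [[random.uniform(-2, 2) for _ in range(pop_size)]
               for _ in range(n_islands)]
    for gen in range(generations):
        for isl in islands:
            # mutate every member, keep the fitter of parent and child
            for i, parent in enumerate(isl):
                child = mutate(parent)
                if fitness(child) > fitness(parent):
                    isl[i] = child
        if gen % migrate_every == 0:
            # migration: each island's best replaces a random member of the next
            bests = [max(isl, key=fitness) for isl in islands]
            for k, best in enumerate(bests):
                target = islands[(k + 1) % len(islands)]
                target[random.randrange(len(target))] = best
    return max((p for isl in islands for p in isl), key=fitness)

best = evolve_islands()  # converges near the optimum at 1.0
```

The migration step is the "little bit of crossover" between otherwise independent runs: rare enough that islands explore different stepping stones, frequent enough that a good discovery eventually spreads.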
>> Yeah, I mean, you said earlier about Jeff Clune's, what was Jeff Clune's paper, the thing that generates problems? >> Needed capability discovery. >> I did speak to him about this. Something like that could be fascinating as well, you know, where we're also generating the problems and solutions and then kind of feeding them back in. But I think the way this will land commercially is there'll be a new type of GPT where everyone is solving different types of problems, and the system, it'll be like a kind of Shinka Evolve but a massively distributed version, where mathematicians are using the platform over here to solve this problem, and it will see commonalities and kind of link them together, because you need to leverage human creativity in this process as well, I think. >> A big challenge going forward is going to be how we change our incentive system for this to actually scale. I think, for example, some amount of economy will be needed, or some amount of mechanism design, in order to make sure that everyone is still happy to engage in it. So maybe we're going to have many more leaderboards for whatever is numerically
scorable. And I think this will be really interesting to see: how compute, these automated agents, and human shepherding and steering will ultimately change and revolutionize science, and I guess society more generally. >> And Rob, looking at the future, we've got a load of people in San Francisco who want to scale language models, and they are adding in implicit forms of adaptivity and composition, so they're building controllers and doing reinforcement learning with verifiable feedback and so on. I think that you subscribe to the slightly different idea that we need to be far more open-ended and use evolutionary algorithms and so on. But do you think that they are on a path to nowhere? Do you think they might change tack? I mean, where is this going? >> So I actually think that these things can be complementary, right? In the sense that, let's say, you fine-tune a model to be a circle packing expert. I do believe that mixing in different RL fine-tuned models into sort of the
ensemble of models, and then having a good way to adaptively select which model to use, is not a bad idea. So, to me, I just fully subscribe to this philosophy of open-endedness, and reading Ken's and Joel's book was really a fundamental moment in my life, and I want to see how far we can push this. And I think we're not yet at convergence, either in the capabilities of the models, or in the way we scaffold around them, or in the way we humans interface with them. So to me there are really these three points: model capability, model scaffolding, and the user interface. And I think we still have a lot to push on all three angles. >> Beautiful. The only thing we didn't talk about: we spoke about the circle packing problem, but you also applied it to a few other things. Can you tell us about that? >> So one thing we did was we used a framework called ADAS, Automated Design of Agentic Systems,
where basically, instead of manually writing an agent scaffold, you use an LLM to write agent scaffolds for a specific task. So what we did is we looked at mathematics tasks, AIME, and we used Shinka to evolve basically an agent, right? So using an agent to evolve an agent. And we found that there we could dramatically improve the performance of very cheap models like GPT-4.1 nano, but the agent scaffold was also able to either generalize to other language models or to different years of AIME. That was one application. One important other application was to ALE-Bench. ALE-Bench is basically work done by other folks at Sakana, including Yuki, who's also part of the paper, which considers heuristic programming contests previously run by AtCoder, the famous Japanese competitive programming organization. And we sort of showed that
Shinka can also work very well as a co-scientist. So basically we took initial solutions obtained by an ALE-Agent that was previously designed, and then we optimized on top of these initial solutions with Shinka, and showed that on one of these programming tasks, if the combination of this agent and Shinka had competed in the challenge, it would have ranked second place, basically. So I think there's some evidence that Shinka can work as a co-scientist, and not only for LLM agents but potentially even for humans, like we discussed before. And then the final application that we looked at was designing mixture-of-experts load balancing loss functions. So at Sakana we've done some previous work called DiscoPOP, I think we discussed it during the last podcast we did, where we used LLMs to design objective functions. Back then we did it for preference optimization and post-training, and here we did it for load balancing of mixtures of experts. Also there we found
that within, I think, even only 20 generations, we were able to explore not only a single objective function but, let's say, a convex hull where there are different trade-offs between performance and load balancing and so on. So I think this is another application of Shinka where it's not only about finding the best solution but essentially illuminating a program space where there are always potential trade-offs, for example between runtime and the quality of the circle packing. Having a system that can explore all of these is important as well. >> I'm very excited to see you apply this to the ARC challenge. What are your thoughts about that? >> I still need to collect results, so I don't want to make any hard claims before having done this. But I would hope that there is some chance of, for sure, improving the cost of these systems, and then potentially even the performance. But yeah, to be seen. >> Oh, so you've done some experiments.
Exciting news is potentially coming. >> I've started looking into it. >> Yeah. I mean, what are your thoughts in general about ARC, though? >> I think it's great. I think it's really important and fills an important gap. I do really deeply respect François, and I read the paper when it first came out, and no one thought we'd actually be able to get numbers above 10%, right? And it's also pretty fascinating on a societal level how far we've come since then. Sometimes, while you're deep in, let's say, battle mode or work mode, you kind of forget where you were one year ago, and then just looking back, it's pretty amazing how far we've come since. >> It's insane. I mean, I think François doesn't get enough credit, because it's such a good benchmark, and not necessarily for the reasons people think, because François is always saying that we need a benchmark which is easy for humans and hard for AIs, and in a sense that's not quite the case. I
said when ARC v2 came out that it's actually very difficult for humans. You know, there was one task where Duggar was stumped for about 15 minutes; there were three of us looking at it. And it's one of those things where, depending on your perspective, you might get it straight away or you might not. So there's that criticism, and people have said that ARC v3 is even harder. But I think that's rather missing the point. I think he's saying that with a lot of these competitive coding problems, the data set is contaminated. These are problems that have been solved before, in part or in whole, which means when you look at the epistemic tree, many of the building blocks for solving them are very high up in the tree. He's looking at problems where there is very little data set contamination, and they need to be solved from very abstract building blocks. So you're starting much lower down the tree, and you're synthesizing a model by composing together very abstract building blocks, which is the essence of intelligence. And I think for that reason ARC is really pushing us to build adaptive systems, which we could say are
intelligent. >> Yeah, I agree. I mean, in many ways I'm really looking forward to the next few years, seeing how far we can push this and how much generalization we get afterwards. Because I believe, when you look at the more recent models, they're getting much better at the transduction-style code evolution or outputting for ARC than they are on the induction-based level. And I think this might already be a small sign of some amount of overtraining on ARC-AGI-1, at least. I do believe there are some aspects of work which will be automated before we come to fully automated science and the type of work I'm doing. But I could imagine that certain parts of the dimensions I deal with every day are for sure going to be hit by AI. And then the question is, are there going to be new dimensions opened up that we as humans will fill, right? And what I said before about shepherding and so on, I really hope that's the way
forward, right? In the sense that humans are the ones steering the ship while being massively amplified in their productivity. >> Right now, I am not really seeing the kind of job market disruption that was being predicted. I know from personal experience that, in a sense, it's made it very difficult to hire people. You know, script writers use ChatGPT; I can spot it instantly. And writers and copy editors are actually in more demand than they were before, fixing all of the crap that has been generated with ChatGPT. And there's the cloud analogy as well. So, you know, IT system administrators who were earning, you know, £60,000 a year in the UK rebranded as cloud DevOps engineers and more than doubled their pay. People are very adaptive. They see new trends, new bandwagons, and they just adapt and add value on top, and that has been the trend for a very long time. Do you think that AI is going
to be so transformative that it will transcend people's ability to adapt? >> I think it's just a question of speed. I was talking about cultural evolution and technological evolution, and it seems like we humans need more adaptation and more time to get used to the technology, to carve out these niches where we can fill in, and it's complementary, right? So first off, I think we're still not at the ceiling of the technological progression, so maybe in a couple of years we will need less of the slop editing, like you said. But I do think we need some more time to adapt to the different modalities of interacting with these systems. I think everyone can interact with a chat assistant, but this is the most naive form of interacting with AI agents, for example. So yeah, I think we need to get the pacing of all of this right, and we need to do much more exploration in human-machine interfaces,
UI/UX design, and how to make sure that humans feel fulfilled during this experience. >> This is particularly relevant because, you know, you were behind the AI Scientist paper, and there's now version two of that. Allow me to be a tiny bit skeptical. You know, we were talking about when we evolve systems to do a particular thing, and at the moment it feels like, as good as they are, they are still quite parasitic on the instructions and intentions of the human supervisor. So it's very much an exchange between the humans and the system. Because the implication is that in the future we might have systems that are so autonomous and so open-ended, and can figure out valuable things to research, that humans wouldn't be needed anymore. >> And the reason why I'm not that worried yet about labor market disruption is I still believe deeply that humans are the source of deep understanding and creativity in the world. If I didn't
believe that, I would be very worried. >> I agree. To me, the AI Scientist, V1 and now V2, are glimpses into a potential transformation. But I fully agree that in order to make really big scientific breakthroughs, multiple of them every day or whatever, you still need humans in the loop to seed or guide the direction in which to explore, or to verify, check, and actually transfer these insights. So I think it's not going to be that all ML PhDs will be unemployed; it's more going to be a co-evolution of humans with this technology. And potentially, in an ideal future for me, it will allow humans to focus on what they're really, really great at. So I think it's going to be an amplifier of these latent dimensions humans are great at. I think something that's critical is that we as humans try to interact with these systems as early as possible in order to
actually have influence and ownership over this development process. It's ultimately collective intelligence that will shape all of these systems together. >> And do you think these systems can become incredibly sophisticated, such that they are, you know, somewhat detached from humans? >> Well, I mean, with the AI Scientist v2, we released that one paper that we submitted to an ICLR workshop was able to pass the acceptance threshold before meta-review. So I do think, at least for workshop-level contributions, we're getting there. While not every submission an AI Scientist produces reaches that threshold, we're at the point where we can even talk about noisy review processes, and this actually being something where, as long as you have a large budget, you might get something out of it. I think going forward, for the bigger innovations and so on, for now
you still need humans. But we're sort of at the GPT-1 moment of making this a reality, and potentially in 10 years this is going to look very, very different, once the infrastructure for it has been built up. So there are places like Periodic Labs, which are now building real physical labs with robotic systems to automatically execute experiments. This will take some time, but it is imaginable, for sure, that as we do RL on these types of systems, and we actually also account for negative results and for actual hypothesis testing, getting these systems to be really good hypothesis testers with verifiers in the loop, we might be able to unlock many more capabilities. >> Yeah, I mean, I suppose I don't want to sound like a Luddite. So it's entirely possible that I just don't have the imagination to think about the future. It is possible that in the future these systems might
understand very deeply and be creative. You know, I think right now the problem is they only understand things a few levels down in the epistemic tree. So they can do some surface-level recombination, and they can discover new things in the basin of things that have already been discovered. But we understand things very deep down in the epistemic tree, which means our cone of creative potential is much wider. It's possible that that gap might be closed. What would happen then? >> The way I kind of think about the scientific process is as a tree search, ultimately. So I think a lot of analogies from evolution transfer to scientific research, in the sense that we traverse a tree of different ideas or different experiments, and then in the paper we report one path through that tree. And as I kind of alluded to before, we need much more like full-tree data sets for training these LLM systems to actually learn how to do this exploration and this foraging, basically. At the same
time, I feel like evolution will also take place on the cultural level, for us, right? We will get better at steering the ship, and I can imagine that in a future world the way we do research will be completely different. I'm pretty sure that right now already 99% of machine learning research is done with some AI assistance: think ChatGPT brainstorming, Cursor coding, Claude Code, etc. In the long run we're going to move on that spectrum from research with AI closer to research by AI, and then more high-level orchestration and overseeing by humans. >> There's also the notion of how intrinsically coupled to humans the value function is. So one school of thought is that AI will develop a mind of its own, and it will, you know, basically transcend humanity, and it will just have agency which is not parasitic on ours. I personally don't subscribe to that view. But the other view is that it
is, like, let's say the AI Scientist, you know, version 10. It's going to be continually doing epistemic foraging. It's going to be finding new things that are useful, and they kind of have to be useful to us, because if it finds things that are not useful to us, then we just won't use them, and then nothing will happen. So do you think there'll always be a kind of value function coupled to humans? >> Jeff Clune had this work on OMNI, right, using LLMs as sort of amortized notions of interestingness for humans. And I think ultimately the way we train these systems is grounded in human data, and going forward it will also be coupled with human data that is collected using verifiers. So I have a hard time believing that, in the long run, when you run this open-endedness paradigm with AI scientist agents, it's going to completely diverge into something that's either fully non-interpretable or unrelated to problems we as humans care about, right?
And then again, humans can steer to a certain degree where the search happens. You can tell a system, okay, try to do cancer research — work on problems that we care about — and ultimately we are the ones who control how many FLOPs are being pushed into this. >> Yeah. Because as a thought experiment, I can imagine, say, in the world of mathematics: what if an AI scientist could come up with entirely new problem formulations and then solve them? These are things that humans had never conceived of before, and maybe they would be less interested in the answer because humans hadn't spent time thinking about it. If you think about it, we could explore the phylogeny of mathematics to the nth degree, and at some point maybe we just wouldn't care anymore. Maybe we could just keep carving out that space forever and ever. >> Yeah. But maybe down the road there is a stepping stone that enables a new innovation in a different field that we actually care about. So it's very hard to say a priori whether or not something is interesting. >> Yes. And there's also the notion — I
love this idea of diverse intelligences and diverse minds — that maybe we could create artifacts in a space which is completely alien to us, and we might even ascribe moral value to them; we might not want to turn off the power because we want these alien artifacts to stay alive. >> Maybe. I read a lot of science fiction, but I would shy away from speculating about all of this. One thing I'm extremely certain of, though, is that the way we conduct research and science is going to fundamentally change in the next 5, 10, and 20 years. And I hope that we're going to be able to tackle some of the biggest problems which still seem unreachable right now, with and by AI. >> So Terence Tao has posted that he's been using GPT-5 and that it's been speeding him up — it's taking away a lot of the drudgery. But the cynical take — and Scott Aaronson posted something similar as well — the cynical take is that maybe laziness is stepping in, and in some pernicious way using AI models is actually stopping us from thinking outside the box. It's encouraging us to search in the neighborhood of things that are known, and that is very useful — it's very useful to have an artifact that knows all of the experiments, all of the things ever done by people 20 years ago. But now we don't have people really applying their brilliance, their talent, in completely new areas. >> So first off, it's great that these experts are already using the technology in their day-to-day work. And I think it's also important that really top-level scientists try to push what's capable with these systems, and squeeze out where there might be blind spots, or things these systems can't do. Second, I think it comes down to discipline and how we raise the next generation. Discipline on the personal level: how much do you just tab-accept everything that's being proposed by these systems? And responsibility in terms of educating the next generation, in the sense that we need to teach our kids that what comes out of these systems might not always be true, that "facts" can be subjective, if you will, and that you need to do more research about what's being given to you. Like I said, this is a cultural evolution that we have to step through and try to make the best of. >> Yeah, the autopilot thing is very interesting, because there is a tendency, using Cursor, that at some point the models are generating so quickly that you can't even read the tokens coming at you, and then you just press accept, and you press accept. It's the same thing in cars: as soon as you have too strong an autopilot, you completely switch off — and then you see a divergence, because there's something about thinking that must be grounded in your own path. There's this path dependence, and when you start becoming parasitized by this other train of thought, you stop thinking about your path, and then you're not in the driver's seat anymore. This is a bit of a harsh statement, but sometimes I wonder if these coding assistants are almost like drugs, in the sense that you become addicted: you use up all your budget, then you need to load up again, and once you fully hit the budget limit, you feel like, okay, what am I going to do now? I think once that happens to you, you should really rethink the way you work. To me, right now, there are certain parts where auto-accepting is acceptable and certain parts where it's definitely not, where you really need to go deep into it. And right now we're in this weird non-equilibrium state where things are moving constantly — the models are changing, the features are changing, the points where the systems are good are changing all the time — and we humans need to constantly adapt to that. It's a big cognitive challenge, and I think we all just need to be aware that there are certain problems and challenges we have to adapt to. The best way to do so is to interact with this technology as much as you can, and maybe find new research ideas out of that experience. >> And how is AI Scientist v2 different from v1? >> In v1 we used a template-based approach. So we had a base experiment, and for that base experiment we asked an LLM to generate ideas, with Semantic Scholar calls and literature search. Then it implemented these ideas based on the template — basically code diffs — then linearly executed an experiment plan and wrote a paper at the end. So what could happen was that there was an idea, and the idea didn't work out, but the experiments were still executed linearly, and you wrote a paper anyway. And
this was already impressive in the sense that it looked very much like science. But if you think about human science and the scientific method, it's much more like research, as I said before: you adapt what you're going to execute next, and you refine based on the evidence you've accumulated. This is the notion of falsificationism from Karl Popper: we collect evidence for hypotheses and reject others, and we do so in a loop until we want to publish or we find something. We tried to take this notion and build it directly into the agentic scaffolding for AI Scientist v2. So now it's basically a parallelizable agentic tree search, where a template experiment is no longer needed — it's drafted by the LLM itself — and thereby AI Scientist v2 can be applied to many more settings, if you will. So at the core is this new agentic tree-search paradigm, and then we made a couple of minor technical changes, like using a VLM reviewer to figure out whether a paper's captions are aligned with its figures, and we scale this up to many more computational nodes, and then write a paper at the end again. >> I'm trying to say this in the most polite way possible — I don't want to use the word slop — but a critic might say we are producing papers which appear like papers: they have figures, they have results, they have things written in a certain way, but they're not grounded deep down the epistemic phylogeny. Near the top of the tree we're seeing some novelty and composition happening, but it doesn't reflect a deep understanding. What would you say to that charge? >> It's for sure the case that not every paper that comes out of AI Scientist v2 is a Nature-worthy publication. So definitely there is some amount of, let's say, slop — content that is not a big scientific discovery — being written up by AI Scientist v2. But ultimately, we showed that it was possible to obtain a workshop-level paper. I do think this is basically the first time we can see that we're able to fully autonomously spend compute and API calls to obtain some amount of scientific insight. And for me, right now, it's a good way to prototype ideas, or to investigate a certain field — get an initial starting point, initial results — and then work on top of it. But for sure more work needs to be done to make this entire process more robust, more efficient, and essentially produce many more true positives, if you will. >> Yeah. And it might be one of those things — like when we moved from GPT-3 to GPT-4 there was just a massive increase in fidelity — because slop, to me, simply means a lack of deep, grounded understanding. And there's no reason in principle why these things couldn't have a deep, grounded understanding. They just don't have it yet. So it's something that could improve quite slowly, and then at some point we might just think, oh my god, we've got an AI scientist. >> Yeah. To me this comes back to what we were discussing before. First off, there is a verifier in the loop, in the sense that experiments are actually executed on a computer, so the numerical results are fed back into the system to come up with the next thing to explore. But we haven't made a discovery like residual connections — something that has diffused into everything in machine learning. And I think what we really need is to make these systems much better at integrating knowledge over multiple experiments, and better at formulating the next hypothesis based on previous insights. This might require some amount of post-training on these traces. But I'm pretty positive
that we might also get there just with diversity and scaling these systems up, in an efficient but scaled-up way. >> I'm just wondering whether the first breakthrough discovery would resemble the AI Scientist paper or resemble Shinka Evolve. For example, we could run a massively scaled-up Shinka Evolve and say, I want to discover a new architectural design. Would that happen first, and then we would get the AI Scientist paper to write it up and do ablations and so on? Maybe that would be the pattern of it. >> To a certain degree. I've been thinking a lot about how you can potentially combine these two paradigms — the AI Scientist, and Shinka- or AlphaEvolve-style optimization algorithms. And I do think there's some amount of work to be done on the auto-verification aspect of it, and on the problem-formulation aspect of it. The paper-writing part is actually the least important part of the AI Scientist: it's a form factor that we humans are used to, and it helps anchor our mental model of a scientific discovery. But ultimately I'm not sure the paper is going to be the knowledge-transmission medium in, say, 20 years. Something else I've been thinking about a lot is whether we can make papers much more easily agentically accessible. Right now a paper is a LaTeX document, but you could imagine equipping every paper with several Model Context Protocol servers, so that every figure is reproducible and the data is accessible — essentially making it much easier for LLM agents either to replicate work or to build on it afterwards, doing improvements and ablations themselves through that interface to a paper. But to be entirely honest, I'm not sure it's going to happen, because there have been many great ideas for improving the format of scientific artifacts, and people still seem to like the paper format, which has existed for hundreds of years. So I think it's a question of incentives again, and of really showing that if something like that existed, it would enable much faster progress of AI agents for scientific discovery. >> Yeah, the paper is a great human interface. It's a similar thing with automated driving: we could revolutionize the road network with sensors and dramatically improve the monitoring, observability, and optimization, but we don't. Still, I'm fascinated by that idea. So you're saying not just reproducibility of the experiments, but also the way the figures are designed, and the code, and so on — because then we could create this huge playground where agents can repurpose, recombine, and restudy work published by other scientists. And it also made me think: does having an automated scientist make peer review more or less important? >> I do think it actually makes it more important, at least for now, in the sense that we now have — or could have — a mechanism that generates many, many papers. That first increases the workload on human reviewers, so we need some effective way of filtering, and then only the cream of the crop goes on to human verification. So I think for now the ultimate verification is still the human, and the diffusion of the result through the community, and we need better tools for this automatic filtering and verification. We have the AI reviewer that comes with the AI Scientist, but you probably need some form of experiment execution to actually verify everything. There is, for example, work by OpenAI on PaperBench, trying to go in that direction using LLM soft verification and these types of things. So I'm hopeful that we're going to figure this out in the next few years. >> Yeah. And I think one of the Rubicon moments will be when the next Transformer-like architecture, or something similarly massive, is discovered by AI and we're all using it. My worry, I suppose, is that folks like Google, who have enough compute, are going to be running AI scientists and are going to own many of these discoveries — which is why it's so important to have work that can efficiently discover new things in science. >> And it's important to have work that's openly available. I think with the AI Scientist and Shinka, we're really trying to make sure that we can apply the collective intelligence of all of us to shape how this might look in the future. >> Well, Rob, this has been fantastic — so great to have you on the show. Sakana is hiring amazing engineers, by the way, so if this sounds like — and it is — an amazing opportunity, get in touch with Rob and the team. And I trust you're working on some exciting new things that are coming up. >> Yes. And I hope to be able to talk to you again in the future about some of this. >> Absolutely. Rob, thank you so much for coming. >> Thank you so much, Tim.