Latent Space
Extreme Harness Engineering: 1M LOC, 1B toks/day, 0% human code or review — Ryan Lopopolo, OpenAI
Channel: Latent Space
Date: 2026-04-07
Duration: 77min
Views: 11,625
URL: https://www.youtube.com/watch?v=CeOXx-XTYek

We’re proud to release this ahead of Ryan’s keynote at AIE Europe. Hit the bell to get notified when it is live! Attendees: come prepped for Ryan’s AMA with Vibhu after.

Move over, context engineering. Now it’s time for Harness engineering.

Ryan Lopopolo of OpenAI is leading that charge, recently publishing a lengthy essay that has become the talk of the town. In it, Ryan peeled back the curtain on how the recently announced OpenAI Frontier team have become OpenAI’s top Codex users, running

I do think that there is an interesting space to explore here with Codex the harness as part of building AI products. There's a ton of momentum around getting the models to be good at coding. We've seen big leaps in, like, the task complexity with each incremental model release, where if you can figure out how to collapse a product that you're trying to build, a user journey that you're trying to solve, into code, it's pretty natural to use the Codex harness to solve that problem for you. It's done all the wiring and lets you just communicate in prompts to let the model cook. You kind of have to step back, right? Like, you need to take a systems-thinking mindset to things and constantly be asking: where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I'm putting in place, so I have solved this part of the SDLC. Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring

you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content. We've been approached by sponsors on an almost daily basis. But fortunately, enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way. But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week. If you do it, I promise you we'll never stop working to make the show even better. Now, let's get into it. All right, we're in the studio with Ryan Lopopolo from OpenAI. Welcome. Hi. >> Uh, thanks for visiting San Francisco, and thanks for spending some time with us. >> Yeah, thank you. I'm super excited to be here. >> You wrote a blockbuster article on harness engineering. It's probably going to be the defining piece of this emerging discipline.

>> Thank you. It's been kind of fun to feel like we've defined the discourse in some sense. >> Let's contextualize a little bit. This is the first podcast you've ever done. Yes. And thank you for spending time with us. Where is this coming from? What team are you on? All that jazz. >> Sure, sure. So I work on frontier product exploration, new product development, in the space of OpenAI Frontier, which is our enterprise platform for deploying agents safely at scale with good governance in any business. And the role of me and my team has been to figure out novel ways to deploy our models into packaged products that we can sell as solutions to enterprises. >> And you have a background, I'll just squeeze it in there: Snowflake, Stripe, Citadel. Yes. Right. Yes. >> The exact same kind of customer your entire life. Yes. The exact kind of customer that you want to. >> So I'll say I didn't expect the background. When I looked at your Twitter, I'm seeing the opposite, right? Stuff like this. So you've got the mindset of, like, full-send AI

coding, stuff about slop, like buckling in your laptop in your Waymos, and then I look at your profile and I'm like, "Oh, you're, like, correct in the other room, too." So, perfect mix. Perfect. >> It's quite fun to be an AI maximalist. If you're going to live that persona, OpenAI is the place to do it. >> A token is what they say. >> Yeah. It certainly helps that we have no rate limits internally and I can go, like you said, full send at this thing. >> Yeah. So, OpenAI Frontier, and you're a special team within OpenAI Frontier. >> We had been given some space to cook, which has been super exciting, and this is kind of why I started with kind of an out-there constraint: to not write any of the code myself. I was figuring, if we're trying to make agents that can be deployed into enterprises, they should be able to do all the things that I do. And having worked with these coding models, these coding harnesses, over 6, 7, 8 months, I do feel like the models are there enough, the harnesses are there enough, where they're isomorphic to me in capability,

in the ability to do the job. So starting with this constraint of "I can't write the code" meant that the only way I could do my job was to get the agent to do my job. >> And just a bit of background before that: this is basically the article. So what you guys did is 5 months of working on an internal tool, zero lines of code written by hand, over a million lines of code in the total codebase. You say it was 10x faster than if you had done it by hand. So yeah, that was kind of the mindset going into this, right? >> That's right. I think we started with some of the very first versions of Codex CLI with the codex-mini model, which was obviously much less capable than the ones we have today. Which was also a very good constraint, right? It's quite a visceral feeling to ask the model to build you a product feature and it just not being able to assemble the pieces together, >> which kind of defined one of the mindsets we had going into this, which is: whenever the model just cannot, you always pop open the task, double-

click into it, and build smaller building blocks that you can then reassemble into the broader objective. And it was quite painful to do this. Honestly, the first month and a half was 10 times slower than I would be. But because we paid that cost, we ended up getting to something much more productive than any one engineer could be, because we built the tools, the assembly station, for the agent to do the whole thing. But yeah, so onward to GPT-5.1, 5.2, 5.3, 5.4. To go through all these model generations and see their kind of quirks and different working styles also meant we had to adapt the codebase to change things up when the model was revved. One interesting thing here: with 5.2, the Codex harness at the time did not have background shells in it, which means we were able to rely on blocking scripts to perform long-horizon work. But with 5.3 and background shells, it became less

patient, less willing to block. So we had to retool the entire build system to complete in under a minute. And you know, this is not a thing I would expect to be able to do in a codebase where people have opinions. But because the only goal was to make the agent productive, over the course of a week we went from a bespoke Makefile build to Bazel to Turborepo to Nx, and just kind of left it there, because builds were fast at that point. >> Interesting. Talk more about Turborepo to Nx. That's interesting, because that's the other direction from what other people have been doing. >> Ultimately, I have not a lot of experience with actual front-end repo architecture. >> You're talking to Josh, who's built with this stuff. So, like, I know the Nx team and I know Turbo from Jared Palmer, and I'm like, yeah, that's an interesting comparison. >> The hill we were climbing was: make it fast. >> Are there micro frontends involved? Like, how complex a React Electron single-app sort of thing is it? >> And it must be under a minute. That's an

interesting limitation. I'm actually not super familiar with the background shell stuff. It was probably talked about in the 5.3 release. >> Basically, it means that Codex is able to spawn commands in the background and then continue to work while it waits for them to finish. So it can spawn an expensive build and then continue reviewing the code, for example. >> Yeah. >> And this helps it be more time-efficient for the user invoking the harness. >> I guess, just to really nail this: what does 1 minute matter? Like, why not five, you know? >> Okay, we want the inner loop to be as fast as possible. 1 minute was just a nice round number, and we were able to hit it. >> And if it doesn't complete, it kills it or something? >> No. We just take that as a signal that we need to stop what we're doing, double click, decompose the build graph a bit to get the time back under, so that we can enable the agent to continue to operate. >> It's almost like a ratchet. It's like you're forcing build-time discipline, because if you don't, it'll just grow and grow and grow. >> That's right. >> And you mentioned that — like, the software I work on currently is at 12 minutes. It sucks.

>> This has been my experience with platform teams in the past, right? Where you have sort of an envelope of acceptable build times, you let it go up until it breaches, and then you spend 2-3 weeks to bring it back down to the lower end of the envelope. But because tokens are so cheap, and we're so insanely parallel with the model, we can just constantly be gardening this thing to make sure that we maintain these invariants, which means there's way less dispersion in the code and the SDLC, which means we can kind of simplify and rely on a lot more invariants as we write the software. >> You kind of mentioned in your article that humans became the bottleneck, right? You kicked off as a team of, like, three people. You're putting out a million lines of code, 1500 PRs. Basically, what's the mindset there? Because as much as code is disposable, you're doing a lot of review. A lot of the article talks about how you want to rephrase everything as prompting; everything the agent can't see is kind of garbage, right? You shouldn't have it in there. So what's kind of the high level of how you went about

building it, and then how you addressed: okay, humans are just kind of PR review. Like, how is the human in the loop for this, you know? >> We've moved beyond even the humans reviewing the code as well. Most of the human review is post-merge at this point. >> But that's not even review; that's just, like, making ourselves happy. >> Fundamentally, the model is trivially parallelizable, right? With as many GPUs and tokens as I am willing to spend, I can have capacity to work on my codebase. >> The only fundamentally scarce thing is the synchronous human attention of my team. There's only so many hours in the day. We have to eat lunch. I would like to sleep. Although it's quite difficult to, you know, stop poking the machine, because it makes me want to feed it. You kind of have to step back, right? Like, you need to take a systems-thinking mindset to things and constantly be asking: where is the agent making mistakes? Where am I spending my time? How can I not spend that time

going forward? And then build confidence in the automation that I'm putting in place, so I have solved this part of the SDLC. And usually what that has looked like is: we started out needing to pay very close attention to the code, because the agent did not have the right building blocks to produce modular software that decomposed appropriately, that was reliable and observable, and that actually accreted a working front end out of these things. Right? So in order to not spend all of our time sitting in front of a terminal, at most doing one or two things at a time, we invested in giving the model that observability, which is that graph in the post here. >> Yeah. Let's walk through this. Traces — which existed first? >> We started with just the app, and the whole rest of it, from Vector through to all these logging and metrics APIs, was, I don't know, half an afternoon of my time. We have intentionally chosen very high-level, fast developer tools. There's a

ton of great stuff out there now. We use mise a bunch, which makes it trivial to pull down all these Go-written VictoriaMetrics stack binaries in our local development. A tiny little bit of Python glue to spin all these up, and off you go. One neat thing here is we have tried to invert things as much as possible: instead of setting up an environment to spawn the coding agent into, we spawn the coding agent first — that's the entry point, just Codex — and then we give Codex, via skills and scripts, the ability to boot this stack if it chooses to, >> and then tell it how to set some env variables so the app in local dev points at the stack it has chosen to spin up. And this, I think, is the fundamental difference between reasoning models and the 4.1s and 4o's of the past, where those models could not think. So you kind of had to put them in boxes with a predefined set of state transitions, whereas here we have the model, the harness, be the whole box and

give it a bunch of options for how to proceed, with enough context for it to make intelligent choices. >> Feels like a lot of that is around scaffolding, right? With previous agents, you would define a scaffold. It would operate in that, you know, loop, try again. That's kind of pivoted since we've had reasoning models. They seem to perform better when you don't have a scaffold, right? And you go into niches here too, like your spec.md and having a very short AGENTS.md. >> Yes. >> Yeah. So you even lay out what it is here. >> I like the table of contents. Yeah, stuff like this really helps guide people, because everyone's trying to do this. >> This structure also makes it super cheap to put new content into the repository to steer both the humans and the agents. >> I mean, you kind of reinvented skills, right? >> We built agent skills from first principles. Skills did not exist when we started doing this, right? You have a short, 100-line overall table of contents, and then you have little skills, right? A core beliefs MD, the tech debt tracker. Yeah.

Yeah. So the tech debt tracker and the quality score are pretty interesting, because this is basically a tiny little scaffold — a markdown table which is a hook for Codex to review all the business logic that we have defined in the app, assess how it matches all these documented guardrails, and propose follow-up work for itself. So, you know, before Beads and all these ticketing systems, we were just tracking follow-up work as notes in a markdown file, which we could spawn an agent on a cron to kind of burn down. There's this really neat thing where the models fundamentally crave text. So a lot of what we have done here is figure out ways to inject text into the system. Right? When we get a page because we're missing a timeout, for example, I can just add Codex in Slack on that page and say: I'm going to fix this by adding a timeout; please update our reliability documentation to require that all network calls have timeouts. So I have not only made a point-in-time fix, but also, like, durably encoded this

process knowledge around what good looks like. >> Yeah. >> And we give that to the root coding agent as it goes and does the thing. But you can also use that to distill tests out of, or a code review agent which is pointed at the same things to narrow the acceptable universe of the code that's produced. >> I think one of the concerns I have with that kind of stuff is: you think you're making the right call by making it persist for all time across everything. Yes. But then you didn't think about the exceptions that you need to make, right? And then you have to roll it back. >> Part of it is also that sometimes it can follow instructions too well. >> It's somewhat a skill, right? So it determines when it uses the tools. It's not like it'll run on every call. It'll determine when it wants to check the quality score, right? >> Yeah. And in the prompts we give these agents, we do allow them to push back. When we first started adding code review agents to the PR, it would be: Codex locally writes the change, pushes up a PR. On those PR synchronizations, a review agent fires, it posts a comment. We instruct Codex

that it has to at least acknowledge and respond to that feedback. And initially, the Codex driving the code author was willing to be bullied by the PR reviewer, which meant you could kind of end up in a situation where things were not converging. So we kind of had to add more optionality to the prompts on both of these things. Like, the reviewer agents were instructed to bias toward merging the thing, to not surface anything greater than a P2 in priority. We didn't really define P2, but we gave it a framework within which to score its output. >> And greater than P0 is worse, right? >> P0 is: you will, like, nuke the codebase if you merge this thing, right? >> Yeah. Yeah. >> But also on the code-authoring agent side, we gave it the flexibility to either defer or push back against review feedback. It happens all the time, right? Like, I happen to notice something and leave a code review which could blow up the scope by a factor of two. I

usually don't mean for that to be addressed exactly in the moment. It's more of an FYI, right? File it to the backlog, pick it up in the next fix-it week sort of thing. And without the context that this is permissible, the coding agents are going to bias toward what they do, which is following instructions. >> Yeah, I did want to check in on a couple of things, right? Like, the code review agent — it can merge autonomously. >> I think that's something that a lot of people aren't comfortable with, right? And you have a list here of how much the agents do: product code and tests, CI configuration, release tooling, internal dev tools, documentation, the eval harness, review comments, scripts that manage the repository itself, production dashboard definition files — like, everything. Yes. And so they're all churning at the same time. Is there, like, a cord that any human on the team pulls to stop everything? >> So, because we are building a native application here, we're not doing continuous deployment, right? So there is still a human in the loop for cutting the release branch. I see. >> We require a blessed, human-approved smoke

test of the app before we promote it to distribution, these sorts of things. >> So you're working on the app; you're not building, like, infrastructure where you have nines of reliability, that kind of stuff. >> That's correct. That's correct. Okay. And also, full recognition here that all of this activity took place in a completely greenfield repository. There should be no assumption that this applies generally. >> But this is a production thing you're going to ship to customers, of course. Yeah. You know, so this is real. >> And, like, one of the things there is, you mentioned you started this as a repo from scratch. The onboarding, the first month or so, was, like, working backwards, right? You had to work with the system, and now you're at that point where you're very autonomous. I'm curious: how human-in-the-loop is it? Like, what are the bottlenecks that you wish you could still automate? And part of that is also, where do you see the model trajectory improving and offloading more of the human in the loop? Right, we just got 5.4. >> It's a really good, fantastic model, by the way. >> Yeah. Yeah. It's the first one that's merged top-tier coding — so it's Codex-level coding — and reasoning. So

general reasoning, both in one model, right? >> And computer use. >> Computer use. Now I can just have Codex write the blog post, whereas for this one I had to balance between chat and — >> Oh, I might be out of a job. >> Oh my god. >> I know. You just gave me an idea for a completely AI newsletter that, like, 5.4 could do. >> Yeah, I get it. Now, this sort of thing is just one example of closing the loop, right? Like the dashboard thing you mentioned. We have Codex authoring the JSON for the Grafana dashboards and publishing them, and also responding to the pages, which means when it gets the page, it knows exactly which dashboards are defined and which alert was triggered by which exact log line in the codebase, because all of this stuff is collated together. >> It has to own everything. >> Yes. Yes. And it means that if we have an outage that did not result in a page, it has the existing set of dashboards available to it. It has the existing set

of metrics and logs, and can figure out where the gaps in the dashboards are, or in the underlying metrics, and fix them in one go — in the same way you would have a full-stack engineer be able to drive a feature from the back end all the way to the front end. >> So it seems like a lot of the work you guys had to do was, you as a small team are fully optimizing for the way that the model wants the software to be written, right? It's less human-legible, in exchange for better agent legibility. How do you think that affects broader teams? So one, at OpenAI: do you, like, evangelize that this is how software should be written? Like, I can imagine, say, you join a new team with this methodology, this mindset. There are ways that teams do code review, teams write code, teams are structured, and a lot of it is for human legibility. So, like, should we all swap? How does this play back, one, broader into OpenAI, and then broader into software engineering, right? Like, for teams that pick this up — it's pretty drastic, right? You have to make a pretty big switch. Should they just full send? >> The mindset is very much that I'm

removed from the process, right? I can't really have deep code-level opinions about things. It's as if I'm group tech leading a 500-person organization. Like, it's not appropriate for me to be in the weeds on every PR. This is why that post-merge code review thing is a good analog here, right? I have some representative sample of the code as it is written, and I have to use that to infer what the teams are struggling with, where they could use help, where they're already moving quickly so I can pivot my focus elsewhere. >> Yeah. >> So I don't really have too many opinions about the code as it is written. I do, however, have, like, a command base class which is used to have repeatable chunks of business logic that come with tracing and metrics and observability for free. And the thing to focus on is not how that business logic is structured, but that it uses this primitive, because I know that's going to give leverage by default. >> Yeah. >> Yeah, back to that sort of systems thinking. >> And you have part of that in your blog post, enforcing architecture and taste,

how you set boundaries for what's used. There's also a section on, like, redefining engineering and stuff. But yeah, it's just interesting to hear. >> And you know, as the models have gotten better, they have gotten better at proposing these abstractions to unblock themselves, which again lets me move higher and higher up the stack, to look deeper into the future at what will ultimately block the team from shipping. >> Yeah. So this is primarily, like, a 1-million-line-of-code codebase, an Electron app, but it manages its own services as well. So it's like a backend-for-frontend type thing. >> We do have, like, a backend in there, but that's hosted in the cloud. This sort of structure is actually within the separate main and renderer processes within the Electron app. >> That's just how Electron works. >> Yeah. Yeah. So we have also treated, like, MVC-style decomposition with the same level of rigor, which has been very fun. >> I have a fun pun — this is, like, a tangent — but you know MVC is model-view-controller, and any sort of full-stack web dev knows that, but my AI-native

version of this is model-view-claw — the claw, the harness. >> That's right. That's right. I do think that there is an interesting space to explore here with Codex the harness as part of building AI products. There's a ton of momentum around getting the models to be good at coding. We've seen big leaps in, like, the task complexity with each incremental model release, where if you can figure out how to collapse a product that you're trying to build, a user journey that you're trying to solve, into code, it's pretty natural to use the Codex harness to solve that problem for you. It's done all the wiring and lets you just communicate in prompts to let the model cook. >> It's been very fun. And it's also, like, a very engineering-legible way of increasing leverage. Right. Yeah. >> You just give the model scripts — the same scripts you would already build for yourself. >> Yeah. >> Yeah. So for listeners, this is Ryan saying that software engineering, or coding agents, will eat knowledge work —

like the non-coding parts where you would normally think, oh, you have to build a separate agent for it. No — you start with a coding agent and go up from there, which OpenClaw has, right? It's pi under the hood. >> Yes. >> Basically, define your task in code. Everything is a coding agent. >> By the way, since I brought it up — it's probably the only place it comes up — any OpenClaw usage from you? >> No, no, not for me. I don't have any spare Mac minis rattling around my house. >> You can afford it. Um, no, I'm just kind of curious if it's, like, changed anything in OpenAI yet, but it's probably early days. And then the other thing I want to pull on here is, you mentioned ticketing systems and you mentioned PRs, and I'm wondering if both those things have to go away or be reinvented for this kind of coding, right? Git itself is, like, very hostile to multi-agents. >> Yeah, we make very heavy use of worktrees. >> Right. But even then — I just dropped a podcast yesterday with Cursor, and they said they're getting rid of worktrees, because it still has too many merge conflicts. It's still too unintuitive. But go

ahead. >> The models are really great at resolving merge conflicts. Yeah. And to get to a state where I'm not synchronously in the loop in my terminal, I almost don't care that there are merge conflicts. >> Disposable, right? >> We invoke a land skill, and that coaches Codex to push the PR, wait for human and agent reviewers, wait for CI to be green, fix the flakes if there are any, merge upstream if the PR comes into conflict, wait for everything to pass, put it in the merge queue, deal with flakes until it's in main. And this is kind of what it means to delegate fully, right? Like, this is — in a very large monorepo — probably a significant tax on humans, to get PRs merged. But the agent is more than capable of doing this, and I really don't have to think about it, other than keeping my laptop open. >> Yeah. I used to be much more of a control freak, but now I'm like, yeah, actually, you could do a better job than me. >> Yeah. >> With the right context.
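The land skill's merge loop reads like a small state machine. A toy Python sketch of the decision logic — every name here is hypothetical; in the real setup this lives in a prompt and scripts that Codex interprets, not in application code:

```python
# Map (signal, status) observations to the next action in the loop:
# push the PR, then keep iterating until the change lands in main.
ACTIONS = {
    ("ci", "red"): "fix_flakes",
    ("ci", "green"): "check_reviews",
    ("reviews", "pending"): "wait",
    ("reviews", "approved"): "enqueue_merge",
    ("base", "conflict"): "merge_upstream",
    ("queue", "landed"): "done",
}


def next_action(signal, status):
    """Pick the next step; anything unrecognized escalates to a human,
    which is the "pull the cord" escape hatch."""
    return ACTIONS.get((signal, status), "escalate_to_human")
```

Writing it down this way makes the delegation point concrete: each transition a human used to babysit becomes an automated edge, and only unknown states cost synchronous human attention.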

>> Yes. >> Anything else on harness engineering in general? Just this piece — I wanted to make sure we cover it. >> I think there's one thing that I maybe didn't make super clear in the article, that I kind of heard on Twitter was of interest. >> What's the chatter, and then what's your response? >> Ultimately, all the things that we have encoded in docs and tests and review agents and all these things are ways to put all the non-functional requirements of building high-scale, high-quality, reliable software into a space that prompt-injects the agent. We either write it down as docs, or we add lints where the error messages tell you how to do the right thing. So the whole meta of the thing is to basically tease out of the heads of all the engineers on my team what they think good looks like — what they would do by default, or what they would coach a new hire on the team to do to get things to merge. And that's why we pay attention to all the mistakes that the agent makes, right? This is code being

written that is misaligned with some as-yet-not-written-down non-functional requirement. >> Sorry, what did the online people misunderstand? >> No, somebody just literally said that, and I was like, "Oh, yeah. Okay. This is the thing. This is what I was doing." I agree with it. >> I see. Interesting. >> One other neat thing, which I totally did not expect, is folks were just taking the link to the article and giving it to, like, pi or Codex and saying, make my repo like this. >> You achieve a full recursion. >> And it was wildly effective. >> Really? >> It was wildly effective. >> This is actually something I tried with 5.4 yesterday. I didn't have that much time — I was out speaking at something — and this is one of my things. I was like, okay, I have this article; can we just scaffold out what it would be like to run this? And I did it first as that, and then I was like, okay, let me take another little side repo and say, okay, what if I was to fully automate this — because I haven't written a line of code in it, it's, like, a full send — >> it's a side thing I'm doing with, like, voice TTS. I'm just, like, slopping out

whatever. It's not production. I'm like, how would I make this like this? And it's actually, like, a really good way to learn what could be changed. It's just good analysis, right? You give it all the code, you give it all the context, you give it the article, and it walks you through it very well. >> That's right. That's right. >> I guess one more thing before we go to Symphony: I wanted to cover Bret Taylor's response. We had him on the show. He is your chairman, which is wild. >> Yeah. >> That he's reading your articles as well and getting engaged with them. He says software dependencies are going away, basically. They can just be, like, vendored. >> Yes. >> Your response? >> 100%. >> You still pay Datadog. You still pay Temporal. Thank you. >> Yep. The level of complexity of the dependencies that we can internalize is, I would say, low-medium right now, just based on model capability. >> What is medium? >> I would say, like, a couple-thousand-line dependency is a thing that we could in-house, no problem, in an afternoon of time. One neat thing about it is that probably most of that code you don't even need, right? Like, by in-housing an

abstraction, you can kind of strip away all the generic parts of it and only focus on what you need to enable the specific things you're building. >> I've been calling this the end of plugins. >> Yeah, because there's so much — like, you know, when I publish an open-source thing, I want to accept everything and be liberal in what I accept, right? This is Postel's law. But that means there's so much bloat, so much overhead. >> One other neat thing about this, too, is when we deploy Codex security on the repo, it is able to deeply review and change the internalized dependencies >> in a much lower-friction way than it would be to, like, push patches upstream, wait for them to be released, pull them down, make sure they're compatible with all the transitives I have in my repo, and things like that. So it's also much lower friction to internalize some of these things, if code is free because the tokens are cheap, sort of thing. >> Yeah. Yeah. I think, like, the only argument I have against this is basically scale testing — which, obviously, the larger pieces of software like Linux, MySQL — he calls out even the Datadogs and Temporals — and then maybe security

testing where uh classically I think is it Linus Tovals who said like security open source is the best disinfectant >> right many eyes >> many eyes and uh if you you know inline your dependencies and and code them up you're going to have to relearn mistakes from other people that you know >> Yep. Yep. And you know to internalize that dependency you're back to zero and you have to kind of start reassembling all those bits and pieces to have high confidence in the code as it is written right. >> Yeah. Um >> even part of like the first intro of this you basically mentioned like everything was written by uh codeex including internal tooling right so internal tooling like when you're visualizing what's going on it's it's writing it forward to Yeah, I built internal tools for AI now and like I just showed them off and they're like how long did you spend and I I they I didn't spend any time I just prompted it, you know, >> very funny story here. >> Yeah, go ahead. >> We had deployed our app to the first dozen users internally uh had some performance issues. So we asked them to

export a trace for us. Uh get a tarball, gave it to our on call engineer and he did a fantastic job of working with codeex to build this beautiful local dev tool nex.js app that you drag and drop the parall in and it visualizes the entire trace. Uh it's fantastic. Took an afternoon, but none of this was necessary because you could just spin up codecs and give it the tarball and ask the same thing and get the response immediately. So in a way optimizing for human legibility of that debugging process was wrong. It kept him in the loop unnecessarily when instead he could have just like codex cooked for 5 minutes and gotten the same. >> Yeah. You have to fight your instincts here of like this is how we used to do it or this is how I I would have used to solve it. >> Yeah. in this in this local uh observability stack like sure you can def deploy Jerger to visualize the traces but I wouldn't expect to be looking at the traces in the first place because I'm not going to write the code to fix them. >> Yeah. I mean so basically there needs to be like this kind of house stack and

owning the whole loop. I think that that is very well established and uh it sounds like you might be like sharing more about that in the future, right? >> Yeah. Uh I think we're excited to do so. We're gonna talk about Symphony in a little bit, but like the way we distribute it it as a spec, which I think folks are calling ghost libraries on Twitter. Like this is like a such a cool name. Um it does mean it becomes much cheaper to share software with the world, right? You define a spec how you could build your own specifying as much as is required for a coding agent to reassemble it locally. The flow here is very very cool. Like we have taken all the scaffolding that has existed in our proprietary repo, spun up a new one, ask codeex with our repo as a reference, write the spec. We tell it, spin up a T-mox, spawn a disconnected codeex to implement the spec, wait for it to be done, spawn another codeex and another T-Mox to review the spec or review the implementation compared to upstream and update the spec so it diverges less. And

then you just loop over and over and over. Ralph style until you get a spec that is with high fidelity able to reproduce the system as it is. It's fantastic and >> and you're basically you're not really adding any of your human bias in there, right? Like a lot of times people will write a spec and be like okay I think it should be done this way and you'll you'll riff on something and it's like no that agent could have just handled it. Like you're still scaffolding in a sense, right? I want it done this way. It can determine that spec better better. >> That's right. That's right. Part of me uh you know I've been working a lot on eval recently and part of me is wondering if an agent can produce a spec that it cannot solve like is it always capable of things that it can imagine or can you imagine things that it is impossible to do. I think with symphony we there's like this uh there's this axis right where you have things that are easy or hard or established or new right and I think things that are hard and new is still something that uh the models need humans yeah drive but I think those other quadrants are largely

solved, given the right scaffold and the right thing that's going to drive the agent to completion. >> It's crazy that it's solved. >> But it means that the humans, the ones with limited time and attention, get to work on the hardest stuff, right? The problems where it's pure white space out in front, or the deepest refactorings where you don't know what the proper shape of the interfaces is. And this is where I want to spend my time, because it lets me set up for the next level of scale. >> Yeah. Amazing. Let's introduce Symphony. I think we've been mentioning it every now and then. Elixir, interesting option. >> Yeah. And again, the Elixir manifestation here is just a derivative. >> Was it model-chosen? >> Yeah. And it chose that because >> the process supervision and the GenServers are super amenable to the type of process orchestration we're doing here. You are essentially spinning up little daemons for every task in execution and driving it to completion. Which means the model gets a ton of stuff for free by using

Elixir and the BEAM. >> I had to go do a crash course in the BEAM and Elixir, and I think most people are not operating at the scale of concurrency where you need that, but it is a good mental model for resumability and all those things, and these are things I care about. But tell me the origin story of Symphony. What do you use it for? How did it form, and were there any abandoned paths you didn't take? >> At the end of December, we were at about three and a half PRs per engineer per day. This was before 5.2 came out at the beginning of January. Everyone gets back from holiday with 5.2 and no other change to the repository, and we were up to five to ten PRs per day per engineer. And I don't know about y'all, but it's very taxing to constantly be switching like that. I was pretty tapped out at the end of the day. So again, where are the humans spending their time? They're spending their time >> context switching between all these active tmux panes to drive the agent

forward. So let's again build something to remove ourselves from the loop. This kicked off a frantic sprint to find a way to remove the need for the human to sit in front of their terminal. So, a lot of experimentation with dev boxes and automatically spinning up agents. It seems like a fantastic end state, where my life is the beach: I open the laptop twice a day and say yes or no to these things. >> This is again a super interesting framing for how the work is done, because I become latency-insensitive. I have way less attachment to the code as it is written. I've had close to zero investment in the actual authorship experience. So if it's garbage, I can just throw it away and not care too much about it. In Symphony, there's this rework state where once the PR is proposed and it's escalated to the human

for review, it should be a cheap review, right? It is either mergeable or it is not. And if it's not, you move it to rework: the Elixir service will completely trash the entire work tree and PR and start again from scratch. >> And this is that opportunity again to ask why it was trashed, right? What did the agent do wrong, and >> fix that before moving the ticket to in-progress again? >> Yeah. >> Why is this not in the Codex app? I guess you guys are ahead of the Codex app. >> Yeah. So the way the team has been working is basically to be as AI-pilled as possible and sprint ahead, and a lot of the things we have worked on have fallen out into a lot of the products we have. We were in deep consultation with the Codex team to have the Codex app be a thing that exists, to have skills be a thing Codex is able to use so we didn't have to roll our own, to put automations into the product so all of our automatic refactoring agents didn't have to be these hand-rolled

control loops. It has been really fantastic to be, in a way, unanchored from the product development of Frontier and Codex and just very quickly figure out what works, and then later find the scalable thing that can be deployed widely. It's been a very fun way to operate. It's certainly chaotic. I have very often lost track of what the actual state of the code looks like, because I'm not in the loop, right? There was one point where we had wired Playwright directly up to the Electron app with MCP. I'm pretty bearish on MCPs, because the harness forcibly injects all those tokens into the context and I don't really get a say over it. They mess with auto-compaction. The agent can forget how to use the tool. There are probably only, what, three calls in Playwright that I actually ever want to use, so I pay the cost for a ton of things. Somebody vibed a local daemon that boots Playwright and exposes a tiny

little shim CLI to drive it. And I had zero idea this had occurred, because to me, I run Codex and it just gets better. >> Yeah, no knowledge of this at all. >> So in human space we have had to spend a lot of time doing synchronous knowledge sharing. We have a daily standup that's 45 minutes long, because we almost have to fan out the understanding of the current state. >> Yeah, I was going to say, this is good for single-human multi-agent, but multi-human multi-agent is a whole explosion of stuff. >> Yeah. And this is fundamentally why we have such a rigid, 10,000-engineer-level architecture in the app: we have to find ways to carve up the space so people are not trampling on each other. >> Sorry, I don't get the 10,000 thing. Did I miss that? >> The structure of the repository is like 500 npm packages. It's like

architected way beyond what you would consider normal for a seven-person team. But if every person is actually like 10 to 50, then being super deep into decomposition and sharding and proper interface boundaries makes a lot more sense. >> Right, to me that's why I talked about micro frontends, and NX is from that world. But just coming back to this: I don't know if you have other thoughts on orchestrating so much work going through this. Any aha moments? >> It'll be interesting to see where... Okay, so right now you pick Linear as your issue tracker, right? >> Is it actually Linear? >> This is actually Linear. >> Oh, that's Linear. It's Linear. I never looked at the video, the demo video I had to download to run. Because I'm a Slack maxi, but yeah, Linear is also really good. >> We do make good use of Slack. We fire off Codex to do all these >> low-complexity fixups, the things that

sync that knowledge into the repository. It's super cheap. >> Yeah, do it in Codex. >> My biggest plug is OpenAI needs to build Slack, right? You need to own Slack to turn this into... >> I did read that. Yeah. >> I would say that if we think we want these agents to do economically valuable work, which is the mission, right, we want AI to be deployed widely to do economically valuable work, then we need to find ways for them to naturally collaborate with humans. Which means collaboration tooling, I think, is an interesting space to explore. >> Yeah, totally. GitHub, Slack, Linear. That was kind of my thing: where do we see this going? Right now, Codex started as the Codex model, then the CLI, now there's an app. The app can let me shoot off multiple Codexes in parallel, but there's no great team collaboration for Codex yet. And it seems like your team had some say in what comes out, right? You talked to them, and Codex kind of grew from there. For the stuff that you might not focus on, what do you expect other people to be building? So

people that are 5x-ing, 50x-ing: should you build stuff that's very niche for your workflow, for your team? Should it be more general so other people can adopt it? Is there a niche there? Because part of it is just, okay, is everything just internal tooling? Does every team have its own ways it likes to communicate, or is there a broader way to do it? Is it something like an issue tracker? Just thoughts, if you want to riff on that. >> I think TBD; we have not figured this out in a general way. I do think there is leverage to be had in making the code and the processes as uniform as possible. If you think that code is context, code is prompts, it's better from the agent-behavior perspective to be able to look at a package in directory XYZ and not have to page so deeply into directory ABC, because they have the same structure, use the same language, have the same patterns internally. And that same leverage comes from aligning on a single set of skills that

you're pouring every engineer's taste into to make sure the agent is effective. In our codebase, we have, I think, six skills. That's it. And if some part of the software development loop is not being covered, our first attempt is to encode it in one of the existing set of skills. Which means we can change the agent behavior more cheaply than changing the human driver's behavior. >> Yeah. Have you experimented with agents changing their own behavior? >> We do, yes. Or a parent agent changing a sub-agent's behavior, something like that. We have some bits for skill distillation. For example, there's one neat thing you can do with Codex, which is just point it at its own session logs and ask it to >> tell you how you can use the tool better. >> Introspection. Ask it to do things. >> How could I have used this session better? What skills should I have? Yeah, I like that modification of "you can just do things": you can just ask the agent to do things.
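The skill-distillation idea above, mining an agent's own session logs and review feedback for signs of missing context, can be sketched roughly like this. It's a minimal illustration, not OpenAI's actual pipeline; the signal shape and the function name are invented for the example.

```python
# A minimal sketch of skill distillation as described above: treat PR comments
# and failed runs mined from session logs as "missing context" signals, and
# rank which parts of the repo most need better skills or docs. The
# (package, kind) tuple shape is an assumption made for this illustration.

from collections import Counter

def missing_context_hotspots(signals):
    """signals: iterable of (package, kind) pairs, e.g. ('billing', 'pr_comment')
    or ('auth', 'failed_build'). Returns packages ranked by how often they
    generated corrective feedback, most frequent first."""
    counts = Counter(pkg for pkg, _kind in signals)
    return counts.most_common()
```

Run daily over a whole team's logs, the top of a list like this points at which skill file to improve next, so everyone benefits from everyone else's feedback for free.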

>> Yeah, you can just Codex things. This is a silly emoji we have: you can just Codex things. You can just prompt things. It's a really glorious future we live in. But okay, you can do that one-on-one; we're actually slurping these up for the entire team into blob storage and running agent loops over them every day to figure out where we can do better as a team, and how we reflect that back into the repository. That way everybody benefits from everybody else's behavior for free. Same for PR comments, right? These are all feedback that means the code as written deviated from what was good. A PR comment, a failed build: these are all signals that mean at some point the agent was missing context. We've got to figure out how to slurp it up and put it back in the repo. >> By the way, I do this exactly. When I use Claude Code for knowledge work... >> Claude Code is a nice product, right? I think you would agree. >> I always have it tell me, what do I do better next time? >> Right. And that's the meta-programming

reflection thing. So think of it almost like you have six layers in Symphony, plus a zero layer. The six layers are policy, configuration, coordination, execution, integration, observability. We've talked about a couple of these, but the zero layer is the okay, are we working well? Can we improve how we work? Like, can I modify my own workflow MD or something? >> Yeah, of course you can. This thing is also able to cut its own tickets, because we give it full access. >> Yeah, make a ticket to have it cut tickets. You can put in the ticket that you expect it to file follow-up work. >> Self-modifying, yeah. Don't put the agent in a box. Give the agent full accessibility over its domain. >> I had a mental reaction when you said don't put the agent in a box. I think you should put it in a box; it's just that you're giving the box everything it needs. >> Yeah, context and tools. Right. As developers, we're used to calling out to different systems, but here you use the open source things, Prometheus or whatever, and you run it

locally so that you can have the full loop, right? I assume. >> Yep. >> You want to minimize cloud dependencies. You also want to make sure you think about what the agent has access to, right? What does it see? Does it go back into the loop? In the most basic sense, you let it see its own call traces, so it can determine where it went wrong, right? But are you feeding that back in? At the most basic level, you want to see exactly what's input and output. Does the agent have access to what is being output? It can self-improve a lot of these things. >> It's all text, right? My job is to figure out ways to funnel text from one agent to the other. >> It's so strange. Way back at the start of this whole AI wave, Andrej was like, English is the hottest new programming language. It's here. >> A lot of software has a GUI made for the human, and now we're seeing the evolution of a CLI for everything, right? All tools have a CLI your agent can use well. But do

we get good vision? Do we get good little sandboxes? Right now it's a really effective way, right? Models love to use tools, they love to bash, they love to read through text. So slap a CLI on it and let it go loose; that works for everything. >> It does, yeah. We've also been adapting non-textual things to that shape in order to improve model behavior in some ways. We want the agent to be able to see the UI. Agents do not perceive visually the same way we do, right? They don't see a red box, they see "red box button". They see these things in latent space. So if we want >> We have a thing that goes off every time someone says "latent space". Ding. >> Anyway, if we want to actually make it see the layout, it's almost easier to rasterize that image and feed it in to the agent. And there's no reason you can't do both, right, to further refine how the

model perceives the object it's manipulating. >> Cool. Do you want to talk about a couple more of these layers that might bear more introspection, or that you have personal passion for? >> I will say that the coordination layer was a really tricky piece to get right. >> Let's do it. I'm all about that. And this is Temporal's core thing. >> This is where, when we turn the spec into Elixir, the model takes a shortcut, right? It's like, oh, I have all these primitives I can make use of in this lovely runtime that has native process supervision. Which is, I think, a neat way to have taken the spec and made it more achievable by making choices that naturally map to the domain. In the same way that you would >> prefer a TypeScript monorepo if you are doing full-stack web development, right? Because >> the ability to share types across the frontend and backend reduces a lot of complexity. And because

>> That's what GraphQL used to be. >> That's right. >> I don't know if it's still alive, but... >> No humans in the loop here. So my own personal ability to write or not write Elixir doesn't have to bias us away from using the right tool for the job, which is just wild. >> Love it. I wonder if any languages struggle more than others because of this. I feel like every language has its own abstractions that would make sense, but maybe it might be slower, it might be more faulty, where you'd have to kick the server every now and then. I don't know. I think the observability layer is really well understood. Integration layer: MCP is dead. All of these form a really interesting hierarchy to travel up and down, a common language for people working on the system to understand. >> The policy stuff is really cool, right? You don't really have to build a bunch of code to make sure the system waits for CI to pass. It's your institutional knowledge. >> Yeah, you just give it the gh CLI with some text that says CI has to pass.
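The "CI has to pass" policy amounts to a very small control loop over the gh CLI. A hedged sketch: `gh pr checks` is a real command, but the exit-code convention and the string matching below are assumptions for illustration, and the runner is injected so the control flow is testable without a live repo.

```python
# Sketch of the "CI has to pass" policy as a loop over the gh CLI.
# `run` stands in for a subprocess wrapper around the gh CLI and returns
# (exit_code, output); the exact codes and output format are assumptions.

import time

def wait_for_ci(pr_number, run, poll_seconds=30, max_polls=120):
    """Return True once all checks pass, False on an observed failure or timeout."""
    for _ in range(max_polls):
        code, output = run(["gh", "pr", "checks", str(pr_number)])
        if code == 0:
            return True                  # all checks green: policy satisfied
        if "fail" in output.lower():
            return False                 # a check failed: escalate, don't keep waiting
        time.sleep(poll_seconds)         # still pending: poll again
    return False
```

The point of the policy layer is exactly that this wrapper is almost nothing: the institutional knowledge lives in the prompt, and the harness only needs a thin loop around an existing CLI.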

>> It makes the maintenance of these systems a lot easier. >> Do you think CLI maintainers need to do anything special for agents, or is it good as is? Because I don't think when people made the GitHub CLI they anticipated this happening. >> That's correct. The gh CLI is fantastic. You can go try gh repo create, or gh pr view with a pull request number, like 153, and it pulls it up. >> Basically my only interaction with the GitHub web UI at this point is gh pr view --web, glance at the diff, and be like, sure thing, send it. >> Yeah. But the CLIs are nice because they're super token-efficient, and they can be made more token-efficient really easily, right? I'm sure you've all seen it: I go to Buildkite or Jenkins and I just get this massive wall of build output, and in order to unblock the humans, your developer productivity team is almost certainly going to write some code that parses the actual exception out of the

build logs and sticks it in a sticky note at the top of the page. You basically want CLIs to be structured in a similar way, right? You're going to want to patch --silent into prettier, because the agent doesn't care that every file was already formatted. It just wants to know whether it's formatted or not, so it can then go run the write command. Similarly, in our pnpm distributed script runner, when you do --recursive it produces an absolute mountain of text, but almost all of that is for passing test suites. So we ended up wrapping all of this in another script >> to suppress it, >> which you can vibe up to output only the failing parts of the tests. >> Yeah, you could pipe errors separately from standard out. I don't know, okay, whatever. Too much thinking to have to do for a CLI. I used to maintain a CLI for my company, so this is very core to my

heart, but you're vibing my job away. >> That's right. >> Cool. Any other things? This is a long spec, and I appreciate that; it's got a lot of strong opinions in it. Any other things we should highlight? Obviously you can spend the whole day going through these, but I do think some of these got a lot of care, and some of this you might want to tell people: hey, take this, but make it your own. >> Fundamentally, software is made more flexible when it's able to adapt to the environment in which it is deployed, right? Which means that things like Linear or even GitHub are specified within the spec, but are not required pieces of it. There's a more platonic ideal of the thing, where you could swap in Jira or Bitbucket, for example. But being able to tightly specify things like the ID formats, or how the Ralph loop works for the individual agents, basically means you can get up

and running with a fully specified system quickly, which you then evolve later on. We never intended for this to be a static spec that you can never change, right? It's more like a blueprint to get something working up and running >> for you to then vibe on later to your heart's content. >> You have code and scripts in here where it's like... I mean, I think this is a really good prompt. It's just a very, very long prompt. >> Fundamentally, the agents are good at following instructions. So give them instructions, right? It will improve the reliability of the result. Much like the way we use Symphony, we don't want folks to have to monitor the agent as it is vibing the system into existence. So being very opinionated, very strict around what the success criteria are means that >> our deployment success rate goes up. >> Yeah, and it means we don't have to get tickets on this thing. >> I think it all goes back to disposability, right? Early on, when

you had the CLI and you'd kick off a Codex run, it would take two hours. You would kind of want to monitor it: okay, I'm in the workflow of just using one, I don't want it to go down the wrong path, I'll cut it off. But no, just shoot off all four. That was my favorite thing about the Codex app, right? Just 4x it. It's okay; one of them will probably be right, one of them might be better. Stop overthinking it. My first example was probably Deep Research. When Deep Research came out, I asked it something about LLMs, it thought it was something legal, spent an hour, and came back with a report completely off the rails. And I was like, okay, I've got to monitor this thing a bit. No, don't monitor it. You want to build it so that it goes the right way, and you don't want to sit there and babysit your agents. >> With that Deep Research query, looking at the bad result, you probably figured out you needed to tweak your prompt a bit, right? That's the guardrail you fed back, into the codebase or the prompt, to further align the agent's execution. The same sort of concepts apply there too. >> When you talk, I mean, how are the customers feeling

>> For Symphony? I think we have none; this is a thing we have put out into the world. >> I mean, Symphony is internal, right? As long as you're happy, you're the customer. >> That's right. >> Just, you know, what's the external view? >> I'd say folks are very excited about this way of distributing software and ideas cheaply. For us as users, it has again pushed productivity 5x. Which means I think there's something here that's a durable pattern around removing the human from the loop and figuring out ways to trust the output, right? The video that is shared here >> is the same sort of video we would expect the coding agent to attach to the PR >> that is created. That's part of building trust in this system. And that's to me fundamentally what has been cool about building this: it more closely pushes that persona of the agent working with you to be like a

teammate, right? I don't shoulder-surf you for the tickets you work on during the week. I would never want to do that. I wouldn't want a screen recording of your entire session in Cursor or Claude Code. I would expect you to do what you think you need to do to convince me that the code is good and mergeable, >> and compress that full trajectory in a way that is legible to me, the reviewer. >> Yep. And you can just do that because >> Codex will absolutely sling some FFmpeg around. It's great. >> Oh, FFmpeg is the OG god CLI. >> Yeah, the Swiss army chainsaw. I used to say there's a SaaS, a micro-SaaS let's call it, in every flag of FFmpeg. >> Oh, for sure. >> Just host it as a service, put a UI on it. People who don't know FFmpeg will pay for it. >> When we were first experimenting with this, it was a wild feeling to be at the computer with windows just

popping up all over the place and getting captured, and files appearing on my desktop. It very much felt like the future, to have a thing controlling my computer for actual productive use, right? I'm just there keeping it awake, jiggling the mouse every once in a while. >> That's what some office workers do. They buy a mouse jiggler. >> That's right. >> One thing I wanted to ask: okay, the stuff is so good, it's disposable, async, shoot off a bunch of agents. One question is, are you always an extra-high-thinking guy, and where do you see Spark, the 5.3 Spark? There's a lot of me wanting to make quick changes. I'm not going to open up the IDE, I'm not going to do anything, but I will say, okay, fix this little thing, change a line, change a color. Spark is great for that. But am I still the bottleneck? Why don't I just let that go back in? Just riff on that. >> Spark is such a different model compared to the extra-high-level reasoning that you get in these, you know

>> To be fair, for people following along, it is a different model, different architecture. >> It's incredibly fast. >> I have not quite figured out how to use it yet, to be honest. I was adapting it to the same sorts of tasks I would use extra-high reasoning for, and it would blow through three compactions before writing a line of code. >> And that's another big thing with 5.4, right, the million-token context, which is huge in agentic work. You can just run for longer before you have to compact; the more tokens you can spend on a task before compacting, the better you'll do. >> That's right. I'm not sure how to deploy Spark. I think your intuition is right that it's great for spiking out prototypes, exploring ideas quickly, doing those documentation updates. It is fantastic for us in taking feedback and transforming it into a lint, where we already have good infrastructure for ESLint in the codebase. These sorts of things it's great at, and it allows us to unblock quickly, doing those

antifragile healing tasks in the codebase. >> Yeah, that makes sense. So you guys are pushing models to the freaking limit. What can current models not do well yet? >> They're definitely not there on being able to go from new product idea to prototype >> in a single one-shot. Where I find I spend a lot of time steering is translating the end state of a mock for a net-new thing, think no existing screens, into a product that is playable with. Similarly, while this has gotten better with each model release, the gnarliest refactorings are the ones I spend the most time with, right? The ones where I am interrupting the most, the ones where I am now double-clicking to build tooling to help decompose monoliths and things like that. This is a thing I only expect to get better. Over the course of a month, we went from low-complexity tasks to higher-complexity and bigger tasks in both these directions. So this is what it means to not bet against the model,

right? You should you should expect that it is going to push itself out into these higher and higher complexity spaces. Yeah. So, the things we do are robust to that. It just basically means I'll be able to spend my time elsewhere and figure out what the next bottleneck is. I >> I do think it's also a bit of a different type of task, right? Like Codex is really good at codebase understanding working with code bases but companies like lovable uh bolt replet they solve a very different problem scaffold of zero to one right idea at a product and it's like there are people working on that and models models are also pushing like step function changes there it's just kind of different than the software engineering agents you see today right >> like I said the model is isomeorphic to myself uh the only thing that's different is figuring out how to get what's in here into context for the model. And for these whites space sort of projects, I myself I'm just not good at it. uh which means that often over the agent trajectory I realize the bits

that were missing, which is why I find I need to have the synchronous interaction. And I expect, with the right harness, with the right scaffold that's able to tease that out of me or refine the possible space, right? To be super opinionated around the frameworks that are deployed, or to put a template in place, right? These are ways to give the model all those non-functional requirements, that extra context to anchor on, and avoid that wide dispersion of possible outcomes. >> Thank you for that. Uh, I wanted to talk a little bit about Frontier. >> Yeah, sure. >> Overall, uh, you guys announced it maybe like a month ago. Um, and there's a few charts in here and there. It's kind of like your enterprise offering, is how I view it. Is there one product or is there many? >> I can't speak to the full product roadmap here, but what I can say is that Frontier is the platform by which we want to do AI transformation of every enterprise, from big to small. And the way we want to do that is by making it easy to deploy highly observable, safe, controllable,

identifiable agents into the workplace, right? We want it to work with your company's native IAM stack. We want it to plug into the, uh, security tooling that you have. We want it to be able to plug into the workspace tools that you use. >> So, you're just going to be shipping SDKs, right? >> We expect that there will be some harness things there. The Agents SDK is a core part of this, to enable both startup builders as well as enterprise builders to have a works-by-default harness that is able to use all the best features of our models, from the shell tool down to the Codex harness with file attachments and containers and all these other things that we know go into building highly reliable, complex agents. We want to make that great, and we want to make it easy to compose these things together in ways that are safe. For example, right, like the GPT-OSS-Safeguard model,

one thing that's really cool about it is it ships the ability to interface with a safety spec. Safety specs are things that are bespoke to enterprises. We owe it to these folks to figure out ways for them to instrument the agents in their enterprise: to avoid exfiltration in the ways they specifically care about, to know about their internal company code names, these sorts of things. So providing the right hooks to make the platform customizable, but also, you know, mostly working by default for folks, is kind of the space we are trying to explore here. >> Yeah. And the Snowflakes of the world just need this, right? >> Yeah. >> The Brexes of the world, the Stripes. Yeah, makes sense. I was going to go back to, you know, I think the demo videos that you guys had were pretty illustrative. It's also, to me, um, an example of very large-scale agent management. >> Yes. >> Like you give people a control dashboard where, if you play any one of these multiple-agent things, you can dig down to the individual instance and see what's going on. >> Yes, of course.
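None of this tooling is public, so purely as a sketch: the kind of hook Ryan describes, a bespoke per-enterprise safety spec that catches internal company code names before an agent's output leaves the boundary, might look roughly like this. Every name here (`SafetySpec`, `outbound_guardrail`, the code names) is hypothetical, not Frontier's actual API.

```python
import re
from dataclasses import dataclass, field

@dataclass
class SafetySpec:
    """Hypothetical bespoke, per-enterprise policy: terms that must not leave the boundary."""
    internal_code_names: set[str] = field(default_factory=set)

    def violations(self, text: str) -> list[str]:
        """Return any internal code names found in outbound agent text."""
        found = []
        for name in self.internal_code_names:
            # Word-boundary, case-insensitive match for each protected code name.
            if re.search(rf"\b{re.escape(name)}\b", text, re.IGNORECASE):
                found.append(name)
        return found

def outbound_guardrail(spec: SafetySpec, agent_output: str) -> str:
    """Hook run on every outbound agent message; redact rather than exfiltrate."""
    for name in spec.violations(agent_output):
        agent_output = re.sub(
            rf"\b{re.escape(name)}\b", "[REDACTED]", agent_output, flags=re.IGNORECASE
        )
    return agent_output

spec = SafetySpec(internal_code_names={"Project Atlas", "Nimbus"})
print(outbound_guardrail(spec, "Status update on Project Atlas and Nimbus rollout."))
# -> Status update on [REDACTED] and [REDACTED] rollout.
```

The point of the shape, per the conversation, is that the spec is the customizable part while the hook itself works by default.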

>> But who's the user? Is it like the CEO, the CTO, the CIO, something like that? >> So, you know, at least my personal opinion here: the buyer we're trying to build product for here is, one, employees who are making productive use of these agents, right? That's going to be whatever surfaces they appear in, the connectors they have access to, things like that. Something like this dashboard is for IT, your GRC and governance folks, your AI innovation office, your security team, right? The stakeholders in your company that are responsible for successfully deploying into the spaces where your employees work, as well as doing so in a safe way that is consistent with all the regulatory requirements that you have and customer attestations and things like that. So it is kind of an iceberg beneath the actual end user. >> Yeah, you jump, like, every layer in the UI is going down a layer of abstraction in terms of the agent, right?
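As a rough illustration of that "layer of abstraction per UI layer" idea (a hypothetical data model, not Frontier's actual one): a run tree where a fleet-level view can be expanded, agent by agent, down to an individual trajectory's steps.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str      # e.g. "tool_call", "message"
    summary: str

@dataclass
class AgentRun:
    agent_id: str
    steps: list[Step] = field(default_factory=list)
    children: list["AgentRun"] = field(default_factory=list)  # sub-agents it spawned

def drill_down(run: AgentRun, depth: int = 0) -> list[str]:
    """Flatten the run tree into an indented audit view, one indent per layer of abstraction."""
    lines = [f"{'  ' * depth}{run.agent_id} ({len(run.steps)} steps)"]
    for child in run.children:
        lines.extend(drill_down(child, depth + 1))
    return lines

# A toy fleet: one orchestrator that spawned two sub-agents.
fleet = AgentRun(
    "orchestrator",
    steps=[Step("message", "plan")],
    children=[
        AgentRun("researcher", steps=[Step("tool_call", "search")]),
        AgentRun("coder", steps=[Step("tool_call", "edit"), Step("tool_call", "test")]),
    ],
)
print("\n".join(drill_down(fleet)))
```

Each `children` hop is one click deeper in the dashboard; the leaf level is the individual trajectory a security or GRC reviewer would audit.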

>> Yep. >> Yeah. Yeah. I think it's good. >> Yeah. The ability to dive deep to the individual agent trajectory level is going to be super powerful, not only from like a security perspective, but also for someone who is accountable for developing skills. One thing that was interesting that we also blogged about shipping was, uh, an internal data agent, which uses a lot of the Frontier technology in order to make our data ontology accessible to the agent and things like that, to understand what's actually in the data warehouse. >> Yeah. Semantic-layer type things. Uh, I was briefly part of that world. Is it solved? I don't know. It's actually really hard for humans to agree on what revenue is. >> Yes. >> You know. >> Yes. What is an active user? >> There's, like, what, five data scientists in the company that have defined this golden metric? >> And they're all different. Yeah. >> And there's also internal politics as to attribution, like, I'm marketing, I'm responsible for this much, and sales is responsible for this much, and they all add up to more than 100%, and I'm like,

well, you guys have different definitions. >> Yeah. And if you're a startup, everything is ARR, you know. >> So I think that's cool. Oh, you guys blogged about this? Okay, I didn't see this. Uh, yeah. Is this the same thing? Uh, is this what you're referring to? >> Uh, yes. >> Okay. Well, we'll send people to read this, your data agent post. >> This one. >> Uh, yeah. I don't know if you have any highlights. >> No, no, no. I mean, in general, from the post, a lot of good things to read. >> Yeah. Lots of homework for people. Uh, no, but like, data as the feedback layer: you need to solve this first in order to have the product's feedback loop closed. >> That's right. >> Like, for the agents to understand, and this is not something that humans haven't had to figure out either. >> This is how you build agents that do more than coding, right? To actually understand how you operate the business, you have to understand what revenue is, what your customer segments are, right? What your product lines are. Like, looping back to the codebase that we described here for harness engineering, one thing that's in core-beliefs.md is who's on the team,

what product we're building, who our end customers are, who our pilot customers are, what the full vision of what we want to achieve over the next 12 months is. These are all bits of context that inform how we would go about building the software. >> Oh my god. >> So, we have to give it to the agent, too. >> I'm guessing that stuff is pretty dynamic and it changes over time, too, right? Like, part of it is it's not just a big spec; you have it as one of the things and it will iterate. >> One thing that I think is going to break your mind even more is we have skills for how to properly generate deep-fried memes and have reacji culture in Slack, because with the Slack ChatGPT app that you're able to use in Codex, I can get the agent to post on my behalf. >> It's part of the culture. Humor is part of AGI. Uh, is it funny? >> It's pretty good. Yeah. >> Okay. Yeah, it's pretty good at making, you know... I think humor is like a really hard intelligence test, right? You have to get a lot of context into

like, very few words. >> This is why 5.4 is such a big uplift for us, right? It's the memeing. Yeah, for sure. >> Yeah. Yeah, it's really cool. >> So 5.4 can chirp us. That's the takeaway. >> Yeah. Maybe, um, when y'all are done here today, ask Codex to go over your coding agent sessions and to roast you. >> Love it. I'll give it a shot. Uh, just coming back to the final point I wanted to make: yeah, I think that, like, you guys are working on this, but this is a pattern that every other company out there should adopt, regardless of whether or not they work with you. To me, I saw this and I was like, every company needs this. >> I mean, this is multiple businesses: what it takes to get people to yes, to actually realize the benefits, the distribution layer. Um, and I think it sounds boring to people, like, oh, you know, it's for safeguards and whatever, but, um, I think to handle agents

at scale like you're envisioning here, um, I don't know if it's a real screenshot or a demo, but this is what you need. This is my original sort of view of what Temporal was supposed to be: you build this dashboard and you basically have every long-running process in the company in one dashboard, and that's it. >> That's right. That's right. >> Yeah. I think it's pretty customized toward every enterprise, right? Like, you care about different things. >> There's a lot of customization, right? >> But like, I mean, there'll be multiple unicorns just doing this as a service. Like, I don't know, I'm very Frontier-pilled, if you can't tell. >> Amazing. >> But it only clicked because obviously this came out first, then the harness post, and then Symphony, and it only clicked for me that this is actually kind of the thing you ship to do that. >> Yeah. Yeah. There's a set of building blocks here that we assembled into these agents, and the building blocks themselves are part of the product, right? The ability to steer, to revoke authorization if a model becomes misaligned. Like, all of this is

accessible through Frontier, >> and there's going to be a bunch of stakeholders in the company that have the things they need to see in the platform to get to yes. >> So we'll build all those into Frontier so that we can actually do the widespread deployment. That's the fun part. >> Yeah. Yeah. I'm also calling back to, uh, this levels-of-AGI thing. I don't know if OpenAI is still talking about this, but they used to talk about five levels of AGI, and one of them was like, oh, it's like an intern, the coding software engineer, and at some point it was AI organizations, and this is it, right? This is level four or five, I can't remember which level, but somewhere along that path was this. >> You know how I mentioned that my team is having fun sprinting ahead here, right? And we do this thing where we're collecting all the agent trajectories from Codex to slurp them up and distill them. This is what it means to build our team-level knowledge base. We happen to reflect it back into the codebase, but it doesn't have to be that way, right? And it doesn't have to be bound to just Codex, right? I want ChatGPT to also learn our meme culture

and also the product we are building, and how, right? So that when I go ask it, it also has the full context of the way I do my work. And I'm super excited for Frontier to enable this. >> Yeah, amazing. What do the model people say when they see you do this? Like, you have a lot of feedback, obviously; you have a lot of usage, you have a lot of trajectories. I don't imagine a lot of it's useful to them, but some of it is. >> You have this too: you deploy a billion tokens of intelligence a day, and this was, you know, this was at the beginning of 2026. You're, yeah, you know, cooking. >> Yeah, there's this fundamental tension, which I think you have talked about, between whether we invest deeper into the harness or we invest deeper into the training process to get the model to do more of this by default. >> Yeah. >> And I think success for the way we are operating here means the model gets better taste, because we can point the way there, and none of the things we have built actively degrade agent performance,

because really all they're doing is running tests, and running tests is a good part of what it means to write reliable software. If we were building an entire separate scaffold around Codex to restrict its output, that I think would be additional harness that would be prone to being scrapped. But if instead we can build all the guardrails in a way that's just native to the output that Codex is already producing, which is code, then, one, there's no friction with how the model continues to advance, but also it's just good engineering. And that's the whole point. >> Yeah. So I've had similar discussions with research scientists, where the RL equivalent is on-policy versus off-policy. >> Yeah. >> And you're basically saying that you should build an on-policy harness, which is already well within distribution, and you modify it from there. But if you build off-policy, well, it's not that useful. >> That's right. >> Super cool. Well, any thoughts, any things that we haven't covered that we should get out there? >> Just, uh, I've been super excited to kind

of benefit from all the cooking that the Codex team has been doing. They absolutely ship relentlessly. This is one of our core engineering values, ship relentlessly, and the team there embodies it to an extreme degree. Oh yeah, to have 5.3 and then Spark and 5.4 come out within what feels like a month is just phenomenally fast. >> Exactly: a month ago it was 5.3, and yesterday was 5.4. Yeah. I mean, do we have every month now? Is 5.5 next? >> Uh, you know, I can't say; the Polymarket folks would be very upset, right? >> Uh, well, I think it's interesting that it's also correlated with the growth, you know. They announced that it's like 2 million, uh, users. But I almost don't care about Codex anymore. This is it, this is the game, man. It's like coding, cool; software; knowledge work. >> That's right. You know, this is the thing to chase after, and, uh, this is one of the things that my team is excited to support: get the whole self-hosted harness thing working, which you have done, and the rest of us are trying to figure out how to catch up, but then do

things, you know, right with you. >> Do things. >> That's right. You can just do things. That's the line for the episode. >> That's it. Any other calls to action? You're based in Seattle, your team, I'm guessing. >> The new Bellevue office. >> The new Bellevue office. We just had the grand opening yesterday as of the recording date, uh, which was fantastic. Beautiful building. Super excited to be part of the Bellevue community, building the future in Washington. And I would say that there is lots of work to be done in order to successfully serve enterprise customers here, uh, in Frontier. We are certainly hiring. And if you haven't tried the Codex app yet, please give it a download. We just passed 2 million weekly active users, growing at a phenomenally fast rate, 25% week over week. Please come join us. >> Uh, yes. And I think that's an interesting, I don't know, my final observation: OpenAI is a very San Francisco-centric company. Like, I know people who turned down the job, or didn't get the job, because they didn't want to move to SF, and now they just have

a choice, right? You have the OpenAI London office, you have the Seattle one. And I wonder if that's going to be a shift in the culture, right? Obviously you can't say, but... >> I was, uh, one of the first engineering hires out of our Seattle office, so it was very natural. Success has been part of what I have been building toward, and it has grown quite well, right? We have durable products and lines of business that are built out of there. Uh, a ton of zero-to-one work happening as well, which is kind of the core essence of the way we do applied AI work at the company: to sprint after the new, to figure out where we can actually successfully deploy the model. So, uh, yes, 100%. We also have a New York office, too, uh, that has a ton of engineering presence. >> Yeah. Uh, exactly, these are my roadmaps for AIE. Wherever people hire engineers, I will go. >> That's right. >> It's a cool office, too. New York is the old REI building, I believe. The REI office. >> Yeah, it's just... >> No, you'll never be as big, right? In New York you just can't get the size of office that they need.

The New York office has a very, like, Mad Men sort of vibe. It's beautiful. Uh, the Bellevue one is very green, gold fixtures, very Pacific Northwest. It's a very cool place, which a lot of people are there for. People like New York; they want to be in New York, right? >> Yeah. Yeah. We have a fantastic workplace team that has been building out these offices. It really is a privilege to work here. >> Yeah. Excellent. Uh, okay. Well, thank you for your time. You've been very generous, and you've been cooking, so I'm going to let you get back to cooking. >> It's been amazing chatting with you folks. Uh, happy Friday. >> Happy Friday.