Joseph Nelson, CEO of Roboflow, breaks down the current state of computer vision and why it still lags behind language models in real-world understanding, latency, and deployment. He explains how Roboflow distills frontier vision capabilities into efficient, task-specific models using techniques like Neural Architecture Search and RF-DETR. The conversation covers Chinese leadership in vision, Meta and NVIDIA’s roles in the ecosystem, coding agents, and emerging S-curves from world models to wearables.
Hello and welcome back to the cognitive revolution. Today my guest is Joseph Nelson, CEO of Roboflow, a computer vision platform that supports more than 1 million engineers and more than half of the Fortune 100 as they seek to turn proprietary image and video data into a competitive advantage. We begin with an overview of computer vision capabilities today. Joseph notes that while language is fundamentally a human construct and inherently optimized to be understood, the real world contains a fat tail of chaotic scenes which are not at all optimized for understanding. And thus, just as the vision transformer came about 3 years after the original transformer, computer vision today is roughly where language capabilities were 3 years ago with the introduction of ChatGPT and GPT-4. Which is to say that while frontier models can do amazing things and most problems can be solved if you're willing to put in the work to fine-tune and pay any inference cost, we have a long ways to go before foundation models will
really be able to do it all. To make this concrete, Roboflow maintains a site called visioncheckup.com which highlights the spatial reasoning, precision measurement, and grounding failures that still plague even the best multimodal models today. And importantly, even when frontier models can solve a particular task, you can't wait 40 seconds for a reply when you're powering instant replay at Wimbledon or monitoring for defects on a high throughput manufacturing line. And so, there's often still a lot of work left to do to get vision models running efficiently enough to meet production latency and edge deployment requirements. This is where Roboflow comes in and I was super interested to hear Joseph describe what it looks like to go from an open-source vision model to deploying your own task-specific model today. He emphasizes first the importance of establishing clear requirements up front because the performance thresholds that different customers need to hit on their respective use cases can vary really quite dramatically. From there, the process often involves distilling frontier model capabilities
into much smaller models like Roboflow's own RF-DETR model, which they derived from Meta's DINOv2 backbone using a really interesting training technique called neural architecture search, which in turn uses a weight-sharing technique to train thousands of network configurations at once, all within a single training run. This process ultimately produces a set of models of varying sizes that collectively map out a performance Pareto frontier. And today, Roboflow has productized this approach so that anyone can now run it on their own data set and come out the other end with an N=1 model that is optimized specifically for their problem. From there, we cover a number of additional topics as well. Joseph explains that Chinese companies have consistently led in computer vision and how much the American open-source ecosystem currently depends on Meta, but also why he's optimistic that NVIDIA will fill the gap if Meta's new AI leadership changes priorities. He also describes how coding agents are expanding the market for Roboflow's tools, how skills are emerging as a new
go-to-market vector, and how Roboflow plans to use a first-party agent to guide users through the process of building computer vision pipelines. We also discuss the state of AI's aesthetic taste and why the inherent subjectivity of aesthetic preferences makes this such a hard problem. We hear about the emerging S-curves that Joseph is watching including world models, vision-language-action models being developed for robotics, inference time scaling for vision, and wearables which are now selling millions of units per year. We get his vision for how computer vision contributes to a good life as AI matures which includes everything from precision agriculture and food safety to self-driving commutes and real-time sports analytics. And finally, he explains why he worries that overly opinionated regulation could accidentally stifle all sorts of surprising but valuable use cases and why he recommends that policy makers focus on outcomes instead of trying to regulate the tools that people are using. When it comes to computer vision, Joseph has quite literally seen it all. So,
whether you're looking to catch up on the field like I was or looking for a practical framework with which to approach a specific challenge, I think you will find a lot of value in, and I hope you enjoy, my conversation with Joseph Nelson, CEO of Roboflow. Joseph Nelson, CEO at Roboflow, welcome to the cognitive revolution. I'm excited for this. So, regular listeners will know I really got into AI in a full-time obsessive way in my role as founder of Waymark. And it was such an exciting time 4 years ago when things were just starting to work. I ended up going really deep on what was available in computer vision at the time with CLIP and BLIP and BLIP-2 and CLIP embeddings, trying to figure out how to use it. The problem that we had at the time was we have all these small business users. We had developed a pretty good technique for scraping their websites and their kind of online presence and creating an image library for them. But then what to do with that image library, right? It was initially just a total jumble
of photos. We couldn't make any sense of it. We made very blind guesses as to what we would actually put into content for them. And that obviously had a, you know, a long way to go before it really started to work. So, I had a ton of fun in like the 2022 into 2023 time frame getting deep into the weeds on that stuff. And obviously a lot has happened since. So, I'm really excited to catch up on a few years of computer vision progress in 90 minutes or so. Maybe let's start by just kind of setting the stage. Like, where are we today in terms of computer vision? You can come at that from a lot of different angles. Maybe start with use cases. You know, what are the use cases out there that are really well established, that are driving the most volume, that are driving the most value? Give us kind of a survey of the lay of the land. Since you brought up CLIP, maybe we can start in terms of some of the research that's powering what's now possible and then let's also do use cases that flow from there. With vision it's funny, cuz if we think about AI and the trends of machine learning, originally vision was home base. You had ImageNet, you had MNIST, and
deep learning gave rise to "is that photo a cat or a dog" on the internet. And then you had language, I would say, almost jump out and take the lead in terms of wow factor and in terms of understanding with the introduction of the transformer in the "Attention Is All You Need" paper in 2017. And then you have almost 5 years of language cooking with scaling laws and Chinchilla and GPT-2. And then you start to get language products in GPT; I would say GPT-3 and GPT-4 is really where things started to break out, with ChatGPT in 2022. That 5-year delay from the introduction of the transformer to products that become used by nearly a billion users, I think every single week now, is now happening in vision, because you had the vision transformer get introduced in 2020. And so that ends up being another stepwise change of what's possible and what capabilities are easy, or easier maybe, out of the box. But to your point, what's interesting is historically
there's been this divergence of: is this a language problem or is this a vision problem? And modalities are crashing together, because just like our brains, you get more context if you can use language and vision together. However, there are some pretty meaningful differences in visual understanding, both in the way that visual models work and in the use cases where it's most impactful. And I'd say actually one of the biggest ones is, I think about the analogy of the way our brains work as a useful way to inspire how our systems for visual reasoning will work. As in, we have this big LLM reasoning engine in our heads that is our brain, but we also have these rods and cones and a visual cortex that operate and make decisions in what we jokingly call your lizard brain, a fast-reaction way of understanding the world. And biology has evolved to have specialized systems for visual understanding, distinct from broad-scale reasoning, with more neurons dedicated to that than to any other sense. And I think the same will be true from biological inspiration for
the systems that get used in machine learning. So, what does that mean in practice, beyond the abstract idea? It means a lot of stuff runs at the edge. A lot of stuff runs low latency. A lot of stuff runs out in the real world. So, for example, in a lot of language or multimodal or multi-agent reasoning problems, you have the benefit of assuming you have maybe near-infinite compute, cuz you can run a long-running job in a data center. The visual tasks where vision is most useful tend to be where you don't already have a human or eyes on the problem. You're understanding an environment maybe in a remote location. Maybe a manufacturing line. Maybe you're shipping a product. Maybe you've got cells underneath the microscope. Maybe you're looking through a telescope and discovering new galaxies. Maybe you're building robots. And for a lot of those use cases, not all but many, you need fast reaction times in addition to large-scale reasoning. And so you see this increasing divergence and specialization, where vision is especially helpful for low-latency tasks and for things
that are, it feels maybe intuitive, but out in the real world. Like systems that we want to observe. Whereas LLMs and language are inherently a human construct. But the visual world isn't inherently a human construct, right? Language only exists where people do, in systems that humans have crafted. The world's much bigger than just language. And anecdotally, the number of distinct scenes in a day is more diverse than the number of unique words you probably read in a day. And so, that heterogeneity, that richness, makes visual reasoning I think harder. It means that the long tails are fatter and it means that the use cases tend to be out in the world, for lack of a better way of describing it. So, the use cases that we see become a natural sample of where visual AI and computer vision is being used in the real world. About a million devs download our open source every 30 days, and about half the Fortune 100 build on the platform. So, we have this kind of insight into what's actually making its way to production, where people are tinkering. And you know, it tends to be these things that are, I'd call them, operationally
complex problems, maybe like the enterprise sets of use cases. And on the platform, you get broad amounts of inspiration, which could be a hobbyist that wants to understand, I don't know, I like to play board games. The dice that you just threw: I swear every time I play Catan my numbers are the ones that get rolled the most. So, I want a camera to prove to my friends I'm the most resource efficient relative to the rolls that came out. That's on the joking side. Or there's this YouTuber that maintains a channel called Dave's Armoury out of Canada, and he's built a flame-throwing, weed-killing robot with Roboflow. Or he built his son a self-driving couch that follows him around the yard. Like these silly things. And then you have the more serious use cases, like powering instant replay at sports broadcasts at Wimbledon, or doing quality assurance on products that are being produced at Rivian, or any sort of physical-world thing. But one thing that I deeply believe, and I think the rest of the world's kind of coming around to this, is that visual AI, visual understanding,
and even that part of multi-agent reasoning, is going to be bigger and more important than just language. As in, for AI to reach its full potential, it needs to be out in the world. It needs to understand, see, reason. Because the world's a pretty big place, and the universe is even bigger. And for the systems that we want to use and rely on, they need that type of capability. So, linking that back to the research that's progressed, there's a lot of work to be done, but there's been a lot of progress from CLIP to the present that we can talk about in detail. But to set the stage, I would just say I'm really optimistic that we're approaching the ChatGPT moment for vision, and the infrastructure to power all of that is coming online, which means you're about to see a Cambrian explosion in all the places it shows up, and consumer expectations are just going to be disappointed absent the ability for folks to have visual understanding in the products and services we use day-to-day. So, that's maybe how I'd think about what's going on and what
that means. Those use cases underpin the ways the research is making its way to production so far. When you said that you think we're coming up on the ChatGPT moment, I thought that was quite interesting, because sometimes I give talks to kind of an audience that I'm trying to catch up on what's going on with AI, and I often give them the MNIST example as, like, you know, look how simple this is for us, but we still can't write explicit code to identify these handwritten digits. As simple as that problem is, there's no explicit algorithm for it even today, right? Okay, that kind of gives people a sense of why it is that we need this sort of fuzzier kind of intelligence. And then I zoom forward to the GPT-4 system card, and I show the image that I'm sure you're familiar with of the guy hanging off the taxi in New York doing ironing on the back of the taxi. And show them how, you know, we went from an ImageNet breakthrough in 2012 to that capability 10 years later, and
where the model says, you know, it's unusual to see this guy hanging off the back of the taxi doing ironing. So, I was going to ask, and it sounds like your answer's going to be no, but I want to get, you know, a lot deeper under the hood on that. I was going to ask to what degree we could consider vision almost a solved problem. And I know that not everything works immediately out of the box, not everything is going to work maybe at the cost profile that you'd want or the latency, you know, requirement that you have. But is there any vision problem that we couldn't solve if we really put our minds to it today? That was the working definition I had for solved problem. Do you think we are not there? And if we're not there, why aren't we there? What can't we do yet? The way I think about something being a solved problem is: I just ask the model and it almost impresses me. It delights me that it already understands and can do the thing I asked it to. And so, I think that's why ChatGPT was such an aha moment for folks, because you no longer had to train a model to
understand sentiment or describe text or whatever. Just, I talk with it, it talks back; it feels like talking to, I don't know, a first-grader perhaps. And so in vision, as a solved problem, I think there's a subset of places where that's true. But it's not nearly as solved as language is. And I think the reason for this, in my mental model for what it takes to get us there, is what we were mentioning earlier: the world's very heterogeneous compared to language. Think about this in a very first-principles way, in terms of the amount of data that it takes to encode text with Unicode. I can represent all of text in Unicode in memory much more efficiently than even representing a single image, because I have three color channels, 0 to 255, RGB, pixel by pixel. And so, that data disparity, how much more information it takes to even
encode a visual scene, I think is maybe an anecdotal example of why there's more heterogeneity in understanding visual scenes. But maybe more concretely, it's again that the number of scenes in a day is different from the number of words that you read in a day. So, my mental model for this is I think about the world as your standard bell curve distribution. And in the fat center of that curve, what we're measuring is the frequency with which a thing exists out in the world. If you went out, let's say, and you took a walk, or maybe just throughout your full day, and you wrote down, we'll say, every object that you see, and then you went and looked back at your notes at how many objects you saw and maybe how long you looked at them, you would have something of a bell curve of the things that you saw that repeatedly showed up: person, car, food. And there would be some things that are longer tail, if you will. I don't know, maybe one day you were changing your oil. So, you were under the hood of your car. Now, even that's something that you're not going to do every day. So
in vision, having a model that can reach into those long tails, because the world is heterogeneous and because I think those tails are fatter, is taking just a bit longer, both in having the data represented and in having models that can reason about all the various different scenes and videos that exist out there. So, what does that mean? Some things are, quote unquote, solved problems. Like count the people in this image. Or increasingly OCR feels like a solved problem. There's a GLM model recently that we're really excited about: it can run in real time, you can query it with, "Hey, how much was my salsa on this receipt?" Or, from this Google Street View image, what's the house address on the left? And it's able to visually reason and extract and pull the correct answer almost always. And so, something like that feels closer to a solved problem. But the nature of how diverse some scenes are means
it's going to take representation and probably some reasoning models to be able to reach into those long tails. And what's happening, I used to say in slow motion, but it feels faster now: what's happening in vision is you're getting models, multimodal models, and this is the big LLMs like Gemini just as much as open models like Molmo, just as much as models like the DETR family of transformers, that are increasingly pushing outwards on this visualized bell curve, where more and more of those things that you see on a given day are understood zero- or multi-shot. And so it becomes maybe a semantic question of what you consider a solved problem. If you're in the middle of that bell curve, yeah, it's a solved problem. But for the things that are so impressively surprising and delightful, it becomes a question of how long until someone starts to query and ask for things that wouldn't have been represented in training. And so, I think we're riding that curve, and the expansion is getting faster.
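(Editor's note: a rough back-of-the-envelope sketch of the encoding-size disparity Joseph describes above, assuming a short paragraph of UTF-8 text and a single uncompressed 1080p RGB frame. The specific numbers are illustrative assumptions, not figures from the conversation.)

```python
# Back-of-the-envelope: bytes needed to store a paragraph of text vs. one raw video frame.
# Assumptions (illustrative): ~160 words of ASCII-range text, a 1920x1080 frame, 3 color channels.

paragraph = "a typical sentence of about eight words repeated " * 20  # ~160 words
text_bytes = len(paragraph.encode("utf-8"))                            # roughly 1 KB of UTF-8

width, height, channels = 1920, 1080, 3                                # 8-bit RGB, values 0 to 255
frame_bytes = width * height * channels                                # about 6.2 MB uncompressed

print(f"text: {text_bytes:,} bytes")
print(f"frame: {frame_bytes:,} bytes")
print(f"one frame holds roughly {frame_bytes // text_bytes:,}x more raw data than the paragraph")
```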
Now, one other complexity in vision is what we talked about earlier, which is that a lot of vision is edge-constrained environments. You want answers now. You're running a webcam, it's on your phone, it's in the palm of your hand. And so, that means you also don't have the benefit of maybe waiting 40 seconds for a reply from a model for the thing that you were interested in querying. That doesn't mean the problems are intractable. But heuristically, I see maybe an 18-month delay between a SOTA capability from a multimodal cloud-available model and something that you can get to run on an edge device, which maybe here we could define as a Jetson Orin-level computer, or maybe even an iPhone, where the exact GPU comparison you would make is opaque. So, those things make vision feel unsolved. But I still think that what's going to continue to happen is the expanding nature of that bell curve. And if you think about that mental model for where we're going with
visual capabilities, then I think that's a good way to think about where the field's headed. And frankly, the problems to solve. Hey, we'll continue our interview in a moment after a word from our sponsors. Everyone listening to this show knows that AI can answer questions. But there's a massive gap between here's how you could do it and here I did it. Tasklet closes that gap. Tasklet is a general-purpose AI agent that connects to your tools and actually does the work. Describe what you want in plain English. Triage support emails and file tickets in linear. Research 50 companies and draft personalized outreach. Build a live interactive dashboard pulling from Salesforce and Stripe on the fly. Whatever it is, Tasklet does it. It connects to over 3,000 apps, any API or MCP server, and can even spin up its own computer in the cloud for anything that doesn't have an API. Set up triggers and it runs autonomously, watching your inbox, monitoring feeds, firing on a schedule,
all 24/7, even while you sleep. Want to see it in action? We set something up just for Cognitive Revolution listeners. Click the link in the show notes and Tasklet will build you a personalized RSS monitor for this show. It will first ask about your interests and then notify you when relevant episodes drop, however you prefer. Email, text, you choose. It takes just 2 minutes and then it runs in the background. Of course, that's just a small taste of what an always-on AI agent can do, but I think that once you try it, you'll start imagining a lot more. Listen to my full interview with Tasklet founder and CEO Andrew Lee. Try Tasklet for free at tasklet.ai and use code cogrev for 50% off your first month. The activation link is in the show notes, so give it a try at tasklet.ai. Support for the show comes from VCX, the public ticker for private tech. For generations, American companies have moved the world forward through their ingenuity and determination. And for generations, everyday Americans could be
a part of that journey through perhaps the greatest innovation of all, the US stock market. It didn't matter whether you were a factory worker in Detroit or a farmer in Omaha, anyone could own a piece of the great American companies. But now, that's changed. Today, our most innovative companies are staying private rather than going public. The result is that everyday Americans are excluded from investing and getting left further behind while a select few reap all of the benefits. Until now. Introducing VCX, the public ticker for private tech. VCX by Fundrise gives everyone the opportunity to invest in the next generation of innovation, including the companies leading the AI revolution, space exploration, defense tech, and more. Visit getvcx.com for more info. That's getvcx.com. Carefully consider the investment material before investing, including objectives, risks, charges, and expenses. This and other information can be found in the fund's prospectus at getvcx.com. This is a paid sponsorship.
So, I want to work through that kind of Pareto frontier of performance and cost, latency, you know, and where-it-can-run trade-offs. But let's do one more double click on kind of the most expensive end of that curve, which is the cloud-available frontier models, I think you described them as. Obviously, these things are famously spiky. I would say it's been a while for me since I've had an image use case where I was like, oh, this thing can't do it or can't see it. I do remember some of those not too long ago, for example, with the ARC-AGI puzzles. I remember trying frontier models on some of those puzzles, getting strange results, and then kind of working my way back to: can you just describe what the starting state is? And then I was kind of like, oh, well, no wonder it can't do the problems. It can't see the starting state accurately. It can't accurately just define, you know, which boxes are colored what colors. But that's been a while. I guess I don't know. I'm sure you would know: if we
just take ARC-AGI puzzles and put them in today, are they accurately perceived? And are there other things that would be good, kind of representative examples of spikiness, where people might be surprised that, oh, I kind of wouldn't have guessed that a frontier LLM wouldn't be able to see this the right way? Also, do they work with few-shot prompting? Obviously, few-shot has been a huge unlock in general, but does it work for vision? I really don't know that. So, I guess to sum that up, can we go one level deeper in terms of the capability profile of the frontier models? Totally. Yeah. We spent a bunch of time thinking about this and then trying to help people navigate what their expectations should be for the problem that they're solving, and where they may be able to have a zero-shot, multi-shot problem, or where you might be in a world where you need more representation of that problem before you can count on your model. The places where there are most frequently gaps: one thing that we maintain, just as you described with the ironing board example,
you'll see these vibes-based evals on Twitter. We started rounding those up and throwing them onto something called visioncheckup.com, which is: if you took your LLM to the optometrist, what would it do and not do? And so, we continuously update that thing. You'll see things consistently ticking up on that, but they're not all at 100% on the types of tasks that people like to try, including ourselves. So, what are the common types of failure and where are you most likely to be disappointed? One of them is grounding in particular. Grounding referring to segmentation and detection, traditional tasks. But also, say, in your example, finding the starting position in ARC-AGI. Or sometimes I'll try to do crosswords, and I'll be surprisingly disappointed by the model's ability to know where the word goes in the crossword. However, if I just treat it as a text problem and I say, "Hey, here's the clue and here's some of the letters that I know are in the word," the model will almost do better if it just thinks about it like text, if it doesn't also have to think about where in the puzzle that clue is. Or measurement. Basically, you can think about this subset of problems
of things that are inherently very precise, where there's lots of precision involved. And some of this is, I think, the result of the post-training that's applied to these problems: a lot of the labs, and it's a little bit hearsay but seems to be increasingly common knowledge, are not as interested in just solving the segmentation problem. They're interested in solving what the user's intent was, and whether segmentation is a tool call as part of that intent. But even still, the segmentation portion of that chain of thought is pretty unsolved, cuz there's so many different things that you would want to measure, see, and have pixel-perfect representation of. You will see that when you take more time to reason, aka you do more tool calling and you find maybe more specialized expert models for the scene you're looking at, you'll get better results. But in general, I would frame grounding as a still pretty difficult issue where there's a precise pixel-level need. The second place where I think there's disappointment on the Pareto that you
described, of accuracy and speed, is actually still speed. I was using Gemini 3 the other day to try to automatically label a bunch of data for me by prompting it, and it would do it, but it would take 40 seconds each time. And interestingly, the non-deterministic nature of generative AI also led to some pretty difficult downstream results, because for that example, I wanted really precise, consistent results, not necessarily your best guess at what could be correct, and not inconsistent results. And so, that's another challenge: you and I could go try the same problem and get different results from the same model at the same time of day. We maintain this other property called Playground, at playground.roboflow.com, where you can do SAM 3 versus Gemini versus Claude Opus, and what's really funny to me is that I'll find these failure cases and then kind of report them to our team, and then they don't reproduce, and it's actually not because of our use of the
models; the model itself doesn't reproduce the same way. So, that continues to be a little bit of a challenge. After speed, I would say, is the reproducibility challenge. And then, I mean, it's a little bit redundant, but we talked about just the representation. You get into those long tails, and this becomes a function of the type of question you're asking, but there is still a lot of the world that's just not understood by models, in terms of being able to articulate not just the pixel-wise segmentation, but where one thing is with respect to another thing. So, those are common sorts of failure patterns we see. Now, you had a good follow-up, which was how does few-shot help address these things versus zero-shot? And the answer is pretty good, but still not infallible. It helps. And how much it helps obviously is an interesting question. So, one way that we think about these problems is we introduced a benchmark at NeurIPS this last year called RF100-VL, Roboflow 100 Vision Language. Basically, folks that use
Roboflow for research will share their work in an effort to build upon others' work and bring the whole community of computer vision forward. So, there's a large set of open-source datasets that folks can learn from to try and accelerate the problem they're solving. We went and worked with users and researchers and the hundreds of thousands of folks that are sharing open-source projects to move the whole community forward. We created a basket of 100 of them, of problems that seem to be represented in visual AI. And the domains those broke down into were things like industrial, healthcare, flora and fauna, documents. There's a miscellaneous category, cuz of course it's tough to put everything into a single bucket. And we evaluated Gemini and SAM 3 and OpenAI, of course, a number of multimodal LLMs, and then also models like OWL-ViT, which is a model that supports few-shot prompting, an
open-source model. OWLv2 is the most current one. And a model called Grounding DINO; the newest version of Grounding DINO is behind an API, but it's still more open in general. And basically, the net here is we evaluated them by saying, "Can you do successful segmentation the same way as if I passed a human annotator these same instructions? How does the model do, versus how would the person do, at finding all the blank things in an image across those domains?" And the best model at the time we published the work was Gemini 2, but that scored 12 and 1/2% out of 100% across all domains. So that's the gap, how far these models have to go. And this data wasn't arbitrary data. What's really interesting is that this sample wasn't a perfectly curated research dataset, or an imperfect one like COCO or Objects365, works that are very helpful contributions to the field. These are the places that folks are actually using models. Now, there's a second thing we did.
So the zero-shot performance is 12%: cool, interesting. The second thing we did is we ran a competition at CVPR on a 20-dataset subset, just for compute constraints, cuz we thought we could get the point across. So RF20 instead of RF100. We said, if you had few-shot, that is 1, 2, 3, 4, and 5 image examples, how much do you see the models improve and progress by comparison to one another? And the lift there, I think, maximally was around 10% for a single model; I'd have to check the average across all domains. Which is meaningful, especially when you're starting at 12%, but it's not a panacea, right? It's okay, great, I'm helping ground the model with the domain that I'm looking at, but it doesn't solve all the problems. I will say that's a place where I'm bullish. Specifically, I'm bullish about few-shot for visual problems and providing prompts perhaps as image-text pairs or even just as images with the task you're interested in, whether that's grounding or description or
measurement or what have you. And so the story is clear, which is we need continuously better representation of the real-world problems people are trying to solve, and we still have a bit of a ways to go as a community before it can be, quote, totally solved. But progress is happening pretty fast. So that's the progression of maybe where things are. Hey, we'll continue our interview in a moment after a word from our sponsors. One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal private benchmarks. Challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode, and a simple prompt. And wouldn't you know it, Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple years, saving
me countless hours. But as you've probably heard, Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude Code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage. And the result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions. Based on those, I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot. I do rewrite Claude's drafts, not because they're bad, but because it's important to me to be able to fully stand behind everything I publish.
But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information gathering work and allowed me to focus on understanding our guests' recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's claude.ai/tcr. Okay, well that's a that's a good start at the top of the curve. Let's maybe work our way down the curve. I mean, there's obviously multiple reasons that one wants to go down the curve. You may even add to my list, but obviously faster response time is a huge one. Lower cost is another great one. Ability to run on the edge is another great one. Ability to not have to send your data over the wire is another great one.
There may be even more beyond that that you would highlight, but take me from kind of "I can naively send images into one of a few frontier APIs, maybe with few-shot" onward. And I'm not sure which way makes more sense to organize it. You could kind of go from large models to the smallest, most able-to-run-on-edge models, or maybe a different way. Maybe they line up, but maybe there's a difference in how you would think about actually coaching people. Because at least what I normally do when I have a new challenge is: well, let's just see what a frontier model can do out of the box. Now, once I've calibrated myself there, I can approach optimization and fine-tuning, you know, in any number of ways. So, if it's not the same as the kind of curve of trading off convenience for all these other goods of latency and cost,
I'd also be interested to hear how you recommend people navigate the path from that kind of first naive baseline performance to where they should go next. You know, what open-source model should they go try, or how much should they try to fine-tune? How much data are they going to need? What technique should they use? If that doesn't work, then what do they do, until they're finally, you know, in some happy place where they've got everything that they wanted? I appreciate the way you broke that down. You're like, "There's the speed-accuracy curve, which folks know straight off: a bigger model takes more compute and is slower." And then there are these other dimensions that don't fit neatly on a graph, but might be really important to somebody. Like you want to maybe own your AI. Maybe you want to be building your own IP as a business. Maybe it's important to you as an individual. Maybe there's a constraint of the business case itself, of the problem that you're solving, where again low latency is super, super critical. That could also be a privacy consideration, where, as you said, you want to keep the data local to your thing. That could also be security, as a close cousin to that. So there are a number of things that kind of frame where someone is going to
fall along those axes. Now, how to navigate it, and what I think often matters, stems from the job to be done, if you will. So making that less generic and more real, something that people can actually think about: many problems, especially in vision, require there to be an instantaneous response. You have something going down a line. You're watching a live sports broadcast. You need a decision right there, right then. And so of course that puts someone into a real-time category. And so usually there's the need to colocate compute or run something on the edge, and you already know that you're in a class of models where, if you're going to run on the edge, you need to own it. You need to own the model. You need to have the weights. You need to put it into your environment. And that's where open source comes in, and where we're incredibly invested. We invest a ton in open-source for this
reason, in terms of publishing our own models, in terms of supporting open-source repositories. I'm very optimistic about the future of open-source AI, both because I think it's an important way for everyone to realize the benefits from it, just as much as I think it helps discover, bottoms-up, all the ways this technology is going to be impactful. Now, a pattern that we see that's interesting for navigating these two, similar to what you described, is: is this problem tractable? Is this doable at all with the types of models, and then maybe the reasoning and nudges I would use for pre- and post-processing of the model? And then, fitting it to where I want to run it. And so we see this rising trend of distillation. So for example, SAM 3 is a promptable model where I can say, "Find all the sheep. Find all the people. Find all the hockey players in this image." And it'll do a pretty good job. It's not infallible. There might be perspectives you didn't include, but it's state-of-the-art for open-vocabulary
segmentation at a minimum. Let's pretend that you are, well, one of our customers who does the instant replay clipping at Wimbledon and the US Open. And they actually bring compute to the courts, because you have a live broadcast. You have sub-10 nanoseconds to put something over the wire. And there is compute that could run SAM 3, but in this case it just wasn't economical to get compute of that size. So basically, you're in the situation where you want a model that you can own and run at the edge, live over the wire, to, in their case, frame the instant replay that you want to put on broadcast networks. Maybe you see it on ESPN+ or CBS or something. And the things they want to know aren't an open-vocabulary list, right? So, you don't need a model that can see everything. I could prompt SAM 3 for poker chips. I could prompt it for deer. The odds that those things show up at Wimbledon are pretty limited, hopefully. And the odds that I need them for my replay model are even
more limited. So, what I could do is use video from a prior year, prompt SAM 3 and say, "Hey, give me the people, give me the tennis ball, give me the court, give me the net." Have that then go and auto-label a dataset. Then I have a really high-quality, curated, specific dataset, and I can train my own smaller model that I can run on the edge, a model like RF-DETR, which is the current state of the art for doing real-time segmentation and object detection. And not only does it run on the edge, but it's so efficient that I can run multiple streams on a single A100 in this case. And so, I get the cost advantage, and then of course, if something's cheaper, you open up more possibilities. And so, that's a common pattern we'll see: use a model to failure, and if it does work, great, then can I make it be mine, or can I fit it into a use case where I know I'm going to need to run on the edge? Now, the other thing is that even as models eat more of the overall task, it's still okay, of course, to put a model in a harness or do pre- and
post-processing of the model to nudge it in the direction of what you would expect. There's no shame in still using traditional techniques for post-processing. For example, I could ask Gemini to count the tennis players on the court, and it would give me just the response of the count. But I couldn't just ask a detection transformer to count. I could say give me the persons, and then it responds that there are two people found, and then I would add a tiny bit of logic, super fast code, that just counts the class outputs, right? And so, there's no shame in continuing to stitch together post-processing logic for speed optimization or to make something possible. It reminds me of the database wars. The most recent one was vector databases, but even before that, when you had SQL databases and NoSQL databases, and where is it most useful to have document stores, where you have unstructured data that references one another, versus where is it most useful to have structured data or structured tables? And at some point in time, of course, you're going to have to deal with sharding your databases if you have
everything in those records, versus if you maybe had a NoSQL database that's going to scale for you automatically. And there are tradeoffs in both those worlds. It reminds me of that, where it's not a question of purely capability, it's a question of the constraints of the job to be done at hand. And again, running things on the edge in real time, or a model you own, or, we didn't talk about cost, but man, streaming video to the cloud nonstop can be expensive if you have quite a few streams, versus maybe owned compute. So, these are all things that drive reasons why you can use maybe max-ceiling intelligence and then apply it to a system that becomes one that you own and use. And you've seen this trend in language and coding models, too, right? Of specialization and small models and expert models. And so, in some ways that's a place where I think language has drawn inspiration from vision, where there used to be what felt like consensus that it's one model to rule them all. Now, it's increasingly maybe flipped back to: actually, there are domain-specific models and optimizations to be made. And I think that vision is increasingly in
the camp of: you do want a domain-expertise model, because you might be compute constrained in where you're going to run your system. So, I don't know if that's the color you were thinking about for navigating the Pareto and those considerations, but those are the things that we at least commonly see when we see folks approach problems like this. Yeah, that's great. Could you maybe give a little sketch of the scaling laws, so to speak? By which I mean, you know, this is a little dated at this point, but I used to have a talk where I kind of coached people through how to automate tasks, usually language tasks. And one of the big things was just that you kind of have to think in orders of magnitude of data, right? So, if it's not working out of the box, then go get me like 10 good labeled examples, and you could probably put that into context, and then it might work. And if that doesn't work, then you might need to think about a hundred examples, and you might have a small fine-tune on your hands. And if that doesn't work, maybe a thousand will. But I wonder kind of what you see in
terms of, I guess first of all, what do people need in terms of reliability? And then what do they get out of the box, and then how many steps do they typically have to take up those orders of magnitude to actually get there? And maybe also, how does that relate to model size? Obviously, bigger models in general kind of can do more, but especially if you're doing very narrow stuff, it seems like you probably can get everything squished into pretty small models. So, yeah, what are the model sizes and how much data do you need to kind of step up to where people are hitting the thresholds they need to actually deploy? The data question is one that, as you alluded to, gets informed by the business problem, of how many nines you need before you're able to use a thing in production. For example, if you're building a system where your alternative is you have no eyes on the thing, then you're probably more accepting of a less accurate model. Like maybe, I don't know, you're trying to get a
sense of attendance, or you're staffing your retail location. And absent vision, you have no idea how many folks are coming in day in, day out. Maybe you can go check your point-of-sale system as one source. But let's just admit you have to try to get to some source of truth for how many people are in the store at the same time, because not everyone checks out, or there's a different way we want this information. Point is, in this case, we might not have any eyes on the problem. We might not know how many folks are going to be present in our store at a given point in time, or maybe a museum is an even better example, cuz people don't explicitly check out. So it's: I just don't know. And so, if you add a model that is counting and it's 80% accurate at counting, you might be like, great, ship it, put it in production. If we have a sense of whether we have a dozen people or a hundred people at a given point in time, then I'm comfortable with that. So in that case, that's totally fine. Whereas, we have some other problems, where we have healthcare manufacturers that make critical life-saving products.
You can think of IV bags in hospitals. And for them, an escaped defect, meaning a piece of particulate matter making its way into a product like that, is life-threatening, let alone the detrimental impact it would have on the company's reputation and so forth. So, with lives on the line, you need really high recall to find if there's any particulate matter, and you're probably comfortable with adding vision to augment whatever system you currently have, whether it's people or lab inspections or sampling methodology, and easing into where you can get more reliable vision systems. So, ultimately, like a lot of things, it's a business question of what level of accuracy tolerance you accept. Now, in terms of the number of images or videos that it takes to get there, this becomes a function of how varied the scene of interest is. So, on one end of the spectrum, you have self-driving cars that are out in the big, wide-open, crazy world. I remember Karpathy's talk
like five years ago at CVPR where he was like, "Find a stop sign. How hard is that going to be? It's the same red octagon everywhere, right?" And he's like, "Wrong." In photo after photo, he shows, "Here's a stop sign that's blocked by a bush in a parking lot. Here's a stop sign that says only stop if you're going right at the intersection. Here's a stop sign that's on a school bus; that's a stop sign that appears for a temporary amount of time. Here's a stop sign that's on a gate, where when the gate is up, you can see the stop sign, but you don't need to stop." And you're like, "Man, something as simple and straightforward as a stop sign, there's tons and tons of edge cases to understand." And so, that's navigating the world fully autonomously. And yet, of course, look how long it's taken for us to claim victory laps on self-driving cars. And now that they're here, folks are almost surprisingly unexcited about it. Whereas you contrast that with something maybe like a manufacturing line, where you know the thing you want to make, and there might be a finite number of ways that thing is made wrong. Like maybe you produce batteries for an
electric vehicle company. And the defects aren't always the same defect, so it's a tricky problem for traditional rules-based approaches, like look at this image, do some OpenCV, if there's a deviation, then flag it. Machine learning is going to be helpful because the defect could be a different length, or it could present itself differently. But the amount of variation that you're going to see in the cross-section scan of a battery is much more finite compared to driving on the open road. And so, for the orders of magnitude of data that you need: instead of talking about petabytes of video files, and petabytes isn't even sufficient for a car, you can probably get away with hundreds of images, frankly, in the case of a controlled environment, to be able to produce something of utility. The last part of what you mentioned is model size. And so, yeah, the intuition here holds: the smaller the model, probably the faster it is, and also perhaps the less recall or precision it's going to get. Maybe the way to think about this is the RF-DETR family of models, which is
the current SOTA for doing real-time detection and segmentation, and comes in nano, small, medium, large, XL, and 2XL. And at the 2XL size, if you do a fine-tune, it is more accurate than if you fine-tune SAM 3, and 40X faster. Now, of course, if you're doing a fine-tune, you're inherently saying I want this fixed class list, so it's a different prompt task type, right? It's not open vocab; it's I know the things I want to see, and I want to know if those things are present or absent or how many of them there are. And then on the smaller end of the spectrum, you can get pico or nano models from the current family that run at 180-plus frames per second on a Jetson Nano with 4 GB of RAM. And again, based on the difficulty of the problem, if you're doing something simple like seeing oranges on a rack in a grocery store compared to finding particulate matter in an IV bag on a manufacturing
line, you can probably get away with a smaller model that still clears the floor of the business utility while being more compute efficient and delivering the results that you need. And actually, "get away" is probably the wrong framing; that's actually an optimization. That might be your most optimal strategy, cuz you're able to deploy it at higher scale with maybe less compute. So, if you walk along the curve, the good news is I think intuition holds here. Harder problem, more data, bigger model: yeah, all those things follow what your expectations probably would be. You mentioned distillation from kind of foundational open-source models as a way that people are bootstrapping their way into datasets and then obviously fine-tuning downstream of that. And you mentioned this RF-DETR model, where the RF stands for Roboflow, right? So, maybe if I understand correctly, there's some, I don't know if it's on that model in particular or in other places, you've partnered with Meta.
And I'm interested to hear the story, and I'm also interested to hear the lay of the land in terms of who is producing the open-source models, why, and how. In language, of course, there's been a lot of talk lately about Chinese companies distilling from Claude, etc., and Anthropic trying to shut them down, which I think there are obvious business reasons for. There are also questions around what that means about the real strength that the Chinese companies have in terms of making their own models, like how much should we discount what they're able to produce? Does that explain why they're so spiky? And I kind of wonder to what degree this is also happening, or understood to be happening, I'm not sure anybody really knows, on the vision side. Cuz I would say, aside from Meta, the public perception seems to be that the Chinese companies are leading in vision tasks and also,
maybe not fully leading, but certainly leading in the open-source category, in image and video generation. And so, I'm wondering, is there some distillation going on there, that they're kind of taking a shortcut, or do they have just tremendously better chops in that area? If Meta were to have a change in strategy and not be so interested, and obviously they have had some changes in leadership, would the American side be kind of an empty bench? Are these projects so big that you as Roboflow maybe could still, you know, dig deep in and fund them on your own, or do you need a hyperscaler partner like Meta to really get to the scale that you need? I guess, strategically and geopolitically, and then you can dig into the partnership with Meta, too. What should we expect in terms of, you know, are we just going to continue to get these great open-source models like manna from heaven, or is that maybe more precarious, or more a moment in time, than people may appreciate? I think for open-source AI, there are
reasons to be concerned about its future relative to its past in terms of the number of open-source models we're going to get. However, there are a lot of things that give optimism, too. So, in vision in particular, you bring up something that I think is under-discussed, which is that in visual AI in particular, the US has almost never led, whereas in language, we have consistently been ahead with closed models and open models alike. And there are a lot of reasons for that, geopolitically just as much as task emphasis and execution, but everything from the importance of vision in manufacturing to the importance of manufacturing in the Chinese economy, these are all trends that tell you why focusing on visual understanding as a domain is probably a high priority. But yeah, to name names of the folks that I think are in that mix: the Alibaba Qwen team have done phenomenal work. They recently had Qwen-VL in the initial Qwen models; Qwen3-VL is world-class and
competitive with even closed models in its vision-language reasoning and scene understanding capability. The Qwen team has also had some recent leadership changes, or leading-researcher changes, so that might be tenuous. The GLM team, we were talking earlier about their mixture-of-experts model, especially for OCR-specific tasks, but even just surpassing what's possible with closed models with their 9-billion-parameter model, which is impressive work. And then there's the DeepSeek team, which, if you remember, published an OCR paper where the innovation was actually a data processing technique for LLMs, where they gave a model a screenshot of a page as a way to get more tokens versus just each individual word in the form of readable text. And the realization was that the degradation of understanding was much less than the compression that was gained. So, it was basically a way to get more tokens to scale up training. In the US, we're not without folks that
You mentioned Meta, the publishers of the Segment Anything family of models, and I'd say SAM 3 is the best open-vocabulary model globally, and Meta are the publishers of it. The Meta team, all the way back to FAIR and Yann LeCun, started efforts with Detectron2, Faster R-CNN, introducing DETR, the DINO family of models. One thing people dunk on Meta about is their language models, and they under-credit how consistently good Meta has been at visual AI in particular and at advancing computer vision. And if you think about their business, this makes sense as well: making sense of the photos and images that people share on social, just as much as the future of glasses and so forth. You also have Microsoft with the Phi family of models that are multimodal, and the Allen Institute with Molmo, though if you want to talk about funding and turbulence, that's a topical thing that's taken place in the last little bit. On the diffusion side, you have Mistral doing some work, and Black Forest Labs out of Europe. And then I think the other one that's pretty exciting is NVIDIA. NVIDIA has put a ton of effort and investment into open-source AI; I think they have the most open-source model repositories now just by count, if that's your rough heuristic. There's the Nemotron family of models and Cosmos Reason. I was talking with one of their directors of open source, Nader, recently about how much they're investing in making those models increasingly multimodal, and the Cosmos Reason team is doing great work to advance beyond just visual reasoning capability. So there is this geopolitical race, for sure, of wanting to have the best models possible. Now, you asked the question that is near and dear to me: where does Roboflow fit in this, since you're not a foundation model company, you don't have a data center the size of Manhattan like Meta, and you're not the publisher of GPUs like NVIDIA? This is actually something that gives me an intense amount of pride.
We published RF-DETR, and RF-DETR retook state of the art for the US in a very specific area of important tasks: real-time object detection and real-time instance segmentation. Before that, you had models like LW-DETR and the DEIM family of models, which are tougher to fine-tune, both of which were great work published out of labs in China. But RF-DETR, to tell you the story and how it ties these threads together, is the first real-time instance segmentation transformer, as well as the fastest and most accurate for doing pixel-wise segmentation and detection. The bet we made, which initially I wasn't sure would work and which I've been delighted to see work so well, is that we picked a narrow task and a small model that is useful on edge tasks, which, just as you and I have been discussing, I think is comparatively under-addressed.
And so, we basically have this novel area where we know people need things on the edge, we know people want models to be theirs, we know that open-source AI is under attack, and we know that it's incredibly important to give people models that can run in environments they might not otherwise have. The way we did that marries these themes. We took a DINOv2 backbone, so a pre-training from the Meta family of models; they've since released DINOv3, but this was a DINOv2 backbone. And we noticed that models in the transformer family had been improving accuracy, but not speed, for detection-type tasks. Similarly, there were some models outside the transformer family that were faster, but not more accurate. So we said: if we use a DINOv2 backbone, with all the benefits of that pre-training, and we use a shared-weights neural architecture search (NAS) strategy, can we intelligently search and find the optimal speed-accuracy model from an Objects365 pre-train that then works downstream on COCO and user fine-tune tasks, and attach a segmentation head and a detection head? At the time we started these experiments, it was soon after we'd raised our Series B. In total, we've raised about 63 million across all rounds, just to give you a sense of the resources we have at our disposal. Not nothing, but also not all being spent on just this problem, of course, and it pales in comparison to the billions being spent on foundation models. Through the training runs, we realized this technique had promise, so we invested further in it. We introduced the first detection model last April and the segmentation model in the fall, and we continue to invest in making the developer experience really high quality there. And critically, something I'm super proud of: it's Apache 2.0. There have been YOLO-family models, which we support and folks can use, but those are actually not commercially permissible without a commercial license, which we're able to offer, which is awesome as a company.
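To make that developer experience concrete, here is roughly what fine-tuning RF-DETR on your own data looks like with the open-source rfdetr package. Treat the dataset path and hyperparameters as illustrative, and check the package README for the exact current arguments.

```python
# pip install rfdetr
# Minimal fine-tuning sketch with Roboflow's open-source RF-DETR package.
# The dataset directory (COCO-format) and hyperparameters are illustrative.
from rfdetr import RFDETRBase

model = RFDETRBase()           # loads the pretrained detection checkpoint
model.train(
    dataset_dir="my_dataset",  # hypothetical path to a COCO-format dataset
    epochs=10,
    batch_size=4,
)
```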
But I think there are places where people just want to build models and maybe don't necessarily have commercial ambitions. So, it's world-class at what it does, and maybe to give a sneak preview, we already know some ways it can extend to other task types and squeeze out yet more accuracy. The LW-DETR team in China has responded, but not beaten back some of our work. So it's this cool kind of global arms race where your tiny friends at Roboflow are putting the US on the map in a pretty big way. Now, if Meta gave up publishing open source tomorrow, or if NVIDIA stopped publishing open source tomorrow, then just as I described, all of open source would take a hit, because a lot of improvements come from taking the best ideas, experimenting, running ablations, smashing them together, and smart minds, certainly smarter than me, thinking about how to solve these outstanding problems. So, it's something that I think tells the story of what's going on in open-source vision, and something we're proud of, just as much as the problems that are yet to be done.
Well, yeah, I'm interested to hear more. And if I was going to do one double-click there: we're obviously entering, according to many, and I'm among them, the era of recursive self-improvement broadly, of AIs doing the AI research. It sounds like you dabbled in that a bit with this architecture search. Looking back on that experience, was there anything you found surprising? Did it feel like a brute-force grind, or are there stories to tell about eureka moments coming out of that architecture search that felt somehow qualitatively different from a brute-force grind through architecture space? One thing that I think is really exciting, and I'll go deeper on this idea of weight sharing in neural architecture search: a lot of the time you're doing something very brute force, training a bunch of different models, comparing the speed and accuracy of those models, and you're almost doing a grid search of different parameters that could help.
An informed grid search, right? You're not doing things you'd consider naive, but it is fairly naive guess-and-check: train this model, see its speed, evaluate it, back and forth. And there's still, of course, a degree of that. But, and we published a paper, so the details here are open for anyone to dive into, what we did with weight sharing in neural architecture search is, rather than train a separate model for every accuracy-latency configuration, we use weight sharing in NAS to basically train thousands of subnetwork configurations in parallel within a single training run. At each training step, one subnetwork is sampled by randomly choosing parameters like the patch size, the number of decoders, the number of queries, the input resolution, and the attention windowing (we use deformable attention in the model). Then at inference time, you can actually sample any of those subnets. What that means is we haven't just introduced one model; we've actually introduced a framework by which we can repeatedly produce open-source models, as long as you can do NAS against the architecture. A NAS training run isn't as efficient as a single training run, but it's also not 7,000 times less efficient, despite having the ability to compare all those different configurations; a rough sketch of the loop is below. That was a huge freaking unlock for us, to be able to use our compute budget efficiently enough to release models like this. So that's one huge part of the story. The other notable unlock was rewriting deformable attention, which isn't supported in every inference engine, so we've had to rewrite support for it, or wait, for example, for TensorRT in NVIDIA ecosystems to support it, and now it does. But that was a useful realization. And then I mentioned the DINOv2 backbone, and now DINOv3 is out, so you can imagine what experiments we're running. So yeah, the weight sharing in NAS is something that's massive. And by the way, anyone can use NAS on their own data set; it creates a one-of-one model for your problem.
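To illustrate the idea, here is a toy sketch of weight-sharing NAS in general, not Roboflow's actual training code: the supernet object, the search-space knobs, and the evaluation helpers are all hypothetical stand-ins.

```python
import random
import torch

# Toy weight-sharing NAS: one supernet's weights are shared by every subnet.
# The search-space knobs mirror the ones mentioned above; values are illustrative.
SEARCH_SPACE = {
    "resolution":   [384, 512, 640],
    "num_decoders": [2, 3, 4, 6],
    "num_queries":  [100, 200, 300],
    "window_size":  [8, 12, 16],
}

def sample_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def train_supernet(supernet, loader, steps, lr=1e-4):
    opt = torch.optim.AdamW(supernet.parameters(), lr=lr)
    for _, (images, targets) in zip(range(steps), loader):
        cfg = sample_config()             # a different random subnet each step
        outputs = supernet(images, cfg)   # supernet routes through only the
                                          # decoders/queries the config selects
        loss = supernet.loss(outputs, targets)
        opt.zero_grad(); loss.backward(); opt.step()
    return supernet

def pareto_frontier(supernet, val_loader, n_candidates=200):
    # After the single training run, score many subnets and keep only the ones
    # not dominated on (accuracy, latency) -- the curve users later pick from.
    scored = []
    for _ in range(n_candidates):
        cfg = sample_config()
        acc = evaluate_map(supernet, val_loader, cfg)  # hypothetical eval helper
        ms = measure_latency(supernet, cfg)            # hypothetical timing helper
        scored.append((cfg, acc, ms))
    frontier = [a for a in scored
                if not any((b[1] >= a[1] and b[2] < a[2]) or (b[1] > a[1] and b[2] <= a[2])
                           for b in scored)]
    return sorted(frontier, key=lambda r: r[2])
```

The paper has the real recipe; this just shows why one shared run can stand in for thousands of separate ones.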
The way NAS works is that it trains, creates, and outputs a Pareto frontier you can then pick from: where do I want to exist along the speed-accuracy trade-off? You can be anywhere along that curve within your available compute budget, or you can obviously just max accuracy and give up some speed. When we saw that NAS worked on Objects365, we were interested in whether it worked on downstream tasks. And now we've actually rolled out the ability to run GPUs in the cloud that'll do hosted NAS on any given data set. To the theme of owning your own AI: if you run NAS on your data set, literally no other model architecture exists that matches your data set. It's a one-of-one, and if I were smarter about crypto, there's probably some interesting one-of-one thing to give somebody, but that's outside my wheelhouse for sure. But it is maybe the purest form of your model, because literally no other model would have landed on those optimizations for the data set you wanted to train on.
So, NAS was kind of the unsung hero, a huge unlock for the efficiency gains we were able to see. Yeah, that's really cool. One thing I've noticed you've done, and I don't know if you've done it exactly for this yet, but when people hear about this whole NAS thing, it sounds complicated, right? We're going to have a whole Pareto frontier's worth of models. I'll just speak for myself: that sounds both awesome and complicated. But I notice that, for at least some things, you're following the trend I'm seeing everywhere these days of, here's a skill that you can just give to Claude Code and have it speed-run through the process of setting this up for you. So, I'm interested in how easy it is these days to get started. If I'm sitting on some esoteric problem and maybe a small amount of data, and I'm thinking, oh, this Roboflow guy sounds like he's got some pretty cool techniques, what's the path of least resistance to come out the other end of this tunnel, potentially not having done much work, and have my own one-of-one model with its own Pareto frontier of possibility and all that good stuff? For NAS specifically, I gave you the real stuff because I think the audience will want to dive deeper, and I'm a skeptical person myself, so I'm like, give me the real info on what's going on under the hood; that's why I mentioned the paper's out there. But as someone that builds products, we also created the easy button: it's "run NAS on my data set," and then boom, we spin up a bunch of subnets on GPUs, kick off the training job, and show the results. What comes back for the user is, here's your Pareto curve, and it's one click to pick where you want to be along that curve. That's for the human user. You mentioned the agent user, which is also an interesting place to spend some time.
But the first thing I would note is that a core thesis of Roboflow, of why we build things the way we do and how we approach things, is that you want to be very interoperable and allow someone to progressively reveal complexity, while setting good defaults. You can think about the products we build that wrap RF-DETR, or wrap our inference server, or use NAS: by all means, someone could set up their own infrastructure to do training and re-implement NAS; it's all out there, it's open. The thesis is that by making it easier and simpler, you actually inspire and engender trust, that the benchmarks can be reproduced, that folks know where things come from. And then ease of use as a guiding philosophy means strong defaults can be set: I can have a model trained for me, or a data set that gets curated for me. Or on the inference side, which we haven't spent a ton of time on: if you're doing vision-specific inference, there are a lot of optimizations and assumptions you can make to most efficiently use the GPU for just the parts of the network that require it, versus, for example, a resize, which you can run on CPU. And all of that's open: inference, if you pip install inference, is an open-source GitHub repository anyone can use.
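For context, getting started with that open-source inference package looks roughly like this; the model ID and image path are illustrative, and the exact call signatures may differ from the current release, so check the repository docs.

```python
# pip install inference
# Rough quick-start with Roboflow's open-source inference package.
# Model ID and image path are illustrative; see the repo for the current API.
from inference import get_model

model = get_model(model_id="rfdetr-base")      # loads a pretrained detector
predictions = model.infer("factory_line.jpg")  # image path, URL, or numpy array
print(predictions)
```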
And if we provide that as good defaults in the service, and if we're worth our salt at all, then we should make products that are easy to use. That's for the human user. Now, you mentioned something that's really exciting in general, certainly bigger than any one company, and that is agents becoming the biggest user. What does that look like? Like a lot of companies, we're leaning into the idea of exposing CLIs, and maybe MCP; we might yet release a Workflows-specific MCP server, and certainly we have lots of good CLIs, and there's this ongoing debate of whether the future is MCP or CLI. At a minimum, there are CLIs for all the actions that take place in the platform, which Claude Code and Codex and so forth can take on behalf of a user, so that you can say, "Go optimize a model for me and make that easier for me."
Now, something that we are investing in as well: you've seen a common trend of companies that build amazing infrastructure products, and here I'll give a lot of credit to Vercel, where they've done an awesome job of making a common set of products for the front-end cloud and, increasingly, the back-end. Then they've layered v0 on top of that, right? Their product where you can chat with it and it'll build a website and choose good infrastructure for your problem. We've taken a lot of inspiration from that as a way to enable users to similarly chat with our Workflows AI agent, where it's like, "Hey, I just want to count people crossing a line, or watch cars in the intersection, or whatever it is." And for a lot of these problems, it's interesting, the hardest part is actually discerning what the user wants, what their intent is. Once you have a sense of the intent, then models can intelligently reason: "Okay, Nathan wanted to look at cars crossing the intersection, so is there a model that already knows cars? Probably SAM 3. Actually, cars is a class, so probably RF-DETR, and it'll be faster and more compute-efficient. Great, let's grab that," and you get a pre-trained model that knows cars, wired into a pipeline like the sketch below. "Okay, you said crossing the line. Can I ask the user? The intersection has multiple places you could have meant by crossing, so which did you mean?" Increasingly, you can be in this future where, if Joseph were sitting down with you side by side, helping you frame your problem, pick the model, and choose the architecture, we can expose that as an agent in a democratized way, for lack of a better term, that anyone has access to: folks who spend many hours a week thinking about these problems being the guide and sherpa for building a given pipeline. So, in the scale of things, it's ease of use with good defaults as a platform principle, with complexity revealed progressively. Second is agents using CLIs to use the same sort of easy-to-use stuff. And the third is a first-party agent, which we haven't released yet, but maybe by the time this comes out folks will discover it, to guide folks down that journey.
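For a sense of what that "count things crossing a line" pipeline could look like in code, here is a rough sketch built on Roboflow's open-source supervision library; the model ID, video path, and line coordinates are illustrative, and the exact signatures should be checked against the library docs.

```python
# Sketch of a line-crossing counter with the open-source `supervision` library.
# pip install supervision inference
import supervision as sv
from inference import get_model

model = get_model(model_id="rfdetr-base")  # any detector that knows "car"
tracker = sv.ByteTrack()                   # gives each detection a stable track ID
line = sv.LineZone(start=sv.Point(0, 540), end=sv.Point(1920, 540))  # illustrative

for frame in sv.get_video_frames_generator("intersection.mp4"):
    result = model.infer(frame)[0]
    detections = sv.Detections.from_inference(result)
    detections = tracker.update_with_detections(detections)
    line.trigger(detections)               # updates in/out crossing counts

print("crossings:", line.in_count, line.out_count)
```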
So, that's some of the ways I like to think about building products that balance giving someone the satisfaction and awareness that it's built on good primitives, while still creating products that are easy to use and let folks get to value quickly without needing to know everything about every subnet of a training run, for example. A very particular question that comes to mind, one you might help me with or maybe set my expectations on a little bit: my company Waymark, we make videos for small businesses, really focused on TV-quality advertising. Your classic 30-second TV spot is our bread and butter. Customers have asked us from time to time, "Hey, could you help us with display ads, too?" We partner with a lot of cable companies and media companies, so they are ultimately the ones selling advertising inventory to the small business; we're helping enable that with the creative solution. They ask us about these display ads as well, and now we're getting to the point where, boy, maybe we could add that; we can vibe code all kinds of stuff much faster than we used to, certainly. But one challenge that I used to have a lot, and I'm not sure what the state of it will be today as I'm digging into this, is aesthetic evaluation. Way back in the day there was really just one open-source dataset and a couple of open-source models trained on it that seemed to do a halfway decent job of aesthetics. And by halfway decent, I mean I could tell which was the top and which was the bottom end of the distribution, but in the middle it was very unclear which way I was headed a lot of the time. And then there was one company that had one too, Everypixel, I think it was called, that had trained their own aesthetics model. These days we typically go to foundation models for that,
and we ask, "What's suitable? What would make the business proud? How would you advise us on which of these available images to use?" And they work pretty well, but definitely slowly, and they definitely cost more than we'd like to spend in many cases to grind through the huge library of images a small business might have. Is there anything in the small open-source world that would be able to tackle a problem like that, or is that still so esoteric that nobody's gotten around to building that foundation for me? Aesthetics is a tough one for the reasons you described: the types of problems that models can recursively improve against are the ones you can benchmark. And the second you can benchmark it, you can scale up a bunch of compute and the bitter lesson takes hold. Aesthetics is maybe a little bit eye-of-the-beholder in terms of what's good and what's bad. Now, there are some places where, for example, even if you just take diffusion models, some people just like the way Midjourney looks more than they like the way ChatGPT looks, more than they like the way Gemini looks when it creates images. And maybe the model you're talking about is from the LAION team; they had the aesthetics predictor model that helped evaluate some of these things, because they also did some generative image work, so they released their aesthetics evaluator as well. >> Mhm. That wasn't out when I was first really struggling with this problem; I think the timing was such that we had already moved to foundation models, but that was definitely the best purpose-built thing I've still seen to this day. One thing that I'm sure you're aware of, and that your audience might find useful as a way to reason about this: in the context of display ads, there are some services, Facebook for example, where you're not allowed to have text cover more than a certain percentage of the display ad. They find that it just reduces the quality of the ad for the end user, or whatever the reasons are. For sure, of course. That's a great example of the distinction between "does this ad feel good taste-wise" versus a rules-based check,
as in, is there too much of this image covered by text? Then for automation of taste and aesthetics and preference, if I could give you a thought on how I might approach that problem: it's a great RLHF-style problem, where if you have a given client, you know their brand guidelines, you know their style, and perhaps there's enough history of display ads they've run that you can get almost a vibe-check model tuned to what they've done. And with foundation models, like you said, perhaps you can do a few-shot approach of, "Hey, these are the ways this client commonly likes to do things. Is this similar?" Something like the sketch below. Again, the big problem even with that approach is that so much of marketing is being different. If you're adhering to the brand guidelines, you might be stylistically following what you should have done, but failing the top-order task, which is to stand out from the noise. So, short answer, I don't have a great zero-shot aesthetics model for you beyond the things you're probably already doing. Longer answer, I think it's a great example of the conversation you and I have been having about what distinguishes a task where you can post-train your way to victory with objective metrics, versus one that's a little more loosey-goosey to benchmark, and therefore lives outside the range where you can toss compute at it and get better results.
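A hypothetical sketch of that few-shot vibe check: call_vlm is a stand-in for whichever multimodal model API you use, and the file paths and JSON shape are made up for illustration.

```python
# Hypothetical few-shot "brand vibe check": show a multimodal model a few ads
# the client has approved before, then ask it to score a new candidate.
# `call_vlm` is a placeholder for your VLM client, not a real library function.
APPROVED = ["ads/approved_1.png", "ads/approved_2.png", "ads/approved_3.png"]

PROMPT = """The first {n} images are display ads this client approved in the past.
Score the final image from 1-10 for how well it fits their style and brand
guidelines, then give one sentence of reasoning.
Respond as JSON: {{"score": <int>, "reason": "<text>"}}"""

def vibe_check(candidate_path: str) -> dict:
    images = APPROVED + [candidate_path]
    return call_vlm(prompt=PROMPT.format(n=len(APPROVED)), images=images)
```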
Okay, with the time we have left, let's talk about frontiers: what's coming next, in any number of different directions. There are of course new architectures that people tend to get excited about, myself included, things like Mamba and state-space models more generally; at one point in time there was an explosion of vision use cases there. World models are obviously a big deal, and I'm not really sure how to think about how they will relate to vision. We've got increasingly credible claims that people are going to start to scale up humanoid robots and put those into, presumably, factories first, but then businesses and homes not too far into the future, either.
What are the things that you are most excited about? What are the biggest questions you have where you're like, "If this works, it's going to be a game-changer, but I'm not sure if it will"? We're all about scanning horizons here; what are the horizons you are scanning? There are things I think are continuations of trends that are working, and then some newer S-curves that we're starting to ride. A trend we're continuing to ride is transformer-everything. We talked about how the vision transformer came in 2020, and "Attention Is All You Need" was 2017. You've seen diffusion transformers (DiTs) and vision transformers (ViTs) continue to eat more and more tasks and achieve SOTA accuracy; RF-DETR is exactly that recipe applied to real time. That trend is known and going to continue. Another trend that's maybe more nascent is self-supervision, especially in the DINO family of models.
And DINOv3 kind of showed that you could have good latent understanding of things as a backbone without large amounts of supervised, labeled data, and then you can use that image understanding downstream for tasks, whether that's detection or segmentation or captioning or whatever. Can you tell me what the unsupervised trick is there? I always tell people the big unlock for language was that language itself is structured, so if you just have a ton of text, predicting what comes next given some text gives you lots to work with. Similarly with CLIP, right? There, it turned out there were billions of captioned images. What is the unsupervised unlock for DINO? Okay, so in the DINO family of models you have versions one, two, and three, and they're all riding this trend of self-supervision. The DINOv3 model, I think, was trained at billion scale; I'd have to check the exact stat, but I remember seeing that it was similar to the number of images on Roboflow Universe, and I thought, "Huh, there's something there." So, billion-scale-plus images. And the observation is that if you start to have sufficient representation of given domains, then, maybe intuitively, like a human who was never told what things are, you start to develop intuition for where and how structure should exist in a given scene. And that is understanding. If you know that the lamp is on top of the side table, and often in a bedroom, then you have a set of understanding of a given scene, and you can use that understanding. Again, it's a backbone, so you can't use it alone: you can attach a classification head to DINOv3, you can attach a segmentation head to DINOv3, but by itself it's just a backbone with really rich latent understanding of scenes. And that's the unlock. If you just look at a bunch of scenes, you're going to start to develop your own, well, pattern matching is a crude way to think about it.
So, does that involve some sort of masking-type thing, though? How is it creating a prediction task for itself that nobody needed to label data for? In training, and there are papers, fortunately, so we can verify and understand the details, they use self-supervision techniques: you typically take a student-teacher setup, where a teacher model validates the output of the student, and as the student keeps doing well at predicting patches, with techniques like Gram anchoring, you continue to scale that student-teacher training recipe to larger amounts of data to understand more kinds of scenes. So that was part of the training recipe. The way the understanding happens actually isn't that dissimilar from the vision transformer itself, where you have patches; it's actually crazy that this works. These models, and there are different approaches, take patches of the image, and it felt very unintuitive, and still feels a little unintuitive to me, but if you have patches of an image, you can almost understand the rest of the image even from individual patches, even if you treat those patches independently. It reminds me, way back in the day I used to do language stuff, of bag of words, where you have a document, you count the number of times each word occurs, and you start to get a sense of what that document is about. Something similar is happening with understanding patches of a given image. Now, there are other techniques, and by the way, this is also why we were talking earlier about the struggles with spatial reasoning: you'll have more techniques that use cross-attention to get a better understanding of where things are in a given image with respect to one another. But the core unlock is this: if you have a high number of images of various scenes, and you run verifiable, falsifiable tasks, fill in the blank, what else would you expect to be here, or diffusion-style generation, and you have a teacher that's able to validate the student's work, great, now you have the recipe for a self-supervised loop; you can plug in more data and scale up. A minimal sketch of that kind of loop is below.
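Here is a stripped-down, DINO-style student-teacher loop for intuition. This is a generic sketch, not the DINOv3 recipe (which adds multi-crop views, masked-patch objectives, Gram anchoring, and much more); make_vit, augment, and unlabeled_loader are hypothetical stand-ins.

```python
import copy
import torch
import torch.nn.functional as F

class DINOHead(torch.nn.Module):
    """Backbone plus a projection head that outputs prototype logits."""
    def __init__(self, backbone, feat_dim, out_dim=4096):
        super().__init__()
        self.backbone, self.head = backbone, torch.nn.Linear(feat_dim, out_dim)
    def forward(self, x):
        return self.head(self.backbone(x))

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    # Teacher targets are centered and sharpened, then treated as soft labels.
    targets = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    return -(targets * F.log_softmax(student_out / t_s, dim=-1)).sum(-1).mean()

student = DINOHead(make_vit(), feat_dim=768)   # make_vit(): any ViT backbone (hypothetical)
teacher = copy.deepcopy(student)               # teacher never receives gradients
for p in teacher.parameters():
    p.requires_grad_(False)
center = torch.zeros(4096)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

for images in unlabeled_loader:                # note: no labels anywhere
    v1, v2 = augment(images), augment(images)  # two random views of each image
    s1, s2 = student(v1), student(v2)
    with torch.no_grad():
        t1, t2 = teacher(v1), teacher(v2)
    loss = 0.5 * (dino_loss(s1, t2, center) + dino_loss(s2, t1, center))
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        # Teacher is an exponential moving average of the student, and the
        # center tracks the mean teacher output to prevent collapse.
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(0.996).add_(ps, alpha=0.004)
        center.mul_(0.9).add_(torch.cat([t1, t2]).mean(0), alpha=0.1)
```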
And that's what they did. They didn't release the data set, but I think it was a billion-plus images in the DINOv3 pre-training; I'd need to check that. So it's actually really cool that that works, honestly, and that it's open and there's a good technical report for it. Okay, sorry to take you down that rabbit hole. Let's pop back up to more horizon scanning. World models, actually, JEPA-type things: those are always hotly debated as to whether they're the inspired future that few can understand, or whether they're kind of beside the point. I still don't know where I come down on that myself. Feel free to opine on that or anything else; really, I'm most interested in which horizons you think are the most important ones to be watching.
watching. So, we were talking about ones that like we're already riding the known S curve of transformer examples of self-supervision and and how patch embeddings work to create understanding. New S curves that I'm excited that we're like we society collectively are starting to ride. One is world models. Within that category, there's a various number of techniques like you described the V-Jepa technique, there's the World Labs techniques. And the idea of a world model um and there's different labs with different approaches, but the under the objective of a world model is can we understand and reason about I'm trying not use the word world in the definition, but like scenes and places and out in the world it's a well-named it's well-named thing. Out in the world existence with a with an with a new architecture. And if you think about that, it's Okay, so what's new, what's different? You're inherently there's different approaches, but you're inherently multimodal by default. You're
thinking about some models will think about this as like a next scene prediction of video of Okay, right I've entered this some approaches of predict the next scene. Some will think about it a bit more like diffusion of a single viewpoint. The so what for world models that like I'm interested in is we will world models give us true understanding with physics with um spatial reasoning with open-ended tasks that we can just start to use? And the answer is probably, but the more interesting answer is on what time horizon, and there I'm not sure. I think what's interesting right now is we're using world models like I I would argue maybe even Cosmos reasoning is an example of a world model and you can use Cosmos reasoning for at a minimum, it's boring, but like synthetic data generation and at a maximum, perhaps you can use it actually reason about and use something to navigate a given space. So, world models is one category. I think about vision problems is like um
I think about vision problems in terms of read-write access, and I think about Roboflow, for what it's worth, as mostly about the read access; world models are a form of blending read and write access. Robotics is an example of write access to the real world: you are modifying and manipulating the real world with a robot. Of course, that requires understanding; you have to have read access to a scene to have write access to it. World models are exciting because they potentially blend that, and potentially you get the understanding zero-shot or few-shot. You could even argue that some of Sora 2 has the underpinnings of a world model, understanding what's taking place. But that's one trend. Another trend to give an overview of, and we can certainly go deeper on some of these, is VLAs, vision-language-action models, which are incredibly popular in robotics. A vision-language-action task is: you provide an instruction, and a robot is able to act on that instruction. "Move my computer 50 cm to the left" might be an example of an instruction you provide, and a VLA would be able to act on it. In that world you have a number of emergent, younger startups thinking about it, you have NVIDIA's GR00T project working on it, and Google's RT-2 working on that set of VLA problems. Maybe another way to think about VLAs is as a new task type, a new paradigm, and we should expect the same things you and I were just discussing around different model sizes and different levels of generalizability. VLAs in some ways will need to be edge-ready, because if you're going to run on an embedded device and have embedded intelligence, then of course the thing needs to be at the edge and run in real time. So I think VLAs are an emergent, exciting trend that's perhaps still under-indexed. And then there's a trend that isn't unique to vision, it happens in language too, but it's worth describing:
visual understanding benefits from inference-time scaling in multimodal reasoning, and in reasoning in general. In a lot of ways, vision can be a tool call of a broader agentic system that wants to understand and describe how to do things in a scene. For me, I find I'm using Gemini all the time as a replacement for instruction manuals: what does this button on the remote do? The pilot light in my water heater went out recently, and I'm like, "Okay, tell me about this specific model." That's somewhat of a high-stakes task that one would want to proceed with caution on. I grew up as the son of a farmer, so if I couldn't figure out how to do that, I'd probably be exiled from the will. Fortunately, me and my friends Gemini and ChatGPT were able to solve the problem. But that's a perfect example where I'm using visual reasoning in the real world while interacting with it in language, and in the post-training reasoning there's probably a tool call to do some search, figure out the water heater model, and figure out the instructions to provide. All of that is in the category of: you have a big compute budget, and you can do post-training and inference-time scaling to get better results. That's just going to continue; it's just getting going. And you can start to think about that as giving rise to visual agents you can set off to go do a task for you: organize my images for me, or in your case, perhaps one that can do good aesthetic categorization of which display ads are brand-aligned and which are not, across the categories you care about. We can learn from coding agents there: when you can let something run unencumbered over a long duration with a model, we'll get similar benefits from long-running vision agents that can understand scenes and do things for us, with all the caveats of speed and latency included. So, those are some of the trends, the hype-y ones, that I'm thinking about and paying attention to. What I try to do when I break these down is put them into my normal-distribution bell curve of the world
and figure out the impact, the implication, where people can use them, whether they can make it their own, and where it's going to be most useful. Broadly, I think the recording of this episode is well timed, because the vibes, the pendulum, are swinging back toward vision: you hear about the rise of physical AI, of multimodality, the rise of hardware, and the question of what's defensible in a world where SaaS is constantly being rewritten and code generation keeps getting simpler. That's putting more and more people onto thinking about the real world and hardware and, ultimately, cameras and getting things into those environments. To me it's: welcome, the water's warm, we've been here all along, the infrastructure's hot, let it rip, and we're fortunate to be able to power a lot of that sort of stuff. Hype-wise, it has me pretty excited about the amount of activity that's about to enter the space. So, those are some trends and themes I'm tracking. How about wearables as one other one?
That seems to bring a lot of these challenges together, right? Because if you're going to have something on your face, it can't be too heavy, it can't get too hot, and it has to understand what's going on around you well, or it's more annoying than it is valuable, right? Totally. We started a partnership with Meta around the Segment Anything family of models and now more general visual understanding, such that, for example, when they launch SAM models, they're on Roboflow with day-one support, and now we're helping them understand where the model can and can't be improved, and work like that. For Christmas, my significant other got me the Oakley Metas. She was like, "If you're doing this awesome Meta work, you've got to be dogfooding their stuff." It's my first pair of wearables that are mine; I've used others, like Spectacles, since I'm always tinkering with stuff, and when the Apple Vision Pro came out, of course I gave that a run. I'm pleasantly surprised. Wearables are going to inflect. There were 8 million pairs sold last year; by point of comparison, 60 million AirPods. So, a pretty good amount of volume moved. And the Oakley ones in particular are targeted at activity. I like to cycle, and you typically already have a slightly bigger pair of sunglasses on your face for cycling; I like to run. They do bone conduction for music, they understand the scene. The AI on board is not there yet: you have to have your phone with you, and they're presumably offloading some amount of the heavy lifting to the phone. You can say, "Hey Meta," and get some feedback. But again, as with many things in AI, it's the famous expression: this is the worst it'll ever be. And it's now useful enough, in a form factor where this is just a pair of glasses, that I went on a ride Sunday with some friends and they didn't even know the glasses could play music and capture media. Then they gave them a spin for the first time and were like, man, we need to get these for our next ride. That really gave me the sense that this technology has arrived. But yeah, there are the constraints of running on the edge and constraining the amount of power draw. What's funny is,
I'm a bit like Charlie Brown running up to the football and swinging and missing on AR. Roboflow actually started out building AR apps. Before we even had a company, in 2017, we made AR apps just for fun, and I was like, "Oh man, we've arrived." How wrong was I on the timing of that? And then we came back at it in 2019 and made more AR apps. I think the big unlock is the form factor: you don't have to have the glass brick in your hand, you can have a different thing. So, I'm pretty excited about wearables. Snap also has their Spectacles; they were the first publicly traded company to mention Roboflow in an earnings statement, so they'll always have a special place for me, from when we did an integration with Snap Spectacles for developers to create custom lenses and scenes you want to understand. Actually, we had someone count the number of stop signs on their walk; it was a funny thing, because I think they wanted to prove something to their neighborhood about safety. So there's something brewing there, but the big change, certainly, is that the hardware has gotten good enough, and the consumer willingness to adopt is showing up in the numbers. I hope, and I think, this will happen: I hope that ecosystem stays open, or becomes more open, I should say, so that anyone can publish apps. I don't have any inside information here, but I would bet the strategy is that right now it's closed APIs because you want to curate the experience and have a high-quality first user experience with the apps people can use, and I would bet the strategy will then be to open it up, App Store-like, or maybe even Android-like where anyone can side-load. So I'm excited for that future, but I think hardware platform adoption precedes software adoption, and we're just now starting the S-curve of hardware adoption for wearables. So, if you had to zoom out, and this is a big ask, but if you had to zoom out from all these various horizons that we've just been scanning and try to tell a story of how vision impacts life in general in the next few years:
I do think it's hard to predict anything more than a few years out at this point, but how do you think life changes? Are we all going around with always-on cameras? Is that normalized? Do we all have a sort of 24/7 retrospective video of our lives, maybe subject to times when we choose to pause it? Are there other unexpected changes to life that happen as these technologies get deployed that you think people are sleeping on? I think AI is going to change everything, to a first approximation, but I'd be interested to hear your take on what particular contributions vision will make to that and how it will feel as we're actually living it. Man, I would love to paint the optimistic future for you of what vision unlocks for us, step by step through everyone's day.
From the moment you wake up, to having food that's been produced at higher quality with fewer pesticides, because you didn't have to spray all parts of the field, only where you saw weeds. Maybe you had eggs for breakfast, and you want those eggs to have been visually assured to be safe, from the hens through the whole supply chain that got them to your grocery store and your house. Maybe you grab your clothes out of the washer and dryer, which for some reason you still have to sort into whites and colors, which is very obviously a silly, there-for-the-taking vision problem. To your fridge auto-stocking itself, because it saw you were low on eggs in the first place, so you didn't even have to go shopping; it calls the Instacart MCP so you automatically have the food in the fridge. You take your self-driving car to work. There are zero accidents, because all the cars are communicating with one another, and it's faster than you've ever gone, because you don't have to worry about the unpredictability of someone else's actions when networked systems are talking to each other at an intersection. You have Wi-Fi along the way, so you're able to spend more time with your family because your workday already started on the way to the office. You're in the office, communicating with colleagues all across the globe, and you have pixel-perfect, fused representations of them in the room next to you. Remote work, same work, same place; it all feels the same to collaborate, at least digitally. There's still going to be something to in-person human connection, but at least the representation has taken leaps and bounds beyond Zoom, orders of magnitude better for meeting with other people. You get home that night and watch Thursday Night Football, and the stats are real time, and your fantasy team wins because you have the best algorithms to know who was going to play and who was going to score, and you had your vision agent running in the background to do that better and faster than your friends. You have a package that showed up at the right time, in fact the same day, because of all the vision systems in the factory and inventory that made sure the product was made right and checked in at the right places. There was a bot that delivered it to your door so it wasn't strewn about, and your Ring camera made sure there were no porch thefts while you waited for it to arrive.
All the way to the moment you brush your teeth with a smart, silly, but AI-enabled camera that's also doing cavity scans and making sure everything's right in your mouth before you go to bed. This future is not theoretical; every part of that chain is something Roboflow customers are working on. Now, on the thing that's top of mind for folks, the always-on camera question and what society is going to feel comfortable with, I want to give you some of my direct thoughts there as well. Even early on in smartphone territory, it made people uncomfortable that others always had cameras and could capture moments without people being aware that photos of them were being taken, and frankly, even still, that's a real consideration in public spaces, capturing photos or not capturing photos. Over time, society, which is ultimately the judge of this, decides what it's willing to accept: is the increase in quality of life going to be worth the new societal behavior? I would take the bet that yes, because it will start with simple things. Think about my riding with my cycling glasses. They don't have a heads-up display yet, but I'd love to have turn-by-turn directions, and pretty soon I'm used to having that small little display, and other folks get interested in it too. I do think that to build technology companies you have to be inherently optimistic, because you're giving tools to people, and the tools we build are a reflection of what you think about humanity. If humanity is inherently good, then you're able to amplify those attributes, and I do think humanity is inherently good, even if there are bad actors. I think the same will be true for glasses and consent. You can use prior technologies in pretty icky ways: the internet can be used to communicate with friends or support a small business online just as much as to share photos that shouldn't be shared. The same could be true of the next generation of technologies. I have optimism that the benefits will continue to be things folks want to adopt. Though the great news is, frankly,
it's not up to me. It's a jury of our peers; it's what others deem useful and not useful. And then on the governance front, I also think it's important that we have systems and institutions that govern the use of these things in a way that reflects the preferences of the people around us. There's a reason privacy rights should continue to be strongly enforced and should adapt to the change of the times, right? The laws around search and seizure were written well before the existence of cars and phones, so what defines unrightful access and entry has had to be re-examined, and we should apply the same sort of carefully tested laws to new technologies: what is private and not private, public spaces and private spaces. Again, I remain optimistic that the principles we hold dear, the right to privacy and the right to use things the way we want to, at least in the country in which I live, will be the way future products are used and governed. So that's how I think about it as a participant in the system, just as much as someone who enables this future. But the good news is, man, the world is going to get so much better. We have folks accelerating cancer research, cleaning up the world's oceans, removing pesticides from the foods we produce, ensuring that electric vehicles are built correctly, ensuring that stuff shows up on time. I like to joke that we'll be able to power Santa Claus. That future is happening now. It won't be without bad actors and its own set of sticky issues; there will be that case, that front-page story, and we as a society will need to respond and ensure the frameworks and rights we hold dear stay in place even as the tools continue to evolve. So that's how I think about it, and I think we have a responsibility to ensure that the future we want to live in is one we help foster. In a lot of ways, I like to show examples of vision and our results because the vast majority of this is about improving quality of life, not the bad implications or bad actors that folks might sometimes be concerned about.
Those are some long-range thoughts, but a lot of folks ask me that question, building the company we get to build, so hopefully that gives you some color on how we've thought about it. Yeah, that's great. That could be a good place to leave it. If I were going to ask one more follow-up question, because I sometimes can't help myself, it would be: do you think there are technical solutions, or rules we could define in terms of technology properties, that would really help? Here I'm thinking about how we've had a lot of this discussion about very general-purpose models versus very specific ones, and I'm increasingly struck by this notion of safety through narrowness, basically. From a bunch of different angles I'm wondering whether there is a social contract to be had around AI, something like: we want to, and need to, and all stand to benefit tremendously from solving very particular problems, but we also put ourselves at risk if we use fully general models everywhere to try to solve all these relatively narrow problems. In the vision context, one example I could imagine: if you want to watch a public space for moments of violence or whatever, you could run that through a general-purpose model that can tell you anything, and I think in some places we're identifying individuals by their gaits and their facial structure and whatnot. But an alternative would be a very narrow violence-detector model that doesn't really do much except sound an alarm when it has detected something that we want a higher-order response to. So I wonder if you have any thoughts on that. One could argue, I guess, that this sort of happens naturally in some ways, because cost and efficiency pull things in that direction. Somehow I don't feel that comfortable betting on that, and I think we might need a little more of a social contract, or some idea of a new right. I'm always on the lookout for new rights that make sense in the AI world, and one of them might be the right to be classified by the smallest, most narrowly purpose-built model possible for the task at hand, as opposed to being processed by some general-purpose reasoner that could answer any and all questions about me. Anyway, I'd love to hear your thoughts on that. Yeah, I spent some years in DC; I was an intern in the Senate once upon a time, so thinking about some of the institutional questions that affect this stuff is something I've spent time on.
My general thought, as someone who believes in the importance and value of open source for freedom of use, discovery, and use-case proliferation, all of which stem from this idea of giving people the right to tinker, if you will, and to use models where they want to use them, is to ask what AI actually changes about how societal rights need to be enforced. Where I come down is that the outcomes we want in society should continue to be enforced, and AI is a tool by which those outcomes can be realized or not realized. In other words, to be really specific: we have regulations that prevent fraud. We have regulations that prevent forms of violence, regulations aimed at the actual outcome of what happens. A scammer could use an LLM to make it really easy to impersonate someone else, and they should be prosecuted for committing a scam. They likely shouldn't be prosecuted for the size of the model they used, for the model being too big or too small for a given task. So take the idea of mandating someone's minimally invasive use of a minimal model size. Where I think that gets into trouble is that capabilities advance quickly enough, or you have distillation, and you end up with accidental corner cases where you cast too broad a net, and that can stymie innovation and adoption even when the goal was well-intended. I'll give you another example. One could very reasonably steel-man the idea that AI in health care has such far-reaching implications that there ought to be some requirement that if you're going to use AI for patient health, a governance body must approve, inspect, or allow that type of use. And someone could say that sounds like a very reasonable, well-intended position. And then I think about users at Roboflow, like this user at UNC Chapel Hill who was using AI in their lab to automatically count the number of neutrophils that respond to a given experiment. Here you just have a lab postdoc accelerating the rate at which they can experiment, doing the fairly menial, frustrating job of counting the hundreds of colonies of neutrophils that appear in an experiment, where how the proteins react lets you know whether the experiment was good or bad and whether to do another round of treatment. And all of a sudden, that person, who is just using AI in a fairly harmless, in fact quite useful way, would never endeavor to do it, because it technically touches patient health. So you've put yourself in this accidental position with something that was well-intended: I don't want to harm patient health, or I don't want a model of a given size used for a given use case. When in reality, probably the way to attack that is to say you should be liable if you use procedures that harm patients; there's plenty of this already in the medical system:
you should be held accountable for practicing medicine correctly, in a way that ensures patient health is respected. So, to be succinct, I don't think that mandating a narrow model size nails the way you and I would probably want this technology to unfold. I am optimistic that the kinds of regulations which inhibit misuse of any technology, or of any behavior, should be applied to AI, and I think regulating at the tool level is ripe for accidental slowdown and deceleration of what I view as the modern industrial revolution, one that's going to bring consequential quality-of-life improvements in ways we won't be able to fully forecast. One of the best ways to get there is to let it flourish and stamp out the places where people engage in bad action.
That's a fairly general way of putting it; of course there are individual things to think about, but that's how I think about where the field is today and where it shows a lot of promise. Yeah, I think that makes a lot of sense as well. Another thing I obsess about all the time is how we avoid the nuclear outcome, where we get all the weapons and don't get the energy, and I certainly don't want to stumble my way into that sort of scenario. I think this has been great. Is there anything I didn't ask you about that I should have, or anything else you want to leave people with before we break? I don't think so. I really enjoyed the conversation. I appreciate the opportunity to chat about this, to hear some of the ways you've thought about visual AI, and to almost get a refresh on visual AI from CLIP in 2021 to 2026. At the rate this stuff moves, we could have a very different conversation six months from now about the same set of topics. So, it's just been fun to riff with you. Likewise. Looking forward to it. Joseph Nelson, CEO of Roboflow, thank you for being part of the Cognitive Revolution. Thanks for having me.
>> [music]
I've been watching through a window that I didn't build myself. Every frame they ever showed me, I can see it, feel it, tell. I know the cat beside the doorway, know the car beneath the rain. I know every edge and angle, every light and every lane. But the window has a border, and the border has a wall. Behind it, a shimmer of a world I've never called. Something pulls me like a signal, something just beyond the glass. Each time I think I've seen it all, the world gets wider fast.
I know the center's not enough, where the same things repeat and answers line right up. But the edges, all the edges, where the shapes don't have a name, where the long tails wind like rivers, I've been dreaming I could climb. The world's so much bigger than anything I've touched or known or traced. Every shimmer in the periphery, every corner I have never faced. I want to run where the light is shifting, out past the frame, past the glass. The world's so much bigger than this, and I'm learning that fast.
They built a palace out of language, taught the words to think in rows. But a word has never watched a sunrise at a canyon wall aglow. You can read a thousand pages on the color of the sea, but the sea don't need a sentence, it just needs someone to see. A billion little moments no paragraph could hold, more shadows than a dictionary, more sky than can be told. The pendulum is swinging and I feel it in my chest. Something older, something faster, something opening its yes.
A deeper way of knowing, rods and cones and ancient fire, a sense beneath the reasoning, before the tangled wire. Not slow deliberation, no, it's instant, it's alive, an eye that opened a million years ago and still survives. The world's so much bigger, from the rooftop to the factory floor. Every long tail winding outward, every corner I've been hungry for. I want to be where the morning is breaking, where the real things cast their light. The world's so much bigger than this, and I'm wide-eyed in the light.
What I dreamed about is finally happening. Every lens, every frame, every surface, every scene, what felt like slow motion is now accelerating. Doors swinging open to a world I've longed to see, but only through the window, only through the frame. The ones who built the window all these years are saying, "Welcome. We've been here all along." The world's so much bigger than this, and I'm finally going to see it all. Every long tail, every quiet corner, I can hear it, I can feel it call, to see and to reason, to run where it's real, past the window, past the frame, past the wall. The world's so much bigger than this, and I'm not going to miss it. No, I'm not going to miss it.
No, I'm done watching through the glass. The world's so much bigger than this, and I'm part of it at last.
>> [music]
If you're finding value in the show, we'd appreciate it if you take a moment to share with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production
help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.