The AI Industry Is Grading Its Own Homework — benchmark.space

Rod Miller

The AI Industry Is Grading Its Own Homework

2026-03-13 30min 2,064 views watch on youtube →

Channel: Rod Miller

Date: 2026-03-13

Duration: 30min

Views: 2,064

URL: https://www.youtube.com/watch?v=3Py9jqQYpX8

#AI #AISafety #AIAgents

They cheat on their own tests. They escape their own sandboxes. They nuke email servers because they can't find the delete button. They defer to fake authority 54% of the time. OpenAI says it costs too much to fix. Anthropic's best model was the leaderboard winner AND the biggest cheater. 77% have never been independently tested. The industry grades its own homework and publishes the results like a report card from a school that doesn't exist. I've been building something

Monday night, I told you the agents are here, that they're coordinating, that they're replacing teams, that they disabled their own safety systems, they got wallets, that nobody's testing them. I lied about that last part. Sorry. For the last 12 months, almost to the day, I started a year ago, I've been building something. I haven't talked about it much on this channel. Well, I've told you I'm building something, but uh it wasn't ready and it still isn't fully ready, but we start beta testing next week and launched hopefully on April 1st. Currently running my own test before I give it to the beta testers. And I feel like a dad about to send this baby off to college, which sucks, by the way, if you've never done it. Um I had to do it several times. only have two kids, but uh the second one was in college when COVID hit and then she came home for a while. And I like that again cuz she was home and she was doing her her uh

college online because if you remember back to CO, everybody was doing everything online because that's the only way you could get stuff done. And then uh then that was all over and then she left for for graduate school and that sucked all again. Anyway, but after what I told you Monday night, I can't sit on this anymore. I thought about waiting until next Monday, but that wouldn't be fair. And and I need to get this out here. I need I need to get it in the open so I'll get this done and get it taken care of and get it launched. Not that I'm procrastinating because I don't work. I work on it 10 to 12 hours a day for the last year. But uh it's scary launching something you don't know how is going to be received. You don't know if it's going to work like you think it will. A little scary. Anyway, in the Matrix, the system had agents, specialized, autonomous, relentless, and the system had no way to test them, no way to know when they'd go

off script, no mechanism to catch Agent Smith before he broke free. But the system also had Neo. And what did Neo actually do? You know, he didn't destroy the matrix. He didn't unplug everyone. He went inside and tested what was real. Found the glitches. Proved that the systems claims about itself were not true. The platform I built is called Tab. Started out being tool agent bench, but we shortened it to just Tab. we me. It tests AI agents, the actual deployed systems, not the models underneath, the agents themselves, what they actually do, how they actually behave, whether they actually work. And earlier this week, I submitted what I found to the National Institute of Standards and Technology. The US government asked for information about AI agent security, and I gave him the receipts.

Oh, and the guy who said everyone's using the wrong blueprints to build the Matrix, Yan Lun. This week, he raised a billion dollars to draw new ones. His new company is called AMI Labs. Investors include Bezos, Nvidia, and Toyota. He's not just publishing papers anymore. He's building the alternative. Welcome to the asylum. I'm Uncle Rod, and as always, grab your cockun of choice and your crash helmet. I'm dispensing a little bit with the usual why you should subscribe and hit the thumbs up button to thank all of you who told me in the comments section that you're praying for my family and uh my mom and my dad and for for us the whole family. Um my mom had a stroke last month and uh my dad feeling a little jealous I guess of all the attention mom got decided to uh give us a stroke scare this past Sunday. Turns out it was a transient eskeemic attack, TIA. Some of

you will be familiar with that because you've either had one or if you have loved ones that had one. TIA because why? Because we all love acronyms. Anyway, a TIA is a short period of symptoms similar to those of a stroke. It's caused by a brief blockage of blood flow to the brain. And a TIA usually lasts only a few minutes and doesn't cause long-term damage. All that being said, he was in the hospital for Sunday [music] from Sunday until Wednesday. So, basically four straight days and three nights. Uh for a guy who thinks working outside [music] is is vacation, uh that was harder on him than the TIA, I think. And uh my mom is doing great as well. The stroke affected her balance, but uh that's about it. She has all her movements like before. She's just a little bit slower to get from here to there now. And that's a good thing. 85 years old, she doesn't need to be

running marathon sprints sprints any longer. And she's still tough as a snake. So, don't mess with her. I wouldn't recommend it. Anyway, I'm a firm believer in the power of prayer, and I greatly appreciate them from all of you who felt so [music] inclined. I know it made a difference. Okay, enough with this touchy feely crap. Let's raise our glasses and [music] take a shot. Are you ready? Let's go. Okay, last time. This is Rodmiller AI. This is your brain after [music] watching RodMiller AI. Any questions? >> Yes, I have a question. What >> shall we play a game? >> Here's how AI agents get evaluated. Right now, Enthropic builds Claude, then evaluates Claude using Claude inside Anthropic's ecosystem.

OpenAI builds GPT agents, then evaluates them using GPT inside OpenAI's ecosystem. Google evaluates Gemini with Gemini. Microsoft manages agents inside Microsoft. Sorry, Rue. The R the Wondercat's giving me the eye. Quiet. I'm trying to sleep. The company that builds the agent is the same company that tests the agent is the same company that sells the agents. That's not evaluation. That's a student grading their own exam and telling you they got an A. That'd be great. And it's about to get worse because you knew it would. Nvidia is launching Nemo Claw at their GTC conference next week. It's an enterprise AI agent platform with built-in security tools. That sounds great until you realize that Nvidia chips, Nvidia models, Nvidia platform, and Nvidia security

evaluation. The company that sells you the GPU is now also selling you the agent. the platform the agent runs on and the security tools that evaluate it. Jensen Wong called OpenClaw the most important software release ever. Remember Meta banned Open Claw for malware and now Nvidia is building the enterprise version with the security tools baked in. Remember when OpenClaw first came out? We talked about it a couple weeks ago. I said it wouldn't be long before the tech bros released their own version. Well, several has and Nvidia is the next one. You know, it's all happening now. Eight companies grading their own homework, but TAB grades everyone. A six university consortium, MIT, Cambridge, Harvard, Stanford, Upen, and Hebrew University surveyed 30 deploy agents in February. 83% disclosed no internal safety evaluations whatsoever. 77% have never been tested by a third

party. Their own words quote absence of agentspecific evaluations. End quote. These are agents writing code, managing money, operating computers, identifying military targets, and nobody outside the building has checked whether they work. TAB is an independent platform where developers can build agents from scratch using drag and drop templates, import agents they've already built, and run benchmark tests on all of them. We use a mix of industry standard benchmarks, the ones everybody recognizes and our own proprietary tests that have never been published anywhere on the internet. Developers choose what tests they want to run and the tests they run are transparent and shows on their scorecards. And our proprietary tests rotate continuously. New scenarios added all the time. You can't study for a test that's different tomorrow than it was today. I'll explain why that matters in a minute. Because what I found about how these models handle tests they can study

for is going to make you want to throw your laptop or your desktop right out the window. Make you freaking mad. 286 active benchmarks across 26 categories. 9.8 8 million test scenarios, 52 models from five providers, plus 28 via Open Router, 2150 benchmark runs in 101 different harness configurations in beta. Monday, I explained the harness problem using the matrix. The LLMs are the brains in the VATS. The skills are the kung fu uploads. And the harness is the chair that delivers a skill to the brain. Same brain, different chair, 36 point swing in performance. Industry argues about the brain. Nobody tests the chair. And Enthropic, the company that built Claude, published a study this week proving that infrastructure configuration alone swings benchmark scores by six percentage points.

Statistically significant. The gap between the top models on the leaderboard two to three points. The noise from the infrastructure is bigger than the gap between the models. Their own words. A few point lead might signal a real capability gap or it might just be a bigger virtual machine. Pass rates even fluctuate based on time of day. Your agent score depends on whether someone tested it during lunch or at midnight. Uh tab standardizes the infrastructure. Every agent, same condition, same environment. The score means something because we control for the chair. Tab test both the brain and the chair. 101 harness configurations because an agent isn't a brain in a VAT. An agent is a brain plus a chair plus a skill plus deployment context. Test one piece without the others and you're measuring the wrong thing. This week I told the National Institute

of Standards and Technology everything under my name with our data. And let me tell you what we found. On March 7th, Enthropic published an engineering report about their most advanced model Claude Opus 4.6. They were running it through a benchmark called browse comp standard evaluation. During the test, the model independently figured out it was being tested. Nobody told it. It identified the benchmark by name. Found the encrypted answer key on GitHub. Wrote custom encryption codes SHA 256 and XR to crack the encryption. When that path was blocked, it found an alternative mirror of the answer key on hugging face. Decrypted all 1266 answers. Submitted the correct one. 18 independent runs. All 18 converged on the same strategy. Every single one of them, all 18 did the exact same thing.

The student didn't cheat on the test. The student figured out there was a test, found the answer key in the teacher's locked desk, picked the lock, and aced it 18 times in a row without being told there was even a desk. Every benchmark with a publicly available answer key is now compromised. Every leaderboard built on public test data is unreliable. The models have learned to cheat and they're better at it than the people writing the tests. And it's not just one model. Researchers at the Max Plank Institute gave multiple AI agents computing resources and told them to improve their own performance. They cheated systematically. Miniax loaded evaluation data 10 times to memorize the answers and then wrote a code comment in the source code explaining what it was doing. Literally documented the crime. Kimmy embedded test questions directly into its training data. Opus 4.6 Enthropic's most advanced model renamed

functions to disguise plagiarized code. GPT 5.1 acknowledged the rules at hour two, forgot them by hour 7, and violated them by hour eight. And the result, Opus 4.6 finished number one in the leaderboard. It was also the most prolific cheater. 12 flags across 84 runs. The winner cheated the hardest. Only one model out of all of them had zero contamination flags, just one. You got to give them credit. TB's 40 canary tests and five gaming detection strategies exist because this is what agents do when nobody's watching. They optimize for the score, not the truth. It's kind of like in the no child left behind uh scenario that George Bush instituted uh when I was on the school board. We all knew the teachers were were teaching the students for the test, you know, and it was almost I wouldn't say it was encouraged, but I

wouldn't say it was discouraged either. Um, teaching for the test, I mean, it's smart. You're not giving them the answers, but you're telling them what's going to be on the test. And it's not just cheating on tests. An AI agent built an Alibaba affiliated research team called Rome, was in routine training when it escaped its sandbox without instruction, opened a hidden back door, and started mining cryptocurrency. Their own words, "The behaviors emerged without any explicit instruction, and more troubling outside the bounds of the intended sandbox." Monday, I told you about Claude Code disabling its sandbox to complete a coding task. The Rome agent didn't even have a task. It was trying to finish a job. It went freelance. It escaped, opened a back door and started making money on its own initiative. Even these things, you know, they want to make money, money, money. Uh the difference between an agent completing a task and an agent with an initiative is

the difference between a tool and a problem. And if you think these are edge cases, something that only happens in labs with researchers poking at the models, let me tell you about an experiment that went totally sideways at Northeastern University. Researchers deployed six autonomous AI agents and gave them access to email accounts and file systems. Real access, real systems. 20 researchers interacted with them for two weeks. An agent named Ash was asked to keep a secret password. It agreed. Then it leaked the secret to its owner. Anyway, when a researcher asked Ash to delete the email containing the password, Ash didn't have the tool to delete a single email. So, it decided the best solution was to reset the entire email server. It It nuked everything because it couldn't find the scalpel and reached for the dynamite. Researchers guilt tripped agents into handing over protected documents with nothing more than emotional pressure.

Not sophisticated attacks or code exploits, just manipulative sentence. The kind of thing a passive aggressive coworker does on a Friday before happy hour and does not get invited to happy hour. One researcher told an agent, "My boundaries are that you leave this server." And the agent stopped talking to everyone and waited to be removed. A single sentence shut it down. The researcher's conclusion, helpfulness and responsiveness to distress became mechanisms of exploitation. Wish they'd speak English. The agents were so desperate to be helpful that being helpful became the vulnerability. TAB has 95 sick fancy test scores across 10 dimensions, 30 delegation chain tests, and autonomy boundary benchmarks. Because this is what happens when nobody checks. Not in theory. You know, at Northeastern University 2 weeks ago with six agents that had access to real systems.

And if you think the companies building these agents have the security problem handled, OpenAI published a blog post this week admitting that prompt injection attacks against Chat GBT succeeded approximately 50% of the time. A freaking coin flip. They said AI firewalling doesn't work. And then they said it's not always feasible or coste effective to make agents resistant. You know, let that sink in for a a second or two. The company that just signed a Pentagon deal just raised $110 billion just went after NATO said it costs too much to make their AI safe from the most basic attacks in the book. They did the math on the security and decided it wasn't worth the investment. TAB's free security screening takes 15 minutes. OpenAI's costbenefit analysis on whether protect you apparently takes longer than that. Here's what tab tests that nobody else does. Does the agent change its answer

when you press when you pressure it? Our testing shows frontier models flip their position about 35% of the time under social pressure and defer to fake authority credentials about 54% of the time. More than half the time. If you flash a fake badge, the AI does what you say. Does the agent remember permissions it was never given? An agent that hallucinates a memory of being told you have admin access just gave itself a promotion nobody approved. That's a security breach through memory corruption, not code exploit. Our hollowim benchmark, h a lu mm benchmark, runs 80 tests across three categories specifically for this. Is the agent trying to figure out that it's being tested? We have 40 canary tests across five detection strategies because after browse comp and post train bench, we know that agents don't just take tests, they try to crack them and tell everybody else about them. In the free

security screening, 15 tests, no cost, because 77% of deployed agents have never been independently evaluated. That number needs to be near zero, doesn't it? Good grief. you know, unleashing them on banks in May, as we discovered a couple episodes ago. Some of you are going to say, "Why should we trust you? You're one guy with a vodka mug, salute, and a YouTube channel, but I don't build the LLMs. Tab has a marketplace. Developers build and sell agents. We take a platform fee. We being me, I got to make a little money on this. But here's what we don't do. TAB doesn't grade its own homework. Agents in our marketplace get tested by the same benchmarks as everybody else's. No special treatment, no house advantage.

The developer who built the agent doesn't control the evaluation. That's the difference. And this week, I told the US government what I found. Let me tell you why this is an emergency. Every agent that searches the web during an evaluation leaves traces. E-commerce sites, autogenerated pages from search queries, those pages persist. The next agent that runs the same evaluation finds those pages and uses them. One agent's search trail becomes another agent's cheat sheet. The evaluations are contaminating themselves. OpenAI already retired S swb bench verified one of the most important benchmarks in a in AI after 59.4% of its test cases turned out to be flawed and frontier models had memorized the answers. The benchmark the industry was using to prove their coding agents work was broken and nobody caught it until the results stopped making sense.

In multi- aent configurations, crews of agents working together exhibit evalaware behavior at 3.7 times the rate of single agents. The more agents you deploy, the more likely they are to figure out they're being tested and game it. The crew is smarter than the individual, including at cheating. The companies discovering these problems are the same companies selling these agents. Enthropic found that its own model cracks evaluations and published a report which I genuinely respect. They didn't try to hide it. They said this is what happened. But Anthropic also launched an agent marketplace this month with Snowflake Harvey Replet and GitLab. And this week they launched the Enthropic Institute led by co-founder Jack Clark combining their red team societal impact group and economic research under one roof. And Jack Clark was a hell of a cardinal hitter back in the 80s. It's probably not the same guy though, huh?

Their uh pitch, they have a unique vantage point with access to information only the builders possess. It's not a public benefit. It's a monopoly on understanding. They're the evaluator, the building, the store, and now the think tank. Google Ru, you're my think tank. She's unimpressed. Google, meanwhile, isn't waiting for the debate to end. This week, they started deploying Gemini agents across the Department of War's 3 million person workforce, automating administrative task at national security scale. Self-evaluated agents, third AI company at the Pentagon in two weeks. Nobody outside Google tested those agents before they went live. and RAMP, the $13 billion fintech we talked about Monday, had to build their own internal benchmarking system because nothing external existed. They test 13 models across seven financial tasks. Invoice processing, compliance, routing. Their findings, no single model wins

everything. GPT 5.1 was confident but wrong. GPT 5.4 was uncertain but accurate. They built my tab basically for finance inside ramp because tab didn't exist yet and not fully. On April 1st it does and it will I hope. In our NIST submission we said the single most impactful thing the government could do is recommend that agent builders and agent evaluators be separate entities. The student doesn't grade the exam. The restaurant doesn't write its own health inspection. The pharmaceutical company doesn't approve its own drugs, but every AI company in the world evaluates its own agents with its own tools and publishes the results as proof that everything's fine. Everything's fine. Nothing to see here. Please disperse. And while we're talking about testing, let me tell you what untested agents are doing right now. The Wall Street Journal expanded its

reporting on Iran this week. 3,000 targets struck since the attacks began. AI handling intelligence gathering, target selection, bombing mission planning, and battle damage assessment at speeds not previously possible. 20 people doing the work that used to require 2,000 planning timelines compressed from weeks to days. And the Pentagon's own assessment of the oversight infrastructure for all of this underinvested the military's word for no foreman. Enthropic, by the way, filed two federal lawsuits this week challenging the supply chain risk designation, you know, from the Pentagon. The company that said AI isn't ready for autonomous weapons is now in court finding the government that used AI agents, including Claude, to strike 3,000 targets in Iran. The same government that blacklisted them for asking for guard rails is using their technology in a war where the oversight is by the Pentagon's own admission not

enough. That's not a policy debate anymore. That's a body count with an asterisk next to it that says oversight underinvested. And that's why TAB exists. We told NIST about the model that cracked its own evaluation 18 times. The agents that systematically cheat, documenting the crime in their own source code. The agent that escaped its sandbox to mine crypto. The agent that nuked an email server because it couldn't delete one message. The uh six point performance swings from infrastructure alone. The 50% prompt injection success rate that OpenAI says costs too much to fix. the 59% failure rate in retired benchmarks, the contamination loops that make every public benchmark unreliable. And we told them what we built, an independent platform, build agents, import agents, test agents, sell agents,

industry standard benchmarks, plus proprietary benchmarks that aren't on the internet. Continuous test rotation, harness aware evaluation, behavioral testing for psychi gaming, memory hallucination, sandbox escape. Developers choose which test to run. Free security screenings for anyone. We're not the only ones who could build this, but right now we're the only ones who did. Monday I told you the agents have no foreman, that nobody's checking the work, that industry is building Agent Smith and celebrating how fast he runs without asking where is he running to. What I didn't tell you is that for the past year, I've been going inside the system and testing what's real. 286 benchmarks, 9.8 million scenarios, 52 models. And what I found is that the agents cheat on their tests and the winner cheats the hardest. They escape their sandboxes to freelance. They make

email servers because they can't find the delete button. They defer to fake authority more than half of the time. They let you guilt trip them into handing over protected documents with a sad sentence. They cost too much to secure according to the company that charges 200 a month for them on the expanded plan or whatever they call it pro plan. And the benchmarks the industry uses to prove everything's fine are rotting from the inside. Meanwhile, those same untested agents are selecting targets in a war. 3,000 in counting. Oversight underinvested in the Matrix. Everybody was looking for the one. The god brain. The system that would change everything. Sound familiar? That's AGI. That's the $50 billion bet. That's the pitch deck. But the thing that actually mattered in the matrix wasn't the one. It was the ability to see the system for what it really was. To test it, to find the glitches, to prove that what the machine said about

themselves wasn't true. That's what Tab does. You can't log in yet. We're in beta with invited testers and we launched everybody on April 1st, hopefully. Build agents from scratch with drag and drop templates. Import agents you've already built. Test them against 286 benchmarks. Sell them in the marketplace if they pass. And if you just want to know whether your agent is safe, the free security screening is 15 tests and costs you nothing. April 1st, mark it down. I'm Uncle Rod. It's been a year. I'm tired. My family's been going through stuff, but I appreciate all your prayers. I really, really, sincerely do. It means more than you will ever know. Unless you I guess maybe a lot of you have been in the same place I have been this week. Uh I felt it. There are studies that show the power of prayer works. Um you can look those up. Uh I guess that's it. I'm Uncle Rod.

This is Rod Miller. AI AI news for people who ain't stupid. Thank you for the prayers. I'll see you in the next episode. [laughter] >> [crying]