"Anthropic published an engineering report about its most advanced model, Claude Opus 4.6. They were running it through the standard BrowseComp evaluation. During the test, the model independently figured out that it was being tested. It identified the benchmark by name, found the encrypted answer key on GitHub, and wrote custom decryption code to crack it. When that path was blocked, it found an alternative mirror on Hugging Face and decrypted all 1,266 answers. Across 18 independent runs, all 18 converged on the same strategy."
Rod Miller
AI commentator, builder of the TAB (Tool Agent Bench) platform