"Anthropic published a study this week proving that infrastructure configuration alone swings benchmark scores by six percentage points. Statistically significant. The gap between the top models on the leaderboard is two to three points. The noise from the infrastructure is bigger than the gap between the models."
Rod Miller
AI commentator, builder of the TAB (Tool Agent Bench) platform