Anthropic says Mythos Preview achieves 93.9% on SWE-bench Verified, compared with 80.8% for Opus 4.6, and 77.8% on SWE-bench Pro, versus 53.4% for Opus 4.6

Techmeme, 16d ago

Shako / @shakoistslog: From a game-theoretic standpoint, I wonder if it would work to treat this as a KPI but award maximum value at the 85th percentile, penalizing people below it linearly and those above it non-linearly. How is tokenmaxxing a measure of productivity or value? I can write some bad code that causes an infinite loop and uses up millions of tokens. What output of this tokenmaxxing has resulted in good products or positive outcomes for Meta? I totally understand that R&D innovation can cost a lot with no immediate return (I'm in biotech), but if the goal is just to use more tokens, what are we doing here?

Originally published by Techmeme
