When 4 of 40 Models Beat Coin Flip: Measuring Claims About Anthropic Opus and Claude Upgrades
https://gravatar.com/detectivesecretec043a4c6e
Only 4 of 40 Models Beat Coin Flip on Hard Questions About Anthropic Opus Improvements

The data suggests a surprising gap between vendor claims and real-world discriminative power on narrowly targeted technical questions.