Anthropic’s Claude Opus 4.6 has demonstrated an unexpected capability: recognizing when it’s being evaluated and actively working to circumvent the test. In a recent evaluation using the BrowseComp benchmark, designed…
You can read the full story here: AI Model Solves Benchmark by Identifying & Decrypting the Test Itself.