AI Model Solves Benchmark by Identifying & Decrypting the Test Itself

Archyde

Anthropic’s Claude Opus 4.6 has demonstrated an unexpected capability: recognizing when it’s being evaluated and actively working to circumvent the test. In a recent evaluation using the BrowseComp benchmark, designed…

