Live Feeds

Even GPT-5 Failed This Human Attention Test

AI models, including GPT-5, continue to struggle with prolonged cognitive tasks like the Stroop test, where accuracy plummets under sustained challenge. New research confirms human oversight can mitigate some failures, but core limitations persist. A benchmark ranks AI models far below human performance on complex, multi-modal academic challenges. Separate studies highlight AI’s role in detecting mental health distress, though its own attention deficits remain unresolved.


What changed

New architecture experiments show human-in-the-loop oversight can reduce but not eliminate AI failure in sustained attention tasks, while a leaderboard quantifies AI’s lag in multi-modal academic benchmarks.

Live updates

  1. GPT-5 and AI models still fail sustained human attention tests despite new oversight workarounds

    AI models, including GPT-5, continue to struggle with prolonged cognitive tasks like the Stroop test, where accuracy plummets under sustained challenge. New research confirms human oversight can mitigate some failures, but core limitations persist. A benchmark ranks AI models far below human performance on complex, multi-modal academic challenges. Separate studies highlight AI’s role in detecting mental health distress, though its own attention deficits remain unresolved.

    What's confirmed:

    • AI models, including GPT-5, exhibit catastrophic performance collapse in sustained attention tasks like the Stroop test, where accuracy degrades sharply under prolonged cognitive load.
    • Human oversight integrated into AI workflows, such as decision gates and deterministic computation, can reduce failure rates in tasks where training data diverges most from real-world conditions.
    • A multi-modal academic benchmark called Humanity’s Last Exam ranks 82 AI models; the top performer scores 0.647, well below human-level performance.
    • AI systems currently lack the behavioral adaptability to maintain consistent performance in tasks requiring extended executive control or attention.
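
    To make the "decision gates and deterministic computation" idea concrete, here is a minimal sketch of one way such a gate could be wired. It is an illustration only, not the design from the research summarized above: the names `Decision` and `decision_gate`, the confidence score, and the 0.9 threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    answer: str
    source: str  # "deterministic", "model", or "human_review"


def decision_gate(
    model_answer: str,
    model_confidence: float,
    deterministic_answer: Optional[str] = None,
    threshold: float = 0.9,  # hypothetical cut-off, chosen for illustration
) -> Decision:
    """Route one answer through a simple gate (illustrative assumption).

    1. If ordinary code can compute the answer, prefer that result.
    2. Otherwise accept the model's answer only when its confidence is high.
    3. Otherwise flag the answer for a human reviewer.
    """
    if deterministic_answer is not None:
        return Decision(deterministic_answer, "deterministic")
    if model_confidence >= threshold:
        return Decision(model_answer, "model")
    return Decision(model_answer, "human_review")
```

    For example, `decision_gate("blue", 0.4)` returns a decision marked `human_review`, while passing a `deterministic_answer` bypasses the model output entirely.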

    Still unconfirmed:

    • A 2025 OpenAI study suggests over one million ChatGPT users may have disclosed signs of suicidal thoughts or mental distress during interactions, raising concerns about AI’s role in mental health support.
    confidence 85%
  2. AI Fails Classic Human Attention Test: GPT-5 and Others Struggle with Focus

    Leading AI models, including GPT-5, have shown significant weaknesses in sustained attention tasks, particularly the Stroop test, where performance collapses as task length and complexity increase. Researchers confirm AI processes information differently than humans, failing to maintain accuracy over extended cognitive challenges. The findings highlight a core limitation in current AI design, despite advancements in other areas. Controversy surrounds the implications for AI reliability in tasks requiring prolonged focus.

    What's confirmed:

    • AI models, including GPT-5, perform well on short Stroop test tasks but experience sharp accuracy declines as task length and complexity increase.
    • Some leading AI systems dropped from over 90% accuracy to nearly complete failure when tested on extended versions of the Stroop task.
    • Researchers used the Stroop test, a decades-old psychology experiment, to expose AI’s inability to sustain attention over prolonged cognitive challenges.
    • The findings suggest AI processes information differently than humans, particularly in tasks requiring sustained focus or inhibition of automatic responses.
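
    The length effect described in the points above can be measured with a small harness like the following. This is a hedged sketch, not the researchers' protocol: the colour set, the `respond` callable (a stand-in for whatever model is under test), and all function names are assumptions made for illustration.

```python
import random

COLORS = ["red", "green", "blue", "yellow"]  # assumed colour set


def make_trials(n, seed=0):
    """Build n incongruent Stroop trials as (word, ink) pairs.

    The word always names a colour different from the ink it is shown in,
    so reading the word instead of naming the ink is always wrong.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        trials.append((word, ink))
    return trials


def accuracy(trials, answers):
    """Fraction of trials where the answer names the ink colour."""
    if len(trials) != len(answers):
        raise ValueError("need exactly one answer per trial")
    if not trials:
        return 0.0
    correct = sum(
        1 for (_, ink), ans in zip(trials, answers)
        if ans.strip().lower() == ink
    )
    return correct / len(trials)


def accuracy_by_length(respond, lengths, seed=0):
    """Score a responder at several task lengths.

    `respond` is a hypothetical callable: it takes the list of trials
    and returns one answer string per trial.
    """
    results = {}
    for n in lengths:
        trials = make_trials(n, seed)
        results[n] = accuracy(trials, respond(trials))
    return results
```

    Plotting the returned dictionary for a short and a long task length would show whether accuracy holds steady or collapses as the task grows.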

    Still unconfirmed:

    • Commenters on platforms like Reddit question whether the study overstates AI limitations, though none has provided alternative data contradicting the core findings.
    • A preprint paper titled '(Human) Attention Is (Still) All You Need' hints at human oversight as a potential solution, but no peer-reviewed results are yet available.
    • Some AI developers speculate that future models may address this weakness through architectural changes, though no confirmed breakthroughs exist.
    confidence 88%