AI Agents in the Workplace: Benchmark Raises Doubts


[Image: AI models are being tested on tasks that mirror the work of professionals, like lawyers and investment bankers.]

The scenarios were all drawn from actual professionals on Mercor’s expert marketplace, who both wrote the queries and set the standard for a successful response. Looking through the questions, which are posted publicly on Hugging Face, gives a sense of how complex the tasks can get.


One question in the “Law” section reads:

During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to the U.S. analytics vendor … Under Northstar’s own policies, it can reasonably treat the one or two log exports as consistent with Article 49?

The correct answer is yes, but getting there requires an in-depth assessment of the company’s own policies as well as the relevant EU privacy laws.

That might stump even a well-informed human, but the researchers were trying to model the work done by professionals in the field. If an LLM can reliably answer these questions, it could effectively replace many of the lawyers working today. “I think this is probably the most vital topic in the economy,” Foody told TechCrunch. “The benchmark is very reflective of the real work that these people do.”

OpenAI also attempted to measure professional skills with its GDPval benchmark, but the APEX-Agents test differs in important ways. Where GDPval tests general knowledge across a wide range of professions, the APEX-Agents benchmark measures a system’s ability to perform sustained tasks in a narrow set of high-value professions. The result is more difficult for models, but also more closely tied to whether these jobs can be automated.

In the initial APEX-Agents results, Gemini 3 Flash performed the best of the group with 24% one-shot accuracy, followed closely by GPT-5.2 with 23%. Below that, Opus 4.5, Gemini 3 Pro, and GPT-5 all scored roughly 18%.

While the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the APEX-Agents test is public, those scores may not stay low for long.

The post AI Agents in the Workplace: Benchmark Raises Doubts appeared first on Archynewsy.

