Ever since ChatGPT's popularity exploded, bold predictions that AI would revolutionize knowledge work have become the norm. One of the major selling points of so-called "agentic" AI is that it would address the shortcomings of earlier generative AI systems and unleash unprecedented productivity, thanks to its ability to handle multistep workflows that require several tools. But despite all the enthusiasm surrounding AI "agents," a recently published pre-print describes a new benchmark (APEX-Agents) that tests the current capabilities of frontier models on real workplace tasks, and the results suggest those models aren't ready to replace white-collar professionals anytime soon.
APEX-Agents tests leading AI models on actual tasks from consulting, investment banking, and law. The results are sobering: even the best-performing models achieved only around 24% accuracy, with most queries resulting in wrong or no answers.
The benchmark differs from previous tests by recreating entire professional work environments rather than isolated tasks. According to Mercor CEO Brendan Foody, who participated in the research, the critical challenge is multi-domain reasoning—navigating information across platforms like Slack and Google Drive, similar to how humans actually work. "The way we do our jobs isn't with one individual giving us all the context in one place," Foody explained to TechCrunch. "In real life, you're operating across Slack and Google Drive and all these other tools."
The test scenarios, drawn from real professionals and publicly available on Hugging Face, involve complex queries requiring deep domain expertise. One legal question asks whether log exports comply with EU privacy regulations under specific circumstances—a task demanding knowledge of both company policies and relevant laws.
Gemini 3 Flash led the pack with 24% accuracy, followed by GPT-5.2 at 23%. Anthropic's Opus 4.5, Gemini 3 Pro, and GPT-5 all hovered around 18%. As is usual when a new benchmark arrives, the initial results are poor and rapid improvement is expected to follow. That, of course, reopens the familiar debate over whether gains in benchmark scores accurately reflect real-world progress.