A pervasive challenge with benchmarks for large language models is that they tend to test skills like PhD-level biology knowledge, which, while impressive, are far removed from more familiar kinds of reasoning that require less specialized knowledge. That disconnect helps explain why the internet is rife with examples of model failures that go viral precisely because of their obviousness and simplicity, like the infamous case of LLMs being unable to count the 'R's in 'strawberry', or the criticism that, for all its reasoning power, o1 still struggles to consistently find a winning strategy in tic-tac-toe.

In an attempt to test LLMs' more general reasoning skills, a team of researchers from Wellesley College, Northeastern University, the University of Texas at Austin, and other institutions has developed a new AI benchmark built from riddles featured on NPR's Sunday Puzzle segment. Using approximately 600 of these riddles, the researchers found that advanced reasoning models such as OpenAI's o1 and DeepSeek's R1 significantly outperform other models: o1 achieved the highest score at 59%, o3-mini reached 47%, and DeepSeek's R1 scored 35%. Beyond the raw scores, the study also surfaced some telling quirks in how these reasoning models approach problem-solving.
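
To make the scoring concrete: a result like "59%" is simply accuracy over the riddle set, i.e., the fraction of the roughly 600 puzzles answered correctly. The sketch below illustrates that computation under stated assumptions; `ask_model`, `normalize`, and the exact-match check are hypothetical stand-ins, not the authors' actual evaluation harness.

```python
# Hypothetical sketch of a riddle-benchmark scoring loop; not the authors' code.

def ask_model(riddle: str) -> str:
    """Placeholder for whatever API call returns a model's answer to one riddle."""
    raise NotImplementedError

def normalize(text: str) -> str:
    """Crude normalization so e.g. 'Strawberry.' matches 'strawberry'."""
    return text.strip().lower().rstrip(".")

def evaluate(riddles: list[dict]) -> float:
    """Each item is {'question': ..., 'answer': ...}; returns accuracy in [0, 1]."""
    correct = sum(
        normalize(ask_model(item["question"])) == normalize(item["answer"])
        for item in riddles
    )
    return correct / len(riddles)

# At ~600 riddles, a 59% score corresponds to roughly 354 correct answers.
```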

The researchers note that while working through one of these puzzles, a model will sometimes express "frustration", occasionally stating that it "gives up" before offering a random incorrect answer. The paper documents a case in which DeepSeek's R1 quite literally "gives up" and produces a random answer that misses the mark entirely. In another amusing case, the model openly admits that the answer it is considering is "a stretch" and that it is not certain it is correct. Other peculiarities include retracting wrong answers only to settle on equally incorrect ones, or offering unnecessary alternative solutions even after reaching the correct answer.

Source: https://arxiv.org/pdf/2502.01584

While the benchmark has limitations, including its U.S.-centric nature and English-only format, the researchers argue that it offers a more accessible way to evaluate AI reasoning capabilities. The team plans to keep updating the benchmark with new puzzles and to expand testing to additional reasoning models.