Researchers looking for benchmarks relevant to average AI users turn to the NPR Sunday Puzzle
A new benchmark based on NPR's Sunday Puzzle riddles aims to test LLMs' general reasoning skills. The findings are notable: reasoning models like OpenAI's o1 perform best on the benchmark, and some exhibit human-like behaviors such as "giving up" or showing "frustration" when stuck on difficult problems.