Researchers Use NPR’s Sunday Puzzle to Test AI’s Reasoning Abilities

Every Sunday, NPR’s Will Shortz—renowned for his role as The New York Times’ crossword puzzle editor—challenges listeners with a brain-teasing segment called the Sunday Puzzle. These puzzles, designed to be solvable without specialized knowledge, still manage to stump even the sharpest minds. Now, researchers are using them to push the limits of artificial intelligence.
A team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the AI startup Cursor has introduced a new benchmark to evaluate AI reasoning skills. By testing models like OpenAI’s o1 against riddles from NPR’s Sunday Puzzle, they uncovered some intriguing behaviors, such as AI models “giving up” and knowingly providing incorrect answers.
A New Approach to AI Benchmarking
“We wanted to develop a benchmark that focuses on problems humans can solve with general knowledge,” explained Arjun Guha, a computer science professor at Northeastern University and a co-author of the study.
The AI industry faces a benchmarking challenge: many existing tests assess specialized expertise, such as PhD-level mathematics and scientific reasoning, which don’t reflect the everyday problem-solving skills of the average user. Furthermore, many standard AI benchmarks are nearing saturation, making it difficult to measure real progress.
What makes the Sunday Puzzle an effective benchmark? Unlike traditional datasets, it avoids esoteric knowledge and cannot be solved through rote memorization. Instead, AI models must rely on logic, pattern recognition, and creative problem-solving—just like humans.
“These puzzles are tricky because you often can’t make progress until a key insight clicks into place,” Guha noted. “That requires both intuition and a process of elimination.”
Surprising AI Behaviors
Of course, no benchmark is flawless. The Sunday Puzzle is U.S.-centric and only available in English. Moreover, since past puzzles are public, there’s always a possibility that AI models trained on them could ‘cheat,’ though Guha’s team has found no evidence of this so far.
With around 600 riddles in the benchmark, reasoning models such as OpenAI’s o1 and DeepSeek’s R1 have significantly outperformed traditional AI models. These reasoning models tend to fact-check their answers before finalizing them, leading to more accurate results—but at the cost of longer response times.
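To make the setup concrete, here is a minimal sketch of how a puzzle benchmark of this kind is typically scored: each riddle is a question/answer pair, and a model's accuracy is the fraction of answers that match after normalization. The record format, the `model_answer` callable, and the exact-match rule are illustrative assumptions, not the researchers' actual evaluation pipeline.

```python
# Hypothetical sketch of scoring a puzzle benchmark.
# Dataset format, model_answer(), and the exact-match rule are assumptions
# for illustration only, not the study's real code.

def normalize(text: str) -> str:
    """Lowercase and drop non-alphanumeric characters so answers compare fairly."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def score_model(puzzles: list[dict], model_answer) -> float:
    """Return the fraction of puzzles answered correctly.

    `puzzles` is a list of {"question": ..., "answer": ...} records;
    `model_answer` is any callable mapping a question string to an answer string.
    """
    correct = 0
    for puzzle in puzzles:
        guess = model_answer(puzzle["question"])
        if normalize(guess) == normalize(puzzle["answer"]):
            correct += 1
    return correct / len(puzzles)

if __name__ == "__main__":
    # Trivial stand-in "model" for demonstration.
    sample = [{"question": "Name a fruit that is also a color.", "answer": "orange"}]
    print(score_model(sample, lambda q: "Orange"))  # -> 1.0
```

In practice, real evaluations of free-form puzzle answers often need looser matching (or a human/LLM judge) than the strict exact match used above; that choice is one reason benchmark scores can vary between reports.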
Interestingly, DeepSeek’s R1 occasionally admits defeat, stating, “I give up,” before offering a completely random (and incorrect) answer—something many humans can relate to. Other models exhibit peculiar behavior, such as retracting correct answers, overanalyzing simple problems, or getting stuck in endless loops of incorrect reasoning.
“In some cases, R1 even says it’s ‘frustrated,’ which is fascinating,” Guha noted. “It mimics human-like reasoning struggles, but we still don’t know how frustration impacts AI performance.”
How Do AI Reasoning Models Rank?
Currently, OpenAI’s o1 leads the benchmark with a score of 59%, followed by o3-mini (set to high reasoning effort) at 47%, with R1 trailing at 35%. Looking ahead, the researchers plan to expand their testing to additional AI models, hoping to identify areas for improvement and refine the capabilities of these reasoning-driven systems.
“You don’t need a PhD to be good at reasoning, so AI benchmarks shouldn’t require PhD-level knowledge either,” Guha emphasized. “Making AI evaluation more accessible allows more researchers to analyze results, potentially leading to better AI advancements. As these models become more integrated into daily life, everyone should be able to understand their strengths—and limitations.”
AI Benchmarked
Researchers have introduced an innovative benchmark using NPR’s Sunday Puzzle to evaluate AI reasoning models. The puzzles, designed to be solvable with general knowledge, present a unique challenge that tests AI's problem-solving capabilities. Unlike traditional benchmarks that assess specialized expertise, these puzzles require logical reasoning, creativity, and pattern recognition, making them an excellent tool for measuring AI intelligence.
In the study, AI models like OpenAI’s o1 and DeepSeek’s R1 were tested against hundreds of puzzles. Surprisingly, some models displayed human-like behaviors, such as "giving up," second-guessing correct answers, or even getting stuck in endless loops of incorrect reasoning. The findings suggest that AI reasoning is still evolving and that these benchmarks provide valuable insights into AI limitations and strengths.
The research highlights the importance of developing AI benchmarks that are accessible and applicable to real-world scenarios. With AI becoming an integral part of daily life, ensuring that these models can think and reason effectively is crucial. As AI technology advances, studies like this will help improve AI’s ability to solve complex problems and make better decisions.
About the Author

Michael David is a visionary AI content creator and proud Cambridge University graduate, known for blending sharp storytelling with cutting-edge technology. His talent lies in crafting compelling, insight-driven narratives that resonate with global audiences. With expertise in tech writing, content strategy, and brand storytelling, Michael partners with forward-thinking companies to shape powerful digital identities. Always ahead of the curve, he delivers high-impact content that not only informs but inspires.