
These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants. That’s why some experts think they’re a promising way to […]

© 2024 TechCrunch. All rights reserved. For personal use only.



Researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor have developed an AI benchmark built from riddles on NPR's Sunday Puzzle. The benchmark tests AI reasoning models on problems that can be solved with general knowledge, unlike many existing benchmarks that rely on specialized expertise. The study found that while models like OpenAI's o1 performed well, others such as DeepSeek's R1 sometimes gave incorrect answers or "gave up," revealing limitations in AI reasoning. The researchers plan to expand their testing to identify areas where AI models can improve.


