These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models
Researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor have developed an AI benchmark using riddles from NPR's Sunday Puzzle. This benchmark aims to test AI reasoning models with problems solvable by general knowledge, unlike many existing benchmarks that rely on specialized expertise. The study found that while models like OpenAI's o1 performed well, others such as DeepSeek's R1 sometimes provided incorrect answers or "gave up," revealing limitations in AI reasoning. The researchers plan to expand their testing to identify areas for improvement in AI models.