Introduction
Sakana AI has launched Sudoku-Bench, a novel reasoning evaluation dataset designed to test whether large language models (LLMs) can think like humans when solving creative problems. Unlike traditional benchmarks that can be mastered through training data, Sudoku-Bench focuses on measuring the "aha" or "eureka" moments that characterize genuine creative problem-solving.
The benchmark evaluates models on 100 unique Sudoku variant puzzles without allowing tool use or code execution, forcing LLMs to rely on meta-reasoning and creative insight rather than brute-force computational approaches. This makes Sudoku-Bench particularly valuable for assessing reasoning capabilities that go beyond pattern matching and memorization.
Developed by researchers Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, and Llion Jones, Sudoku-Bench represents a significant step forward in AI reasoning evaluation. The benchmark is now available on Hugging Face and includes a public leaderboard tracking model performance across different puzzle sizes and evaluation modes.
What is Sudoku-Bench?
Core concept and design philosophy
Sudoku-Bench is built on the premise that Sudoku variants are unique and creative puzzles that can reveal whether an LLM possesses human-like reasoning capabilities. Each puzzle in the benchmark is designed to be seemingly unbreakable at first glance—even to expert solvers—but admits progress after a creative discovery is made.
The benchmark explicitly prohibits tool use and code execution, requiring models to solve puzzles through pure reasoning and natural language understanding. This design choice ensures that the evaluation measures genuine problem-solving ability rather than computational brute-force approaches.
Puzzle composition
The benchmark includes 100 carefully curated puzzles distributed across three grid sizes:
- 15 4x4 puzzles: Smaller grids that test fundamental reasoning patterns
- 15 6x6 puzzles: Medium complexity puzzles requiring intermediate reasoning skills
- 70 9x9 puzzles: Full-size puzzles that demand sophisticated problem-solving strategies
Each puzzle is unique, either featuring a novel ruleset or requiring a solving tactic never seen before. This diversity makes the domain of Sudoku variants more resistant to memorization than other benchmarks, similar in spirit to the ARC-AGI benchmark but with key differences in how constraints are presented and solved.
Evaluation methodology
Two evaluation modes
Sudoku-Bench supports two distinct evaluation configurations:
Single-Shot Mode
In Single-Shot mode, the LLM attempts to solve the entire puzzle grid in one response. This mode tests the model's ability to reason through the complete problem space and produce a correct solution without iterative refinement.
Multi-Step Mode
In Multi-Step mode, the LLM is prompted to provide one or more cell placements in each turn. The user displays the updated board, and the interaction continues until the LLM solves the puzzle or makes an incorrect move. This mode allows for iterative reasoning and tests the model's ability to make incremental progress.
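To make the turn-by-turn interaction concrete, here is a minimal sketch of what a Multi-Step evaluation loop could look like. The `query_model` callable and the `initial_board`, `solution`, and `rules` field names are assumptions for illustration; this is not the official Sudoku-Bench harness.

```python
# Minimal sketch of a Multi-Step evaluation loop. `query_model` is a
# hypothetical callable standing in for the LLM; field names are assumed.
def run_multi_step(puzzle, query_model, max_steps=100):
    """Ask the model for placements each turn until the puzzle is solved,
    an incorrect value is placed, or the step budget runs out."""
    board = [row[:] for row in puzzle["initial_board"]]
    solution = puzzle["solution"]
    correct_placements = 0

    for _ in range(max_steps):
        # The model sees the rules and the current board, and returns
        # one or more (row, col, value) placements for this turn.
        placements = query_model(puzzle["rules"], board)
        for row, col, value in placements:
            if solution[row][col] != value:
                return {"solved": False, "correct_placements": correct_placements}
            board[row][col] = value
            correct_placements += 1
        if board == solution:
            return {"solved": True, "correct_placements": correct_placements}

    return {"solved": False, "correct_placements": correct_placements}
```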
Performance metrics
The evaluation measures performance using two primary metrics:
- Average Solve Rate (ASR): The percentage of puzzles for which the model produces the complete and correct final solution grid. This is the primary metric for overall success.
- Average Correct Placements (ACP): Used specifically for Multi-Step mode, this metric tracks the average number of correct cell values placed before the puzzle is solved, an incorrect placement is made, or another termination condition (such as an API error or reaching a maximum number of steps) occurs.
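Both metrics are straightforward to compute from per-puzzle results such as the dicts produced by the loop sketched above. The field names below are assumptions, not the benchmark's official evaluation code.

```python
# Sketch: computing ASR and ACP from per-puzzle result dicts
# (field names are assumptions).
def average_solve_rate(results):
    """Fraction of puzzles whose final grid was completely correct."""
    return sum(r["solved"] for r in results) / len(results)

def average_correct_placements(results):
    """Mean number of correct cells placed before solving, making an
    incorrect placement, or hitting another termination condition."""
    return sum(r["correct_placements"] for r in results) / len(results)

results = [
    {"solved": True, "correct_placements": 16},
    {"solved": False, "correct_placements": 3},
]
print(average_solve_rate(results))          # 0.5
print(average_correct_placements(results))  # 9.5
```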
Leaderboard and transparency
Sakana AI maintains a public leaderboard at pub.sakana.ai/sudoku/ that tracks model performance across different puzzle sizes and evaluation modes. The leaderboard shows performance for both Single-Shot and Multi-Step configurations, allowing researchers to compare models across different reasoning approaches.
Why Sudoku-Bench matters
Measuring creative reasoning
Traditional reasoning benchmarks often focus on domains that can be readily mastered through sufficient training data. However, Sudoku variants are constantly evolving, with puzzle authors creating new rulesets and requiring novel solving tactics. This makes the domain inherently resistant to memorization and forces models to demonstrate genuine reasoning capabilities.
The benchmark is designed to measure the "eureka moment"—that critical insight where a solver realizes how constraints interact in non-obvious ways to create intermediate results and ultimately a break-in to the solution. Such insights are difficult to find in standard benchmarks but are essential for evaluating true reasoning ability.
Comparison with other benchmarks
Sudoku-Bench differs from other reasoning benchmarks in important ways:
Compared to ARC-AGI:
- ARC-AGI puzzles ask solvers to discover puzzle constraints by presenting a few examples, after which execution is often straightforward
- In Sudoku variants, all constraints are explicitly given as part of the puzzle, but a direct application of each rule in isolation typically yields no progress
- A creative process is required to see how constraints interact in non-obvious ways
Compared to vanilla Sudoku:
- Standard Sudoku puzzles have a single, compact tokenized representation that applies to all training and test samples
- Sudoku variants are so varied that natural language is needed to encode and represent each puzzle
- The rules and constraints of each variant are specified in natural language, making the benchmark best suited for LLM evaluation
Resistance to memorization
The constantly evolving nature of Sudoku variants means that each puzzle is unique—either through a unique ruleset or by requiring a solving tactic never seen before. This makes the domain more resistant to memorization compared to other benchmarks, ensuring that performance reflects genuine reasoning ability rather than training data overlap.
Dataset structure and availability
Challenge subsets
The Sudoku-Bench dataset includes multiple subsets:
- challenge_100: The main evaluation set of 100 puzzles used for the leaderboard
- nikoli_100: 100 standard Sudoku puzzles created by hand by Nikoli, a Japanese puzzle company
- ctc: 2,565 puzzles from the Cracking the Cryptic YouTube channel
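Since the dataset is published on Hugging Face, the subsets can presumably be loaded with the `datasets` library. The repository id and subset names below mirror the description above; check the dataset card for the exact identifiers and splits.

```python
# Sketch: loading the benchmark with the Hugging Face `datasets` library.
from datasets import load_dataset

challenge = load_dataset("SakanaAI/Sudoku-Bench", "challenge_100")
print(challenge)  # inspect available splits and fields

# The other subsets load the same way:
# nikoli = load_dataset("SakanaAI/Sudoku-Bench", "nikoli_100")
# ctc = load_dataset("SakanaAI/Sudoku-Bench", "ctc")
```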
Text-only representation
The puzzles in the core evaluation set are presented in a text-only format, making them accessible to language models without requiring vision capabilities. The broader dataset is also naturally applicable to vision-language models (VLMs), which may be needed for puzzles whose visual elements are too complex to express in text.
The benchmark explicitly selected the 100 puzzles of challenge_100 to be ones that admit a text-only representation, ensuring broad accessibility while maintaining evaluation rigor.
Integration with SudokuPad
The dataset includes tools for integration with SudokuPad, an application created by Sven Neumann. This integration allows for more interactive puzzle solving, including features like pencil marks and color-coding cells, which can be particularly useful for vision-language models.
Reasoning traces from Cracking the Cryptic
The dataset includes thousands of hours of reasoning traces from Cracking the Cryptic, extracted directly from YouTube videos, including:
- Text transcriptions of human reasoning
- Sequences of solving actions in SudokuPad
These traces provide valuable training and evaluation data for understanding how humans approach creative problem-solving in Sudoku variants.
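As a purely hypothetical illustration of how such a trace might be structured, a single event could pair a snippet of transcribed commentary with the SudokuPad actions taken at that moment. The field names and schema below are invented for clarity and may differ from the actual Sudoku-Bench data.

```python
# Hypothetical example of one trace event; not the actual trace schema.
trace_event = {
    "timestamp_s": 512.3,  # seconds into the source video
    "transcript": "So row four needs a 7, and it can only go here...",
    "sudokupad_actions": [
        {"type": "value", "cell": "r4c5", "value": 7},
    ],
}
```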
Technical details and implementation
Natural language encoding
Unlike benchmarks that use compact tokenized representations, Sudoku-Bench requires natural language to encode and represent each puzzle. The rules and constraints of each Sudoku variant are specified in natural language, making the benchmark inherently suited for LLM evaluation.
This design choice reflects the reality that creative problem-solving often requires understanding complex, nuanced instructions that cannot be easily reduced to formal logic or simple patterns.
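The following is an invented example (not an actual Sudoku-Bench record) of how a variant's rules and board might be encoded as natural-language text plus a simple grid.

```python
# Illustrative only: an invented encoding of a variant puzzle as
# natural-language rules plus a text grid.
puzzle = {
    "rules": (
        "Normal 6x6 Sudoku rules apply: each row, column, and 2x3 box "
        "contains the digits 1-6. Digits along each marked thermometer "
        "must strictly increase from the bulb end."
    ),
    # '.' marks an empty cell; rows are listed top to bottom.
    "board": [
        ". . . 4 . .",
        ". 2 . . . .",
        ". . . . 5 .",
        ". 3 . . . .",
        ". . . . 1 .",
        ". . 6 . . .",
    ],
}
```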
Evaluation without tools
A key design principle of Sudoku-Bench is that models must solve puzzles without tool use or code execution. This constraint ensures that the evaluation measures pure reasoning ability rather than computational power or access to external solving algorithms.
Models are evaluated based solely on their ability to understand the puzzle constraints, reason through the problem space, and produce correct solutions using only their language understanding and reasoning capabilities.
Example prompts
The benchmark provides example prompts for each puzzle, making it easy for researchers to evaluate their models. These prompts are designed to be clear and comprehensive, providing all necessary information about the puzzle rules and constraints while leaving the reasoning and solution to the model.
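In that spirit, a prompt typically needs to convey the rules, the current board, and the expected answer format. The skeleton below is a hypothetical illustration that reuses the invented puzzle dict from the earlier sketch; the benchmark's actual prompts may differ in wording and required output format.

```python
# Hypothetical prompt skeleton; not the benchmark's actual prompt text.
PROMPT_TEMPLATE = """You are solving a Sudoku variant.

Rules:
{rules}

Current board ('.' = empty cell):
{board}

Reason step by step, then output the completed grid as {size} rows of
digits, one row per line, and nothing else."""

prompt = PROMPT_TEMPLATE.format(
    rules=puzzle["rules"],
    board="\n".join(puzzle["board"]),
    size=6,
)
```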
Research implications
Advancing reasoning evaluation
Sudoku-Bench represents a significant advancement in reasoning evaluation by focusing on creative problem-solving rather than pattern matching or memorization. The benchmark's emphasis on "eureka moments" provides a more accurate measure of whether models can truly reason like humans.
The constantly evolving nature of Sudoku variants ensures that the benchmark will remain relevant as models improve, since puzzle authors will continue to create new challenges that require novel reasoning approaches.
Understanding model limitations
By forcing models to solve puzzles without tools, Sudoku-Bench helps researchers understand the inherent limitations of current LLM reasoning capabilities. Models that perform well on the benchmark demonstrate genuine reasoning ability, while those that struggle reveal areas where further research is needed.
The benchmark's design also helps identify whether models are relying on memorization or pattern matching rather than true reasoning, since each puzzle is unique and requires creative insight.
Future directions
The researchers acknowledge that they do not currently have a private test set, though they may consider one in the future. However, they note that dozens of high-quality Sudoku variants are published daily on the web, so any puzzle after an LLM's training cutoff date can be considered out-of-domain for testing purposes.
This approach allows for continuous evaluation as new puzzles are created, ensuring that the benchmark remains challenging even as models improve.
Why it matters for AI development
Testing genuine reasoning
Sudoku-Bench provides a rigorous test of whether AI systems can think creatively and solve problems that require genuine insight rather than computational brute force. This is essential for developing AI systems that can handle novel situations and adapt to new challenges.
The benchmark's focus on creative problem-solving aligns with the broader goal of developing more capable AI systems that can reason about complex, real-world problems lacking straightforward algorithmic solutions.
Benchmarking progress
As AI systems become more capable, benchmarks like Sudoku-Bench help track progress in reasoning capabilities. The public leaderboard allows researchers to compare different models and approaches, fostering healthy competition and driving innovation in reasoning research.
The benchmark's design ensures that improvements in performance reflect genuine advances in reasoning ability rather than simply better training data or larger models.
Real-world applications
While Sudoku variants may seem like abstract puzzles, the reasoning capabilities they test are directly relevant to real-world applications. Creative problem-solving, meta-reasoning, and the ability to see non-obvious connections between constraints are essential for many practical AI applications, from scientific discovery to engineering design.
Conclusion
Sakana AI's Sudoku-Bench represents a significant step forward in evaluating AI reasoning capabilities. By focusing on creative problem-solving through unique Sudoku variants, the benchmark tests whether LLMs can think like humans using meta-reasoning and creativity rather than brute-force computational approaches.
The benchmark's design—prohibiting tool use, emphasizing creative insights, and featuring constantly evolving puzzles—ensures that it measures genuine reasoning ability rather than memorization or pattern matching. With 100 carefully curated puzzles across multiple grid sizes and two evaluation modes, Sudoku-Bench provides a comprehensive assessment of model reasoning capabilities.
As AI systems continue to advance, benchmarks like Sudoku-Bench will play a crucial role in understanding the true extent of model reasoning abilities and identifying areas where further research is needed. The public availability of the dataset and leaderboard ensures that the research community can benefit from this valuable evaluation tool.
Explore more about AI reasoning, AI benchmarks, and creative problem-solving in our Glossary, and learn about other AI research developments in our Blog.