Sakana AI Launches Sudoku-Bench for AI Reasoning

Sakana AI introduces Sudoku-Bench, a creative reasoning benchmark testing human-like problem-solving through Sudoku variants without tool use.

by HowAIWorks Team
Sakana AI, Sudoku-Bench, AI Reasoning, AI Benchmark, Creative Problem Solving, LLM Evaluation, AI Research, Meta-Reasoning, Puzzle Solving, AI Testing

Introduction

Sakana AI has launched Sudoku-Bench, a novel reasoning evaluation dataset designed to test whether large language models (LLMs) can think like humans when solving creative problems. Unlike traditional benchmarks that can be mastered through training data, Sudoku-Bench focuses on measuring the "aha" or "eureka" moments that characterize genuine creative problem-solving.

The benchmark evaluates models on 100 unique Sudoku variant puzzles without allowing tool use or code execution, forcing LLMs to rely on meta-reasoning and creative insight rather than brute-force computational approaches. This makes Sudoku-Bench particularly valuable for assessing reasoning capabilities that go beyond pattern matching and memorization.

Developed by researchers Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, and Llion Jones, Sudoku-Bench represents a significant step forward in AI reasoning evaluation. The benchmark is now available on Hugging Face and includes a public leaderboard tracking model performance across different puzzle sizes and evaluation modes.

What is Sudoku-Bench?

Core concept and design philosophy

Sudoku-Bench is built on the premise that Sudoku variants are unique and creative puzzles that can reveal whether an LLM possesses human-like reasoning capabilities. Each puzzle in the benchmark is designed to be seemingly unbreakable at first glance—even to expert solvers—but admits progress after a creative discovery is made.

The benchmark explicitly prohibits tool use and code execution, requiring models to solve puzzles through pure reasoning and natural language understanding. This design choice ensures that the evaluation measures genuine problem-solving ability rather than computational brute-force approaches.

Puzzle composition

The benchmark includes 100 carefully curated puzzles distributed across three grid sizes:

  • 15 4x4 puzzles: Smaller grids that test fundamental reasoning patterns
  • 15 6x6 puzzles: Medium complexity puzzles requiring intermediate reasoning skills
  • 70 9x9 puzzles: Full-size puzzles that demand sophisticated problem-solving strategies

Each puzzle is unique, either featuring a novel ruleset or requiring a solving tactic never seen before. This diversity makes the domain of Sudoku variants more resistant to memorization than other benchmarks, similar in spirit to the ARC-AGI benchmark but with key differences in how constraints are presented and solved.

Evaluation methodology

Two evaluation modes

Sudoku-Bench supports two distinct evaluation configurations:

Single-Shot Mode

In Single-Shot mode, the LLM attempts to solve the entire puzzle grid in one response. This mode tests the model's ability to reason through the complete problem space and produce a correct solution without iterative refinement.
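
The snippet below is a minimal sketch of a Single-Shot check. The `call_model` helper and the prompt wording are invented for illustration; Sudoku-Bench's actual prompt and answer formats may differ.

```python
def check_single_shot(rules: str, initial_board: str, solution: str, call_model) -> bool:
    """Single-Shot mode sketch: ask for the full solution grid in one response.

    `call_model` is a hypothetical helper that sends a prompt to an LLM and
    returns its final answer. `initial_board` and `solution` are flat digit
    strings ('.' marks an empty cell in the initial board).
    """
    prompt = (
        "Solve this Sudoku variant using only reasoning (no tools or code).\n"
        f"Rules: {rules}\n"
        f"Initial board, row by row ('.' = empty): {initial_board}\n"
        "Reply with the completed grid as one string of digits."
    )
    answer = call_model(prompt).strip().replace(" ", "").replace("\n", "")
    return answer == solution  # counted as solved only if the full grid matches
```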

Multi-Step Mode

In Multi-Step mode, the LLM is prompted to provide one or more cell placements in each turn. The user displays the updated board, and the interaction continues until the LLM solves the puzzle or makes an incorrect move. This mode allows for iterative reasoning and tests the model's ability to make incremental progress.
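
A rough sketch of the Multi-Step loop is shown below, again using hypothetical helpers (`call_model`, `parse_placements`) rather than the benchmark's real harness: the model proposes placements, the harness applies them to the board, and the run ends on a wrong placement, a completed grid, or a step limit.

```python
def run_multi_step(rules, board, solution, call_model, parse_placements, max_steps=100):
    """Multi-Step mode sketch: the model places one or more cells per turn.

    `board` and `solution` map (row, col) -> digit; empty cells are absent
    from `board`. `parse_placements` is a hypothetical parser that extracts
    (row, col, digit) triples from the model's reply.
    """
    correct_placements = 0
    for _ in range(max_steps):
        prompt = (
            f"Rules: {rules}\nCurrent board: {board}\n"
            "Give your next cell placement(s) as (row, col, digit)."
        )
        for row, col, digit in parse_placements(call_model(prompt)):
            if solution.get((row, col)) != digit:
                return correct_placements, False      # an incorrect move ends the run
            board[(row, col)] = digit
            correct_placements += 1
        if len(board) == len(solution):               # grid complete: puzzle solved
            return correct_placements, True
    return correct_placements, False                  # hit the step limit
```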

Performance metrics

The evaluation measures performance using two primary metrics:

  • Average Solve Rate (ASR): The percentage of puzzles for which the model produces the complete and correct final solution grid. This is the primary metric for overall success and measures the model's ability to reach valid solutions.

  • Average Correct Placements (ACP): Used specifically for Multi-Step mode, this metric tracks the average number of correct cell values placed before the puzzle is solved, an incorrect placement is made, or another termination condition (such as an API error or reaching a maximum number of steps) occurs.
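
As an illustration, ASR and ACP for a batch of runs could be aggregated roughly as follows; the field names are invented for this example, not taken from the Sudoku-Bench code.

```python
def aggregate_metrics(runs):
    """Compute Average Solve Rate (ASR) and Average Correct Placements (ACP).

    `runs` is a list of dicts with invented keys:
      - 'solved': True if the final grid was complete and correct
      - 'correct_placements': correct cells placed before termination
    """
    asr = sum(r["solved"] for r in runs) / len(runs)
    acp = sum(r["correct_placements"] for r in runs) / len(runs)
    return {"ASR": asr, "ACP": acp}

# Example: two solved puzzles out of three
runs = [
    {"solved": True, "correct_placements": 81},
    {"solved": False, "correct_placements": 12},
    {"solved": True, "correct_placements": 36},
]
print(aggregate_metrics(runs))  # {'ASR': 0.666..., 'ACP': 43.0}
```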

Leaderboard and transparency

Sakana AI maintains a public leaderboard at pub.sakana.ai/sudoku/ that tracks model performance across different puzzle sizes and evaluation modes. The leaderboard shows performance for both Single-Shot and Multi-Step configurations, allowing researchers to compare models across different reasoning approaches.

Why Sudoku-Bench matters

Measuring creative reasoning

Traditional reasoning benchmarks often focus on domains that can be readily mastered through sufficient training data. However, Sudoku variants are constantly evolving, with puzzle authors creating new rulesets and requiring novel solving tactics. This makes the domain inherently resistant to memorization and forces models to demonstrate genuine reasoning capabilities.

The benchmark is designed to measure the "eureka moment"—that critical insight where a solver realizes how constraints interact in non-obvious ways to create intermediate results and ultimately a break-in to the solution. Such insights are difficult to find in standard benchmarks but are essential for evaluating true reasoning ability.

Comparison with other benchmarks

Sudoku-Bench differs from other reasoning benchmarks in important ways:

Compared to ARC-AGI:

  • ARC-AGI puzzles present a few examples and ask solvers to discover the underlying constraints, after which execution is often straightforward
  • In Sudoku variants, all constraints are explicitly given as part of the puzzle, but a direct application of each rule in isolation typically yields no progress
  • A creative process is required to see how constraints interact in non-obvious ways

Compared to vanilla Sudoku:

  • Standard Sudoku puzzles have a single, compact tokenized representation that applies to all training and test samples
  • Sudoku variants are so varied that natural language is needed to encode and represent each puzzle
  • The rules and constraints of each variant are specified in natural language, making the benchmark best suited for LLM evaluation

Resistance to memorization

The constantly evolving nature of Sudoku variants means that each puzzle is unique—either through a unique ruleset or by requiring a solving tactic never seen before. This makes the domain more resistant to memorization compared to other benchmarks, ensuring that performance reflects genuine reasoning ability rather than training data overlap.

Dataset structure and availability

Challenge subsets

The Sudoku-Bench dataset includes multiple subsets:

  • challenge_100: The main evaluation set of 100 puzzles used for the leaderboard
  • nikoli_100: 100 standard Sudoku puzzles hand-crafted by Nikoli, the Japanese puzzle company
  • ctc: 2,565 puzzles from the Cracking the Cryptic YouTube channel
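
If the dataset follows the usual Hugging Face layout, these subsets can likely be loaded with the `datasets` library as sketched below; the repository id ("SakanaAI/Sudoku-Bench") and configuration names are assumptions to verify on the dataset page.

```python
from datasets import load_dataset

# Assumed repository id and subset (configuration) names.
challenge = load_dataset("SakanaAI/Sudoku-Bench", "challenge_100")
nikoli = load_dataset("SakanaAI/Sudoku-Bench", "nikoli_100")
ctc = load_dataset("SakanaAI/Sudoku-Bench", "ctc")

# Show how many puzzles each subset contains, per split.
for name, subset in [("challenge_100", challenge), ("nikoli_100", nikoli), ("ctc", ctc)]:
    print(name, {split: ds.num_rows for split, ds in subset.items()})
```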

Text-only representation

All puzzles in Sudoku-Bench are presented in text-only format, making them accessible to language models without requiring vision capabilities. The dataset is also naturally applicable to vision-language models (VLMs), which may be required for puzzles whose visual elements are too complex to represent in text.

The benchmark explicitly selected the 100 puzzles of challenge_100 to be ones that admit a text-only representation, ensuring broad accessibility while maintaining evaluation rigor.

Integration with SudokuPad

The dataset includes tools for integration with SudokuPad, an application created by Sven Neumann. This integration allows for more interactive puzzle solving, including features like pencil marks and color-coding cells, which can be particularly useful for vision-language models.

Reasoning traces from Cracking the Cryptic

The dataset includes thousands of hours of reasoning traces extracted directly from Cracking the Cryptic YouTube videos, including:

  • Text transcriptions of human reasoning
  • Sequences of solver actions in SudokuPad

These traces provide valuable training and evaluation data for understanding how humans approach creative problem-solving in Sudoku variants.

Technical details and implementation

Natural language encoding

Unlike benchmarks that use compact tokenized representations, Sudoku-Bench requires natural language to encode and represent each puzzle. The rules and constraints of each Sudoku variant are specified in natural language, making the benchmark inherently suited for LLM evaluation.

This design choice reflects the reality that creative problem-solving often requires understanding complex, nuanced instructions that cannot be easily reduced to formal logic or simple patterns.
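
To make the contrast concrete, the snippet below compares a compact string encoding of a vanilla Sudoku with an invented natural-language ruleset for a variant; the wording is illustrative and not the benchmark's actual prompt format.

```python
# Vanilla Sudoku: one fixed, compact representation works for every puzzle.
vanilla = "530070000600195000098000060800060003400803001700020006060000280000419005000080079"

# A Sudoku variant: the ruleset itself must be written out in natural language,
# and it changes from puzzle to puzzle. (Invented example ruleset.)
variant_rules = """
Normal 9x9 Sudoku rules apply. In addition:
- Cells joined by a white dot contain consecutive digits.
- Digits along each marked thermometer strictly increase from the bulb.
- Digits may not repeat on either marked diagonal.
"""

print(len(vanilla), "characters encode the entire vanilla puzzle")
print(variant_rules.strip())
```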

Evaluation without tools

A key design principle of Sudoku-Bench is that models must solve puzzles without tool use or code execution. This constraint ensures that the evaluation measures pure reasoning ability rather than computational power or access to external solving algorithms.

Models are evaluated based solely on their ability to understand the puzzle constraints, reason through the problem space, and produce correct solutions using only their language understanding and reasoning capabilities.

Example prompts

The benchmark provides example prompts for each puzzle, making it easy for researchers to evaluate their models. These prompts are designed to be clear and comprehensive, providing all necessary information about the puzzle rules and constraints while leaving the reasoning and solution to the model.

Research implications

Advancing reasoning evaluation

Sudoku-Bench represents a significant advancement in reasoning evaluation by focusing on creative problem-solving rather than pattern matching or memorization. The benchmark's emphasis on "eureka moments" provides a more accurate measure of whether models can truly reason like humans.

The constantly evolving nature of Sudoku variants ensures that the benchmark will remain relevant as models improve, since puzzle authors will continue to create new challenges that require novel reasoning approaches.

Understanding model limitations

By forcing models to solve puzzles without tools, Sudoku-Bench helps researchers understand the inherent limitations of current LLM reasoning capabilities. Models that perform well on the benchmark demonstrate genuine reasoning ability, while those that struggle reveal areas where further research is needed.

The benchmark's design also helps identify whether models are relying on memorization or pattern matching rather than true reasoning, since each puzzle is unique and requires creative insight.

Future directions

The researchers acknowledge that they do not currently have a private test set, though they may consider one in the future. However, they note that dozens of high-quality Sudoku variants are published daily on the web, so any puzzle after an LLM's training cutoff date can be considered out-of-domain for testing purposes.

This approach allows for continuous evaluation as new puzzles are created, ensuring that the benchmark remains challenging even as models improve.

Why it matters for AI development

Testing genuine reasoning

Sudoku-Bench provides a rigorous test of whether AI systems can think creatively and solve problems that require genuine insight rather than computational brute force. This is essential for developing AI systems that can handle novel situations and adapt to new challenges.

The benchmark's focus on creative problem-solving aligns with goals of developing more capable AI systems that can reason about complex, real-world problems that don't have straightforward algorithmic solutions.

Benchmarking progress

As AI systems become more capable, benchmarks like Sudoku-Bench help track progress in reasoning capabilities. The public leaderboard allows researchers to compare different models and approaches, fostering healthy competition and driving innovation in reasoning research.

The benchmark's design ensures that improvements in performance reflect genuine advances in reasoning ability rather than simply better training data or larger models.

Real-world applications

While Sudoku variants may seem like abstract puzzles, the reasoning capabilities they test are directly relevant to real-world applications. Creative problem-solving, meta-reasoning, and the ability to see non-obvious connections between constraints are essential for many practical AI applications, from scientific discovery to engineering design.

Conclusion

Sakana AI's Sudoku-Bench represents a significant step forward in evaluating AI reasoning capabilities. By focusing on creative problem-solving through unique Sudoku variants, the benchmark tests whether LLMs can think like humans using meta-reasoning and creativity rather than brute-force computational approaches.

The benchmark's design—prohibiting tool use, emphasizing creative insights, and featuring constantly evolving puzzles—ensures that it measures genuine reasoning ability rather than memorization or pattern matching. With 100 carefully curated puzzles across multiple grid sizes and two evaluation modes, Sudoku-Bench provides a comprehensive assessment of model reasoning capabilities.

As AI systems continue to advance, benchmarks like Sudoku-Bench will play a crucial role in understanding the true extent of model reasoning abilities and identifying areas where further research is needed. The public availability of the dataset and leaderboard ensures that the research community can benefit from this valuable evaluation tool.

Explore more about AI reasoning, AI benchmarks, and creative problem-solving in our Glossary, and learn about other AI research developments in our Blog.

Frequently Asked Questions

What is Sudoku-Bench?
Sudoku-Bench is a reasoning evaluation dataset from Sakana AI that tests whether LLMs can think like humans, using meta-reasoning and creativity to solve unique Sudoku variant puzzles without relying on brute-force search or tool use.

How many puzzles does the benchmark include?
The benchmark includes 100 puzzles total: 15 4x4 puzzles, 15 6x6 puzzles, and 70 9x9 puzzles, all designed to test creative problem-solving abilities.

How does Sudoku-Bench differ from other reasoning benchmarks?
Sudoku-Bench measures "eureka moments" in creative problem-solving. Each puzzle is unique with evolving rulesets, making it resistant to memorization. Unlike ARC-AGI, all constraints are explicitly given, but creative insight is needed to see how they interact.

How are models evaluated?
Models are evaluated in two modes: Single-Shot (solving the entire puzzle in one response) and Multi-Step (providing one or more cell placements per turn). Metrics include Average Solve Rate (ASR) and Average Correct Placements (ACP).

Is Sudoku-Bench only for text-based models?
Sudoku-Bench is designed for LLMs, since each puzzle's rules and constraints are specified in natural language. The benchmark includes text-only representations, though it can also be used with vision-language models for puzzles with complex visual elements.

Where can I find the benchmark?
The full benchmark data is available on Hugging Face, and the code repository is on GitHub at github.com/SakanaAI/Sudoku-Bench. A leaderboard showing model performance is available at pub.sakana.ai/sudoku/.
