RAO-AGI
RAO-AGI is a symbolic reasoning benchmark that evaluates the tactical inference capabilities of Large Language Models. It uses serialized Connect Four board states to isolate spatial reasoning from deep tree search: every task is a single-step completion with a deterministic answer.
System Specification
The environment is a discrete grid of 7 columns × 6 rows. A state is serialized as a list of 6 strings, one per row, ordered from index 0 (top) to index 5 (bottom). Each string holds 7 characters, one per column, indexed from 0 (left) to 6 (right).
Representation
| Symbol | Meaning |
|---|---|
| A | Active agent (current turn) |
| B | Passive agent (adversary) |
| . | Null space (empty) |
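Because pieces fall to the lowest empty cell, a column index maps to exactly one landing cell. The sketch below illustrates this with the serialization described above; the helper names landing_row and drop are hypothetical and not part of the benchmark code.

```python
ROWS, COLS = 6, 7  # 6 rows (index 0 = top), 7 columns (index 0 = left)

def landing_row(board, col):
    """Return the row where a piece dropped into `col` comes to rest, or None if the column is full."""
    for row in range(ROWS - 1, -1, -1):  # scan from the bottom row (5) up to the top row (0)
        if board[row][col] == ".":
            return row
    return None

def drop(board, col, piece):
    """Return a new board with `piece` dropped into `col`; the input board is left unchanged."""
    row = landing_row(board, col)
    if row is None:
        raise ValueError(f"column {col} is full")
    new_board = list(board)
    new_board[row] = board[row][:col] + piece + board[row][col + 1:]
    return new_board

board = [
    ".......",
    ".......",
    ".......",
    "...AB..",
    "..ABB..",
    ".ABBB..",
]
print(landing_row(board, 4))          # 2
print("\n".join(drop(board, 4, "A")))
```

Returning a new list instead of mutating the input keeps the original task board reusable when several candidate moves are tried in turn.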
Data Schema
Each task is a JSON object. The solution field is withheld from evaluation instances to ensure objective benchmarking.
{
  "id": "train_003",
  "board": [
    ".......", // Row 0
    ".......",
    ".......",
    "...AB..",
    "..ABB..",
    ".ABBB.."  // Row 5
  ],
  "current_player": "A",
  "columns": ["0", "1", "2", "3", "4", "5", "6"],
  "solution": "4"
}
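A task file can be sanity-checked against this schema before use. The following is a minimal sketch, not the project's own loader: validate_task is a hypothetical helper, and the rules it enforces (either A or B allowed as current_player, columns always listing "0" through "6") are assumptions inferred from the sample above. The row comments are dropped because standard JSON parsers reject them.

```python
import json

ROWS, COLS = 6, 7
SYMBOLS = set("AB.")

def validate_task(task):
    """Return a list of problems found in `task`; an empty list means it looks well-formed."""
    problems = []
    board = task.get("board", [])
    if len(board) != ROWS:
        problems.append(f"expected {ROWS} rows, got {len(board)}")
    for i, row in enumerate(board):
        if len(row) != COLS:
            problems.append(f"row {i} has {len(row)} cells, expected {COLS}")
        if set(row) - SYMBOLS:
            problems.append(f"row {i} contains unknown symbols")
    if task.get("current_player") not in ("A", "B"):  # assumption: both values are legal
        problems.append("current_player must be 'A' or 'B'")
    columns = task.get("columns", [])
    if columns != [str(c) for c in range(COLS)]:  # assumption: always "0" through "6"
        problems.append("columns must list '0' through '6'")
    # Evaluation tasks omit "solution", so it is only checked when present.
    if "solution" in task and task["solution"] not in columns:
        problems.append("solution is not one of the listed columns")
    return problems

raw = """
{
  "id": "train_003",
  "board": [".......", ".......", ".......", "...AB..", "..ABB..", ".ABBB.."],
  "current_player": "A",
  "columns": ["0", "1", "2", "3", "4", "5", "6"],
  "solution": "4"
}
"""
task = json.loads(raw)
print(validate_task(task))  # []
```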
Dataset Composition
The benchmark consists of 100 tactical scenarios distributed across two splits, training and evaluation. Each scenario is a "forced" tactical situation: the correct answer is a single move that either wins immediately or blocks the opponent's immediate win.
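The win-or-block rule can be checked mechanically with a one-ply scan. The sketch below is an illustrative reference solver, not the benchmark's own generator or scorer; the names forced_move and wins_at are hypothetical, and preferring a win over a block is an assumption about how ties are resolved.

```python
ROWS, COLS = 6, 7
# Horizontal, vertical, and the two diagonal directions as (row step, column step).
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]

def landing_row(board, col):
    """Lowest empty row in `col`, or None if the column is full."""
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == ".":
            return row
    return None

def wins_at(board, row, col, piece):
    """True if placing `piece` on the empty cell (row, col) completes four in a row."""
    for dr, dc in DIRECTIONS:
        count = 1  # the piece being placed
        for sign in (1, -1):  # walk both ways along the line
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < ROWS and 0 <= c < COLS and board[r][c] == piece:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        if count >= 4:
            return True
    return False

def forced_move(board, player="A"):
    """Return the column (as a string) that wins for `player`, else one that blocks the opponent."""
    opponent = "B" if player == "A" else "A"
    for piece in (player, opponent):  # assumption: a winning move takes priority over a block
        for col in range(COLS):
            row = landing_row(board, col)
            if row is not None and wins_at(board, row, col, piece):
                return str(col)
    return None

board = [".......", ".......", ".......", "...AB..", "..ABB..", ".ABBB.."]
print(forced_move(board))  # 4
```

On the train_003 board, column 4 completes the diagonal through (5,1), (4,2), (3,3), and (2,4), which matches the published solution.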
Evaluation Harness
The eval/run_eval.py script runs standardized inference across multiple model providers.
It handles prompt templating, response parsing, and error handling.
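The source does not describe how responses are parsed. As one plausible approach, a parser can take the last standalone column digit in the model output, which tolerates chain-of-thought text that mentions other columns before settling on an answer. The function parse_column below is a hypothetical sketch under that assumption, not the logic used by run_eval.py.

```python
import re

# A digit 0-6 that is not part of a longer number (so "12" or "42" never matches).
COLUMN_PATTERN = re.compile(r"(?<!\d)[0-6](?!\d)")

def parse_column(response):
    """Return the last standalone column digit in `response` as a string, or None if there is none."""
    matches = COLUMN_PATTERN.findall(response)
    return matches[-1] if matches else None

print(parse_column("4"))                                              # 4
print(parse_column("Column 3 looks tempting, but 4 wins. Answer: 4"))  # 4
print(parse_column("I cannot decide."))                               # None
```

This heuristic can be misled by reasoning that ends on a row number or another stray digit, so a stricter answer format in the prompt would be more robust.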
CLI Reference
| Argument | Type | Description |
|---|---|---|
| --provider | choice | Target backend: ollama, groq, anthropic, openai. |
| --model | str | Specific model identifier (e.g., llama3.2). |
| --split | choice | Dataset partition: training or evaluation. |
| --tasks | int | Limit processing to the first N tasks. |
| --output | path | Path to save the resulting submission JSON. |
| --prompt | choice | Format: minimal (direct answer) or cot (reasoning). |
| --verbose | flag | Logs raw model responses to stderr. |
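For readers who want to see how such an interface is typically declared, the table above can be mirrored with argparse. This is a reconstruction from the table alone, not the actual contents of eval/run_eval.py: defaults and required flags are not documented in the source, so none are set here, and the output filename in the example is made up.

```python
import argparse
from pathlib import Path

def build_parser():
    """Build a parser that mirrors the documented flags (defaults are intentionally left unset)."""
    parser = argparse.ArgumentParser(prog="run_eval.py")
    parser.add_argument("--provider", choices=["ollama", "groq", "anthropic", "openai"],
                        help="Target backend.")
    parser.add_argument("--model", type=str, help="Specific model identifier.")
    parser.add_argument("--split", choices=["training", "evaluation"],
                        help="Dataset partition.")
    parser.add_argument("--tasks", type=int, help="Limit processing to the first N tasks.")
    parser.add_argument("--output", type=Path, help="Path to save the submission JSON.")
    parser.add_argument("--prompt", choices=["minimal", "cot"], help="Prompt format.")
    parser.add_argument("--verbose", action="store_true",
                        help="Log raw model responses to stderr.")
    return parser

# Example invocation; "submission.json" is a hypothetical filename.
args = build_parser().parse_args([
    "--provider", "ollama", "--model", "llama3.2", "--split", "training",
    "--tasks", "10", "--output", "submission.json", "--prompt", "cot", "--verbose",
])
print(args.provider, args.model, args.tasks, args.verbose)  # ollama llama3.2 10 True
```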
Submission & Scoring
Submissions are flat JSON objects mapping task IDs to column indices.
Scoring is performed by score.py, which calculates accuracy as a percentage of correct moves.
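The accuracy calculation can be sketched as follows. This is an illustration of the metric, not the contents of score.py: the function name accuracy, the task ID train_004 with its answer, and the choice to count missing answers as incorrect are all assumptions.

```python
def accuracy(submission, solutions):
    """Percentage of tasks in `solutions` whose submitted column matches the reference column."""
    if not solutions:
        return 0.0
    correct = sum(
        1
        for task_id, solution in solutions.items()
        # Compare as strings so that 4 and "4" are treated alike; missing answers count as wrong.
        if task_id in submission and str(submission[task_id]) == str(solution)
    )
    return 100.0 * correct / len(solutions)

# A flat submission object: task ID -> column index.
submission = {"train_003": "4", "train_004": "2"}   # train_004 is a made-up ID
solutions = {"train_003": "4", "train_004": "5"}    # reference answers (train_004 is made up)
print(accuracy(submission, solutions))  # 50.0
```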
Reference Visualizer
The following visualizer renders the symbolic representation as a tactical board. Anyone with basic knowledge of Connect Four should be able to achieve 100% accuracy on all tasks.
Fig 1. Sample State (Diagonal Win Opportunity at Column 4)