RAO-AGI

RAO-AGI is a symbolic reasoning benchmark that evaluates the tactical inference capabilities of Large Language Models. Each task presents a serialized Connect Four board state in which a single move decides the outcome, so the benchmark isolates spatial reasoning from deep tree search and focuses on single-step deterministic completions.

System Specification

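The environment is a discrete grid of 7 columns × 6 rows. States are serialized as a list of six strings, one per row, from index 0 (top) to index 5 (bottom). Each string has seven characters, and a character's position in the string is its column index (0 on the left, 6 on the right).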

Representation

A Active agent (current turn)
B Passive agent (adversary)
. Null space (empty)

Data Schema

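Tasks are defined as JSON objects. The "solution" field is withheld in evaluation instances to ensure objective benchmarking. The // comments in the example are annotations for the reader only; they are not valid JSON.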

{
  "id": "train_003",
  "board": [
    ".......", // Row 0
    ".......",
    ".......",
    "...AB..",
    "..ABB..",
    ".ABBB.."  // Row 5
  ],
  "current_player": "A",
  "columns": ["0", "1", "2", "3", "4", "5", "6"],
  "solution": "4"
}
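The example task can be checked mechanically. The sketch below is not part of the benchmark code: `wins_after_drop` is a hypothetical helper that assumes standard Connect Four rules (a piece falls to the lowest empty cell of its column, and four in a line wins). It confirms that column 4 is the only immediately winning move for player A on this board.

```python
import json

# The example task from above, with the explanatory comments removed
# so that it parses as strict JSON.
TASK_JSON = """
{
  "id": "train_003",
  "board": [
    ".......",
    ".......",
    ".......",
    "...AB..",
    "..ABB..",
    ".ABBB.."
  ],
  "current_player": "A",
  "columns": ["0", "1", "2", "3", "4", "5", "6"],
  "solution": "4"
}
"""


def wins_after_drop(board, col, player):
    """Return True if dropping `player` into `col` completes four in a line.

    Hypothetical helper; assumes standard Connect Four gravity.
    """
    rows, cols = len(board), len(board[0])
    # Find the lowest empty cell in the column (row 5 is the bottom).
    row = next((r for r in range(rows - 1, -1, -1) if board[r][col] == "."), None)
    if row is None:
        return False  # column is full, move is illegal
    grid = [list(line) for line in board]
    grid[row][col] = player
    # Check horizontal, vertical, and both diagonal directions.
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < rows and 0 <= c < cols and grid[r][c] == player:
                count += 1
                r += sign * dr
                c += sign * dc
        if count >= 4:
            return True
    return False


task = json.loads(TASK_JSON)
winning = [
    c for c in task["columns"]
    if wins_after_drop(task["board"], int(c), task["current_player"])
]
print(winning)  # ['4']
```

The piece dropped into column 4 lands on row 2 and completes the diagonal running from row 5, column 1 up to row 2, column 4.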

Dataset Composition

The benchmark consists of 100 tactical scenarios distributed across two splits. Each scenario represents a "forced" tactical situation requiring one move to win or block.

Training Tasks: 50

Evaluation Tasks: 50

Tactical Categories: 7

Random Baseline: approximately 14.3% (1 in 7, a uniform guess over the seven columns)

Evaluation Harness

The eval/run_eval.py script facilitates standardized inference across multiple model providers. It manages prompt templating, response parsing, and error handling.

CLI Reference

Argument	Type	Description
--provider	`choice`	Target backend: `ollama`, `groq`, `anthropic`, `openai`.
--model	`str`	Specific model identifier (e.g., `llama3.2`).
--split	`choice`	Dataset partition: `training` or `evaluation`.
--tasks	`int`	Limit processing to the first N tasks.
--output	`path`	Path to save the resulting submission JSON.
--prompt	`choice`	Format: `minimal` (direct answer) or `cot` (reasoning).
--verbose	`flag`	Logs raw model responses to `stderr`.

$ python eval/run_eval.py --provider groq --model llama-3.3-70b-versatile --output sub.json
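The README does not describe how the harness parses responses, so the following is only an illustrative sketch of one plausible approach. `parse_column` is a hypothetical function, and taking the last valid column number in the text is an assumption intended to cope with `cot` responses, where reasoning precedes the final answer.

```python
import re

# Column labels as they appear in the task schema.
VALID_COLUMNS = ("0", "1", "2", "3", "4", "5", "6")


def parse_column(response, columns=VALID_COLUMNS):
    """Return the last standalone column label found in `response`, or None.

    Hypothetical parser; not taken from eval/run_eval.py.
    """
    # Collect every standalone run of digits, then keep only valid labels.
    candidates = [tok for tok in re.findall(r"\b\d+\b", response) if tok in columns]
    return candidates[-1] if candidates else None


print(parse_column("4"))
print(parse_column("A has 3 on the diagonal, so the answer is column 4."))
print(parse_column("I am not sure."))
```

Returning `None` for unparseable output lets the caller decide whether to retry or record the task as incorrect.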

Submission & Scoring

Submissions are flat JSON objects mapping task IDs to column indices. Scoring is performed by score.py, which calculates accuracy as a percentage of correct moves.

$ python score.py sub.json --solutions-dir data/training
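As a rough illustration of the submission format and the accuracy calculation, the sketch below scores a flat mapping of task IDs to columns against known solutions. It is not the actual score.py: the `accuracy` function, the string comparison of column values, the treatment of missing answers as incorrect, and the task ID `train_004` are all assumptions made for the example.

```python
def accuracy(submission, solutions):
    """Return the percentage of tasks whose submitted column matches the solution.

    Hypothetical scorer; tasks missing from the submission count as incorrect.
    """
    if not solutions:
        return 0.0
    correct = sum(
        1
        for task_id, solution in solutions.items()
        # Compare as strings so that 4 and "4" are treated alike (an assumption).
        if task_id in submission and str(submission[task_id]) == str(solution)
    )
    return 100.0 * correct / len(solutions)


# A flat submission object: task ID -> column index.
submission = {"train_003": "4", "train_004": "2"}
# Known solutions; "train_004" and its answer are made up for illustration.
solutions = {"train_003": "4", "train_004": "5"}

print(accuracy(submission, solutions))  # 50.0
```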

Reference Visualizer

The following visualizer demonstrates the symbolic representation rendered as a tactical board. A person possessing basic knowledge of the game should achieve 100% accuracy on all tasks.

Fig 1. Sample State (Diagonal Win Opportunity at Column 4)
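For readers without access to the visualizer, a board can be approximated in plain text. The `render` function below is a hypothetical helper, not the reference visualizer; it spaces out the cells of each row and appends a line of column indices underneath.

```python
def render(board):
    """Return a text rendering of `board` with column indices on the last line."""
    lines = [" ".join(row) for row in board]
    # Append a footer so that each cell lines up with its column index.
    lines.append(" ".join(str(c) for c in range(len(board[0]))))
    return "\n".join(lines)


sample = [
    ".......",
    ".......",
    ".......",
    "...AB..",
    "..ABB..",
    ".ABBB..",
]
print(render(sample))
```

In the rendered sample state, the A pieces climb diagonally from column 1 to column 3, and the empty cell above the B stack in column 4 is the winning square.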