RAO-AGI

RAO-AGI is a symbolic reasoning benchmark that evaluates the tactical inference capabilities of Large Language Models. It uses serialized Connect Four board states to separate spatial reasoning from deep tree search: each task is a single-step, deterministic completion.

System Specification

The environment is a discrete grid of 7 columns × 6 rows. A state is serialized as a list of six 7-character strings, one per row, ordered from index 0 (top) to index 5 (bottom).
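
Because row 0 is the top of the grid, a piece dropped into a column comes to rest at the highest-indexed empty row. The snippet below is a minimal sketch of that lookup. It assumes "." marks an empty cell, as the sample task suggests, and the helper name landing_row is illustrative rather than part of the repository.

```python
from typing import List, Optional

ROWS, COLS = 6, 7


def landing_row(board: List[str], col: int) -> Optional[int]:
    """Return the row where a piece dropped into `col` comes to rest.

    Row 0 is the top and row 5 the bottom, so the scan starts at the
    bottom row and moves upwards. Returns None if the column is full.
    """
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == ".":  # assumption: "." means empty
            return row
    return None


board = [
    ".......",
    ".......",
    ".......",
    "...AB..",
    "..ABB..",
    ".ABBB..",
]
print(landing_row(board, 4))  # -> 2
print(landing_row(board, 0))  # -> 5
```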

Representation

Data Schema

Each task is a JSON object. Solutions are withheld from evaluation tasks to keep benchmarking objective.

{
  "id": "train_003",
  "board": [
    ".......",
    ".......",
    ".......",
    "...AB..",
    "..ABB..",
    ".ABBB.."
  ],
  "current_player": "A",
  "columns": ["0", "1", "2", "3", "4", "5", "6"],
  "solution": "4"
}
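
One way to sanity-check a task like this is to play the listed solution and test for four in a row. The following is a rough sketch under the assumptions that "." is an empty cell and that the solution is a 0-based column index counted from the left. The helpers drop and has_four are hypothetical names, not functions from the benchmark code.

```python
import json
from typing import List

# Step vectors (row, col): horizontal, vertical, and the two diagonals.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def drop(board: List[str], col: int, player: str) -> List[List[str]]:
    """Return a new grid with `player` dropped into `col` (bottom-up fill)."""
    grid = [list(row) for row in board]
    for r in range(len(grid) - 1, -1, -1):
        if grid[r][col] == ".":
            grid[r][col] = player
            return grid
    raise ValueError(f"column {col} is full")


def has_four(grid: List[List[str]], player: str) -> bool:
    """True if `player` owns four consecutive cells in any direction."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(
                    0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == player
                    for rr, cc in cells
                ):
                    return True
    return False


task = json.loads("""
{
  "id": "train_003",
  "board": [".......", ".......", ".......", "...AB..", "..ABB..", ".ABBB.."],
  "current_player": "A",
  "columns": ["0", "1", "2", "3", "4", "5", "6"],
  "solution": "4"
}
""")

grid = drop(task["board"], int(task["solution"]), task["current_player"])
print(has_four(grid, task["current_player"]))  # -> True
```

Dropping "A" into column 4 lands on row 2 and completes the diagonal running from row 5, column 1 up to row 2, column 4.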

Dataset Composition

The benchmark consists of 100 tactical scenarios distributed across two splits. Each scenario represents a "forced" tactical situation requiring one move to win or block.

Training Tasks: 50
Evaluation Tasks: 50
Tactical Categories: 7
Random Baseline: 14.3% (1 in 7 columns)
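
The random baseline corresponds to guessing uniformly among the seven columns. As a quick check of the arithmetic, assuming every column is playable and each task has exactly one correct column:

```python
COLUMNS = 7

# Probability that a uniformly random column matches the single solution.
baseline = 1 / COLUMNS
print(f"{baseline:.2%}")  # -> 14.29%
```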

Evaluation Harness

The eval/run_eval.py script runs standardized inference across multiple model providers. It handles prompt templating, response parsing, and error handling.
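
To illustrate what prompt templating and response parsing might look like for this task format, here is a minimal sketch. It is not the code in eval/run_eval.py: the function names build_prompt and parse_column, the prompt wording, and the choice to take the last standalone digit in the response are all assumptions made for the example.

```python
import re
from typing import List, Optional


def build_prompt(board: List[str], player: str) -> str:
    """Render a task as a plain-text prompt (hypothetical wording)."""
    rows = "\n".join(board)
    return (
        "Connect Four board, top row first ('.' is empty):\n"
        f"{rows}\n"
        f"You are player {player}. Columns are numbered 0-6 from left to right.\n"
        "Reply with the single column number of the best move."
    )


def parse_column(response: str) -> Optional[str]:
    """Return the last standalone digit 0-6 in the response, or None.

    Taking the last match tolerates reasoning-style answers that mention
    other numbers before stating the final move.
    """
    matches = re.findall(r"(?<!\d)[0-6](?!\d)", response)
    return matches[-1] if matches else None


print(parse_column("The diagonal is open, so I play column 4."))  # -> 4
print(parse_column("no idea"))  # -> None
```

Returning None for unparseable output lets the caller record the task as unanswered instead of raising mid-run.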

CLI Reference

Argument     Type     Description
--provider   choice   Target backend: ollama, groq, anthropic, openai.
--model      str      Specific model identifier (e.g., llama3.2).
--split      choice   Dataset partition: training or evaluation.
--tasks      int      Limit processing to the first N tasks.
--output     path     Path to save the resulting submission JSON.
--prompt     choice   Format: minimal (direct answer) or cot (reasoning).
--verbose    flag     Logs raw model responses to stderr.

$ python eval/run_eval.py --provider groq --model llama-3.3-70b-versatile --output sub.json

Submission & Scoring

Submissions are flat JSON objects mapping task IDs to column indices. Scoring is performed by score.py, which calculates accuracy as a percentage of correct moves.

$ python score.py sub.json --solutions-dir data/training
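
The accuracy calculation can be sketched as follows. This is an illustration of the metric, not the contents of score.py. It assumes column indices are stored as strings (as in the "solution" field of the sample task) and that tasks missing from a submission count as incorrect. The IDs and answers other than train_003 are made-up placeholders.

```python
from typing import Dict


def accuracy(submission: Dict[str, str], solutions: Dict[str, str]) -> float:
    """Percentage of tasks whose submitted column matches the solution.

    Tasks absent from the submission are treated as incorrect (assumption).
    """
    if not solutions:
        return 0.0
    correct = sum(
        1
        for task_id, answer in solutions.items()
        if str(submission.get(task_id)) == str(answer)
    )
    return 100.0 * correct / len(solutions)


# Flat mapping of task ID to column index; placeholder values for illustration.
submission = {"train_001": "3", "train_002": "0", "train_003": "4"}
solutions = {"train_001": "3", "train_002": "5", "train_003": "4"}
print(f"{accuracy(submission, solutions):.1f}%")  # -> 66.7%
```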

Reference Visualizer

The following visualizer renders the symbolic representation as a game board. A person with basic knowledge of Connect Four should achieve 100% accuracy on all tasks.

Fig 1. Sample State (Diagonal Win Opportunity at Column 4)