Reimplementing Fogel’s Tic-Tac-Toe Paper

The goal of this section is to reimplement the paper Using Evolutionary Programming to Create Neural Networks that are Capable of Playing Tic-Tac-Toe by David Fogel in 1993. This paper preceded Blondie24, and many of the principles introduced in this paper were extended in Blondie24. As such, reimplementing ths paper provides good scaffolding as we work our way up to reimplement Blondie24.

The information needed to reimplement this paper is outlined below.

Neural Network Architecture

The neural net consists of the following layers:

  • Input Layer: $9$ linearly-activated nodes and $1$ bias node
  • Hidden Layer: $H$ sigmoidally-activated nodes and $1$ bias node ($H$ is variable, as will be described later)
  • Output Layer: $9$ sigmoidally-activated nodes


Converting Board to Input

A tic-tac-toe board is converted to input by flattening it into a vector and replacing $\textrm X$ with $1,$ empty squares with $0,$ and $\textrm O$ with $-1.$

For example, given a board

$\begin{align*} \begin{bmatrix} \textrm{X} & \textrm{O} & \square \\ \square & \textrm{X} & \textrm{O} \\ \square & \square & \square \end{bmatrix}, \end{align*}$


we first concatenate consecutive rows to flatten the board into the following $9$-element vector:

$\begin{align*} \left< \textrm{X}, \textrm{O}, \square, \square, \textrm{X}, \textrm{O}, \square, \square, \square \right>, \end{align*}$


Then, we replace $\textrm X$ with $1,$ empty squares with $0,$ and $\textrm O$ with $-1$ to get the final input vector:

$\begin{align*} \left< 1, -1, 0, 0, 1, -1, 0, 0, 0 \right> \end{align*}$


Converting Output to Action

The output layer consists of $9$ nodes, one for each board square. To convert the output values into an action, we do the following:

  1. Discard any values that correspond to a board square that has already been filled. (This will prevent illegal moves.)
  2. Identify the empty board square with the maximum value. We move into this square.

Evolution Procedure

The initial population consists of $50$ networks. In each network, the number of hidden nodes $H$ is randomly chosen from the range ${ 1, 2, \ldots, 10 }$ and the initial weights are randomly chosen from the range $[-0.5, 0.5].$


Replication

A network replicates by making a copy of itself and then modifying the copy as follows:

  • Each weight is incremented by $N(0,0.5),$ a value drawn from the normal distribution with mean $0$ and variance $0.05.$
  • With $0.5$ probability, we modify the network architecture by randomly choosing between adding or deleting a hidden node. If we add a node, then we initialize its associated weights with values of $0.$

Note that when modifying the architecture, we abort any decision that would lead the number of hidden nodes to exit the range ${ 1, 2, \ldots, 10 }.$ More specifically:

  • We abort the decision to delete a hidden node if the number of hidden nodes is $H=1.$
  • We abort the decision to add a hidden node if the number of hidden nodes is $H=10.$


Evaluation

In each generation, each network plays $32$ games against a near-perfect but beatable opponent. The evolving network is always allowed to move first, and receives a payoff of $+1$ for a win, $0$ for a tie, and $-10$ for a loss.

The near-perfect opponent follows the strategy below:


With 10% chance:
 - Randomly choose an open square to move into.

Otherwise:
 - If the next move can be chosen to win the game, do so.

 - Otherwise, if the next move can be chosen to block
   the opponent's win, do so.

 - Otherwise, if there are 2 open squares in a line with
   the opponent's marker, randomly move into one of those
   squares.

 - Otherwise, randomly choose an open square to move into.

Once the total payoff (over $32$ games against the near-perfect opponent) has been computed for each of the $50$ networks, a second round of evaluation occurs to select the networks that will proceed to the next generation and replicate.

In the second round of evaluation, each network is given a score that represents how its total payoff (from matchups with the near-perfect opponent) compares to the total payoffs of some other networks in the same generation. Specifically, each network is compared to $10$ other networks randomly chosen from the generation, and its score is incremented for each other network that has a lower total payoff.

The network(s) with the maximum score proceed to the next generation and replicate.

Performance Curve

Generate a performance curve as follows:

  1. Run the above procedure for $800$ generations, keeping track of the maximum total payoff (i.e. the best player's total payoff) at each generation.
  2. Then repeat $19$ more times, for a total of $20$ trials of $800$ generations each.
  3. Finally, plot the mean maximum total payoff (averaged over the $20$ trials) as a function of the number of generations.

The resulting curve should resemble the following shape:

icon


Leave a Comment