Recipe: Reinforcement Learning Tic-Tac-Toe




  • Easy

Time Requirements

  • Theory: 10 minutes
  • Writing code: 30 minutes (unless pre-written)
  • Results: 5 minutes
  • Conclusions: <1 minute


  • Terminal
  • Python (version 3)
  • Temporal difference equation
  • Exploration versus exploitation trade-off


Quick Theory (10 minutes)

Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximise reward. Temporal difference learning, one such technique, is applicable when the environment, or game, is deterministic: that means that the state you find yourself in after taking an action is solely dependent on the action taken.

Tic-tac-toe is one such game. Deciding to place an "X" or "O" in an empty square guarantees that the square now holds an "X" or "O". If there were a non-zero probability that a different empty square would gain the "X" or "O" instead, the game would be non-deterministic and Q-learning, a closely related technique, would be more appropriate.

The temporal difference learning equation is:

$$V(s) \leftarrow V(s) + \alpha (r + V(s') - V(s))$$

where $s$ is the current state, $s'$ is the next state, $\alpha$ is a fixed learning rate, $r$ is the reward received at state $s$ and $V(s)$ is the value of state $s$, corresponding to the expected future reward due to being in state $s$. This equation, used to update the estimated value of a state, is actually very intuitive: the updated value depends on the value of the next state compared with the current one. In short: if the next state is very favourable, then the current one should be somewhat favourable too, and vice-versa.

When training the bots in this way, it is important to consider the exploration versus exploitation trade-off. In an unknown environment, agents should first try out many different actions to explore the consequences, even if a seemingly good action has already been identified. When the environment becomes more familiar, that is to say the agent becomes more confident in the consequences of its actions, the agent should exploit its past experience and lean towards taking those actions with the best expected outcome.

To implement this, the bots follow the following decision policy during training:
$$\pi(s_i) = \frac{e^{V(s_i)/\tau}}{\sum_{s_j\in S'} e^{V(s_j)/\tau}}$$ where $s_i$, $s_j$ are states in the set $S'$ of possible next states, $\pi(s_i)$ is the probability of choosing the action which leads to state $s_i$ and $\tau$ is a variable which governs the exploration-vs-exploitation of the learning algorithm: high $\tau$ during early training leads to a more uniform probability distribution over possible actions; low $\tau$ in the later training stages gives high probability to choosing the action leading to the next state with the highest $V(s)$.

Finally, during testing or playing versus a human, the bots simply choose the action which leads to the next state with the highest $V(s)$.

Code (30 minutes)

Here's a script I made earlier.


In the code above, I chose to train the "X" and "O" bots against each other for 20000 games - which I estimate would take 2 humans 48 hours straight to do (prove me wrong, I dare you)! Each bot estimated the value $V(s)$ of approximately 5500 states.

The results are rather surprising: it turns out that, after training, the preferred first square for "X" is the central square!

I was genuinely surprised. But it's right there in the output file:

Fig 1: Training results output file showing the 9 possible starting states and their values.

An obvious improvement also comes to mind when scanning the output file: the symmetry of the game should mean that only three different values of $V(s)$ should arise for the first state achieved after "X"'s go: either a corner state, an edge state or the centre state. However, it is pleasing that the $V(s)$ for the two sets of 4 equivalent starting states are in rough agreement with each other.

When the bots are played against each other for 20000 test games (and the value estimates $V(s)$ continue to be updated), the picture is rather different: there is no difference between any first move for "X"! This is as expected, as optimal players/agents can and will draw playing tic-tac-toe. The test results file confirms this (NB: before this equilibrium is reached a few games are won by the bots as they exploit the imperfect training of each other).

Fig 2: Test results output file showing the 9 possible starting states and their values.

Fig 3: Plot of percentage of "X" wins, "O" wins and draws during the first 1000 of 20000 test games. Once equilibrium is reached, both bots play optimally and draw every time.


Tic-tac-toe is a deterministic game in which perfect players or agents can always draw. Temporal difference learning is shown to be applicable in training simple bots to play the game, and these bots perform as expected when exploiting the game environment given their past (training) experience.

I hope you enjoyed this recipe, thanks for reading! -M