### Recipe: Reinforcement Learning Tic-Tac-Toe

## Overview

### Difficulty

- Easy

### Time Requirements

- Theory: 10 minutes
- Writing code: 30 minutes (unless pre-written)
- Results: 5 minutes
- Conclusions: <1 minute

### Ingredients

- Terminal
- Python (version 3)
- Temporal difference equation
- Exploration versus exploitation trade-off

## Method

### Quick Theory (10 minutes)
**Reinforcement learning** is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximise reward.
**Temporal difference learning**, one such technique, is applicable when the environment, or game, is deterministic: that means that the state you find yourself in after taking an action is solely dependent on the action taken.

Tic-tac-toe is one such game. Deciding to place an "X" or "O" in an empty square guarantees that the square now holds an "X" or "O". If there were a non-zero probability that a different empty square would gain the "X" or "O" instead, the game would be non-deterministic and **Q-learning**, a closely related technique, would be more appropriate.

The temporal difference learning equation is:

$$V(s) \leftarrow V(s) + \alpha (r + V(s') - V(s))$$
where $s$ is the current state, $s'$ is the next state, $\alpha$ is a fixed learning rate, $r$ is the reward received at state $s$ and $V(s)$ is the value of state $s$, corresponding to the expected future reward due to being in state $s$. This equation, used to update the estimated value of a state, is actually very intuitive: the updated value depends on the value of the next state compared with the current one. In short: if the next state is very favourable, then the current one should be somewhat favourable too, and vice versa.
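As a concrete illustration (not the article's actual code), the update can be sketched in Python, with `values` a dictionary mapping states to their estimated $V(s)$ and unseen states defaulting to zero:

```python
# Hypothetical sketch of the temporal difference update.
def td_update(values, state, next_state, reward, alpha=0.1):
    v_s = values.get(state, 0.0)
    v_next = values.get(next_state, 0.0)
    # V(s) <- V(s) + alpha * (r + V(s') - V(s))
    values[state] = v_s + alpha * (reward + v_next - v_s)

values = {}
td_update(values, "s0", "s1", reward=1.0)
print(values["s0"])  # 0.1: V(s0) takes one alpha-sized step toward r + V(s')
```

Repeated over many games, these small steps propagate reward information backwards from winning and losing positions to the moves that led there.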

When training the bots in this way, it is important to consider the **exploration versus exploitation** trade-off. In an unknown environment, agents should first try out many different actions to explore the consequences, even if a seemingly good action has already been identified. When the environment becomes more familiar, that is to say the agent becomes more confident in the consequences of its actions, the agent should exploit its past experience and lean towards taking those actions with the best expected outcome.

To implement this, the bots use the following decision policy during training:

$$\pi(s_i) = \frac{e^{V(s_i)/\tau}}{\sum_{s_j\in S'} e^{V(s_j)/\tau}}$$
where $s_i$, $s_j$ are states in the set $S'$ of possible next states, $\pi(s_i)$ is the probability of choosing the action which leads to state $s_i$ and $\tau$ is a variable which governs the exploration-vs-exploitation of the learning algorithm: high $\tau$ during early training leads to a more uniform probability distribution over possible actions; low $\tau$ in the later training stages gives high probability to choosing the action leading to the next state with the highest $V(s)$.

Finally, during testing or playing versus a human, the bots simply choose the action which leads to the next state with the highest $V(s)$.
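Both policies can be sketched in Python. This is an illustrative sketch rather than the article's code; it assumes states are hashable and that unseen states default to a value of zero:

```python
import math
import random

def softmax_choice(next_states, values, tau):
    # pi(s_i) = exp(V(s_i)/tau) / sum_j exp(V(s_j)/tau)
    # random.choices normalises the weights, so no explicit division is needed.
    weights = [math.exp(values.get(s, 0.0) / tau) for s in next_states]
    return random.choices(next_states, weights=weights)[0]

def greedy_choice(next_states, values):
    # Testing / play versus a human: pick the next state with the highest V(s).
    return max(next_states, key=lambda s: values.get(s, 0.0))
```

With a high `tau` the exponentials flatten out and the choice is near-uniform; with a low `tau` nearly all the probability mass lands on the highest-valued state, so the softmax policy smoothly approaches the greedy one.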

### Code (30 minutes)

Here's a script I made earlier.
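A minimal self-play training loop along these lines might look like the sketch below. This is not the original script: the reward scheme (1 for a win, 0.5 for a draw, 0 for a loss) and the $\tau$ annealing schedule are my own assumptions.

```python
import math
import random

# Boards are 9-character strings, read row by row; " " marks an empty square.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def next_states(board, mark):
    # All boards reachable by placing `mark` in an empty square.
    return [board[:i] + mark + board[i + 1:]
            for i in range(9) if board[i] == " "]

def choose(states, values, tau):
    # Softmax over estimated state values (training policy).
    weights = [math.exp(values.get(s, 0.0) / tau) for s in states]
    return random.choices(states, weights=weights)[0]

def train(n_games=20000, alpha=0.1):
    values = {"X": {}, "O": {}}
    for g in range(n_games):
        tau = max(0.05, 1.0 - g / n_games)  # assumed schedule: anneal exploration
        board, mark = " " * 9, "X"
        last = {"X": None, "O": None}
        while True:
            board = choose(next_states(board, mark), values[mark], tau)
            prev = last[mark]
            if prev is not None:
                # Intermediate move: r = 0, so V(s) <- V(s) + alpha*(V(s') - V(s))
                v = values[mark]
                v[prev] = v.get(prev, 0.0) + alpha * (v.get(board, 0.0) - v.get(prev, 0.0))
            last[mark] = board
            w = winner(board)
            if w is not None or " " not in board:
                # Terminal update: pull each bot's last state toward its reward.
                for m in "XO":
                    r = 0.5 if w is None else (1.0 if w == m else 0.0)
                    s, v = last[m], values[m]
                    v[s] = v.get(s, 0.0) + alpha * (r - v.get(s, 0.0))
                break
            mark = "O" if mark == "X" else "X"
    return values
```

Each bot keeps its own value table, since a board that is good for "X" is bad for "O". After training, playing a move reduces to picking the next state with the highest learned $V(s)$.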

## Results
In the code above, I chose to train the "X" and "O" bots against each other for **20,000 games**, which I estimate would take two humans 48 hours straight to play (prove me wrong, I dare you)! Each bot estimated the value $V(s)$ of approximately **5,500 states**.

**The results are rather surprising**: it turns out that, after training, the preferred first square for "X" is the central square!

I was genuinely surprised. But it's right there in the output file:
