# Recipe

## Method

### Quick Theory (10 minutes)

Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximise reward. Temporal difference learning, one such technique, is applicable when the environment, or game, is deterministic: the state you reach after taking an action depends solely on the current state and the action taken. Tic-tac-toe is one such game. Deciding to place an "X" or "O" in an empty square guarantees that that square now holds the "X" or "O". If there were a non-zero probability that a different empty square would gain the "X" or "O" instead, the game would be non-deterministic and Q-learning, a closely related technique, would be more appropriate. The temporal difference learning update is:

$$V(s) \leftarrow V(s) + \alpha \left(r + V(s') - V(s)\right)$$

where $V(s)$ is the estimated value of the current state $s$, $s'$ is the state reached after the move, $r$ is the reward received, and $\alpha$ is the learning rate.
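As a minimal sketch of that update in Python (not the article's actual code: the board-as-string encoding, the `defaultdict` value table, and the learning rate of 0.1 are assumptions for illustration):

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)

# Value table: maps a board state (a 9-character string such as
# "X.O......") to V(s). Unseen states default to 0.0.
values = defaultdict(float)

def td_update(state, next_state, reward):
    """Apply V(s) <- V(s) + alpha * (r + V(s') - V(s))."""
    values[state] += ALPHA * (reward + values[next_state] - values[state])

# Example: "X" moves from an empty board into the centre square.
td_update(".........", "....X....", reward=0.0)
```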

## Results

In the code above, I chose to train the "X" and "O" bots against each other for 20,000 games, which I estimate would take two humans 48 hours of non-stop play (prove me wrong, I dare you)! Each bot estimated the value $V(s)$ of approximately 5,500 states.
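For readers who don't have the code above to hand, here is a rough sketch of what such a self-play training loop could look like. Everything in it, the board-as-string encoding, the epsilon-greedy move choice, and the win-only reward, is an assumption for illustration, not the article's implementation:

```python
import random
from collections import defaultdict

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WINS:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def train(n_games=20_000, alpha=0.1, epsilon=0.1):
    """Self-play: each bot keeps its own value table over board strings."""
    values = {"X": defaultdict(float), "O": defaultdict(float)}
    for _ in range(n_games):
        board, player = "." * 9, "X"
        while True:
            empties = [i for i, c in enumerate(board) if c == "."]
            # Epsilon-greedy: usually take the move leading to the
            # highest-valued next state, sometimes explore at random.
            if random.random() < epsilon:
                move = random.choice(empties)
            else:
                move = max(empties, key=lambda i:
                           values[player][board[:i] + player + board[i + 1:]])
            nxt = board[:move] + player + board[move + 1:]
            # Reward convention (assumed): +1 for a winning move, 0 otherwise.
            # A fuller version would also propagate a loss penalty
            # to the opponent's last state.
            r = 1.0 if winner(nxt) == player else 0.0
            values[player][board] += alpha * (
                r + values[player][nxt] - values[player][board])
            if r or "." not in nxt:  # game over: win or draw
                break
            board, player = nxt, "O" if player == "X" else "X"
    return values
```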

The results are rather surprising: after training, the preferred first square for "X" is the central square! I was genuinely surprised, but it's right there in the output file: