Recipe: Reinforcement Learning ConnectX




  • Easy

Time Requirements

  • Theory: 10 minutes
  • Writing code: 35 minutes for 2 servings (Connect3 and Connect4)
  • Results: 10 minutes
  • Conclusions: 2 minutes


  • Terminal
  • Python (version 3)
  • Temporal difference equation
  • Exploration versus exploitation trade-off


Quick Theory (10 minutes)

The theory is exactly the same as in my previous blog post.

Code (35 minutes)

In this recipe, two variants are considered: Connect3 on a 4x4 grid and Connect4 on the standard 6x7 grid. Note: the time and memory requirements are exponential in grid size and in number to connect, and so Connect4 requires more time to run and more memory to store its position valuations (don't know what these are? Read the theory). For example, 100k trials takes ~5mins to run and 1.1GB to store results.

Check my GitHub for a ConnectX script (handles Connect3 & 4).

Results (10 minutes)


Playing on a 4x4 grid means the bots have to be trained for longer than for the tic-tac-toe game in my last blog post. I found that training for 50k games and testing for 10k games leads to very good results. So good, in fact, that the bot that plays first can force a win every time! Don't believe me? Here's how:

Fig 1: Player 1 (red) can always force a win in Connect3 on a 4x4 board. Shown above are the variants in which player 2 delays the inevitable for as long as possible.

This is confirmed by the state valuations of the possible initial positions below. In the code I've set the reward for a won position as 100; that an initial state has exactly this as expected value means the bot has learned that it can win every time by playing that starting move.

Fig 2: Prospects look pretty good for player 1 (X)!

What about playing Connect3 on a larger board, I hear you ask? Easy: if player 2 does not play one of the above three variants, player 1 simply slots their next piece adjacent to their first and simultaneously creates two winning next moves, of which player 2 can only take away one.

So there you have it, another nifty party trick under the belt (if you go to those kind of parties).


Connect4 is altogether a different beast; as a 6x7 grid gives rise to many more possible games. This is where I ran into the limits of the computing power I have readily accessible: I trained the bots for 2 hours (2M games) and tested them for another hour (800k games). Any longer proved just too much for my laptop. Still, I thought, approximately 3M games played and 160M positions analysed must lead to a somewhat tricky bot to beat? I was very wrong.

I played the bot four times (granted, not a huge number) as both player 1 (twice) and player 2 (twice). All four games went my way. Humans 4 - 0 Skynet, I thought as I high-fived myself. To give it credit, the bot did quite well up to about 12 moves in, where it started reaching the edge of its experience (the states achieved weren't in its dictionary of {state : value}) leading to it playing random moves. When I thought about it, it became obvious why this was happening.

A lower bound for the number of possible states achieved after just 12 moves is:

$$7^{6} \times 6^{6} \approx \ 5.49 \times 10^{9}$$

which is far superior to the 160M states the bots had encountered during training and testing. The bots only managed to scratch the surface of even just the first stages of the game.

Conclusions (2 minutes)

Temporal difference learning works well on Connect3, whereby the bots learn the optimal strategy to win every time as player 1. Connect4, however, requires too much memory and training time on an average laptop for the method to work well. Little of the overall state space is searched, and the bots play random moves once a game moves outside the realm of their experience. This is a key and clear issue of this reinforcement learning method: although a bot can learn a winning state, it does not deduce the reason for its win. In effect, it assumes nothing about how wins come about and keeps open-minded even when every win it has encountered has been due to having four counters in a row. It believes, and continues to believe that any state whatsoever may bring about a reward. This flexibility unfortunately hinders bot performance in a game where only a small subspace of the total state space is ever visited during training.

Thanks for reading! -M


Popular Posts