Tic Tac Toe – A Monte Carlo Approach

Tic Tac Toe, a.k.a nought and crosses, is a very simple game that most of us would have played and mastered at a very young age. Just in case that you are one of the outliers who has never experienced this magnificent game, here is how it is played. It is a game for 2 players in a 3×3 grid. Each player takes turns to place his/her marker to occupy 1 of the 9 boxes. The aim of the game is to place three of your own markers in a horizontal, vertical, or diagonal row.


Looks like a very simple game, and it is! So how are we going to go about solving it? In traditional artificial intelligence, this game is regarded to have a small ‘search space’ thus a full tree search is feasible. What this means is that it is very possible for a modern day computer to map out all the possible legal moves and select the one that would lead to the highest probability of winning (or the lowest probability of losing depending on the optimisation criteria).  For this exact problem, a full tree search might be the best solution. But what about for a game that is more complex, such as Go? it would be impossible to map out every legal move. Therefore we need to consider other methods.

When we talk about Go and artificial intelligence, the first thing that would appear in most people’s mind is likely to be AlphaGo. AlphaGo is an artificial intelligence agent developed by Google’s Deepmind. This agent is so powerful that it is capable of defeating an opponent who currently ranks within the top 5 in the world – Lee Sedol. At a high level, AlphaGo achieved this by learning from past experiences rather than hard-coded rules. It first learned from the professional players, then repeatedly play against itself to generate even more data to learn from. This has inspired me to try implementing a similar strategy for my Tic Tac Toe game and measure the results.

For this experiment, I will try to use a Monte Carlo method to generate the necessary ‘data’. The computer plays a large number of games with random moves against itself. Board positions and end results were logged for each game. Dividing the number of appearance in victorious games by the total number of appearances, the winning probabilities of a board position can be calculated. The theory behind is that by experience, we know that placing the marker at certain positions would give us a bigger advantage. We learn about this by playing and observing a few games and compared the results. Similarly, by running a large number of simulations, the computer can also recognise positions that usually lead to a win.

Then a trained ‘smart’ computer can play against a human player and pick its move according to its experience (the winning probabilities calculated earlier). The diagram below shows what the computer actually ‘sees’ and how it ‘thinks’.

Screen Shot 2017-04-14 at 01.11.46

In order to determine how effective this strategy is, the A.I will be evaluated in a series of sample games. Each sample consists of 100 games. The difference between samples is the amount of training that the agent receives. The training number ranges from 0 (completely random games) to 23000 (games played using 23000 previous games as reference). However before we jump in and start training, It is always a good idea to see how the baseline game agent perform. This would help us understand the game better and notice any underlying mechanics. By conducting 100 samples of 100 random games each (total 10000 games played), a histogram of the means of win rates is plotted.


If we were to guess the win rate naively without testing, it would seem that 50% win rate for each player would be a reasonable estimate. However, as we see in the histogram above, this is not the case. The player with marker ‘X’ (who is the first-hand player) usually wins 60% of the game. The second player with marker ‘O’ only wins 30%. The remaining games were ties. By only looking at the diagram we can almost immediately deduce the results are statistically significant as the distributions barely overlap hence their means would not even come close. This is a very interesting insight and it shows that the game of Tic Tac Toe demonstrates a very significant first-hand advantage. The result also serves as a very important foundation for the upcoming tests.

In order to show that training is effective, the evaluation would be done using the method described below:

  1. Set player ‘X’ (First hand) to be a random agent and player ‘O’ (Second-hand) to be a smart agent. If player ‘O” wins more after training, it would be a stronger argument for the training effectiveness.
  2. Initial phrase training. There are 10 samples and the number of training games for each sample is different. They range from 100 to 1000. Then 100 testing games were played for each sample and the results were recorded
  3. Late phrase training. Similar to above but the number of training games ranges from 2000 to 23000 with intervals of 2000 games.

The diagram below shows the result for the initial phrase. There is very little effect up to 200 training games. However, from 300 training games onward, the smart agent started to overcome the natural disadvantage and resulted in 40% win rates; which are similar to the random agent with the advantage.


When we look at the test as a whole, we can see that as the number of training increases beyond 1000, the performance of the smart agent continued to improve up to a win rate of about 75% when it starts to oscillate. Another interesting fact to observe is that the number of ties seems to have increased slightly and oscillates at around 25%. The random agent on the other hand only won less than 5% of the game.


In conclusion, we can say that the Monte Carlo simulation method is relatively effective in learning how to play the game. However, the fact that the first-hand random agent can still win some games with high training cycles, shows there is a disadvantage of this method. The fact that Monte Carlo simulation being totally random means that it is possible for some board positions would not appear at all in the training samples. Hence it is impossible for the smart agent to learn from. Even if the board position did appear for a few times, the nature of random games can potentially make bad position to win or vice versa. This in turns confuses the smart agent and mislead its judgement. The other point to notice is how we optimise our agent. In this example, we only care about maximising winning probability. It is possible that a board position to have high possibilities for both winning and losing. Hence the agent, aiming for the win, did not realise that it would lose out in the next move.

This concludes my little project in this simple A.I set up and I hope you have enjoyed it.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s