In this post I’m going to explain a simple AI that can learn to win a game with no previous knowledge of the rules, goals, or decision process of its opponent.
The Game
Nim is a game where two players take turns picking up sticks from a board. Each player can take one, two, or three sticks per turn. In this version, the goal is to avoid taking the last stick.
This makes a nice example for tabular Q-Learning because there are a limited number of game states, the same actions are available in every state, and both states and actions can be represented as small integers. It’s also deterministic, so it’s easy to create an AI that learns the game well enough to win every time.
Q-Learning
Machine learning is a way of taking some information, and then “learning” something about it through math. For example, if you have the prices of a bunch of homes in an area, and their square footage, you could predict the value of another home without having to explicitly write out the relationship between size and value in your code.
Reinforcement learning is where an AI uses machine learning to make decisions in an environment. As an example, you could have a robot navigating a rocky landscape. It could keep track of its actions, like reaching a target location, or getting stuck, and you could assign “rewards” to those actions. After some time driving around the environment, it could figure out that driving up on a rock is bad, and driving toward the target is good. Eventually it could be trained to navigate the environment while avoiding obstacles.
The key parts of that to remember are: rewards, state, and action.
Q-Learning is a type of reinforcement learning where the AI keeps track of the “quality” of taking each action in each state. As the AI explores the environment, it updates the quality value of each state-action pair it tries, based on the quality value of the state it reaches and any reward it receives.
You can find more information about Q-Learning on page 131 of Sutton and Barto’s “Reinforcement Learning: An Introduction” (link to PDF on the authors’ site).
Nim.py
import random
import matplotlib.pyplot as plt
# Window size for rolling average in win rate graph
GRAPH_SMOOTHING = 5
# Number of sticks
GAME_SIZE = 10
# Max number of sticks that can be removed in a single turn
MAX_STICKS_REMOVABLE = 3
ACTION_SPACE = [i for i in range(1,MAX_STICKS_REMOVABLE + 1)]
Here I import pyplot to graph the win rate of the AI. I also set some default values. The action space is an array of possible moves, which is the same for any state in this game: remove one, two, or three sticks.
class Game:
    '''
    Keeps track of the game state.
    '''
    def __init__(self):
        '''
        Sets the initial number of sticks.
        '''
        self.sticks = GAME_SIZE

    def remove(self, count):
        '''
        Remove a number of sticks from the board.
        :param count: Number of sticks to remove.
        '''
        self.sticks -= count

    def getState(self):
        '''
        Converts the number of sticks to a "state id" for the learner.
        :return: state id.
        '''
        return self.sticks - 1

    def getActionSpace(self):
        '''
        :return: Possible actions in the game.
        '''
        return ACTION_SPACE

    def isOver(self):
        '''
        :return: True if all sticks have been removed and the game is "done".
        '''
        return self.sticks <= 0
This is the code for Nim itself. It keeps track of the state of the game, and makes the possible actions available to the learner.
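As a quick illustration of how the class behaves (this snippet isn’t part of Nim.py, just a sanity check):

game = Game()
print(game.getState())   # 9, because 10 sticks maps to state id 9
game.remove(3)
print(game.getState())   # 6, seven sticks remain
print(game.isOver())     # False, there are still sticks on the board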
class QLearner:
    def __init__(self):
        '''
        Initialize the Q-Table
        '''
        self.q = []
        for i in range(GAME_SIZE):
            self.q.append([0] * len(ACTION_SPACE))
Here’s the interesting stuff: the code for the Q-Learner. The constructor initializes a 2D array, with one dimension being every possible game state, and the other being every possible action.
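A quick look at what that constructor builds (again, not part of Nim.py):

learner = QLearner()
print(len(learner.q))   # 10 rows, one per game state
print(learner.q[0])     # [0, 0, 0], one value per action (remove 1, 2, or 3 sticks)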
class QLearner:
    ...
    def getMove(self, state):
        potential_actions = self.q[state]
        action_chosen = potential_actions.index(max(potential_actions))
        return action_chosen
This function takes a state and returns the index of the action with the highest Q-value in that state, which is the action currently expected to lead to the most reward.
In other circumstances you might want the learner to explore the environment, so it would choose between taking a random action and taking the highest-valued one. Typically that choice is made at random, weighted by a ratio that is reduced over time so the learner gradually settles on the optimal action.
In reinforcement learning this is called the “exploration vs. exploitation tradeoff”.
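For example, an epsilon-greedy version of getMove might look something like this (a sketch, not part of Nim.py; epsilon is a hypothetical parameter controlling how often the learner explores):

class QLearner:
    ...
    def getMove(self, state, epsilon=0.1):
        # With probability epsilon, explore: pick a random action index
        if random.random() < epsilon:
            return random.randrange(len(ACTION_SPACE))
        # Otherwise exploit: pick the action with the highest Q-value
        potential_actions = self.q[state]
        return potential_actions.index(max(potential_actions))

Here the purely greedy version works, in part because the random opponent already adds variety to the states the learner sees.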
class QLearner:
    ...
    def learn(self, state, action, new_state, reward, is_over):
        if is_over:
            self.q[state][action] = reward
        else:
            self.q[state][action] = reward + max(self.q[new_state])
This is the critical part, the code that learns. It considers the previous state, the action taken, the new state, the reward, and whether the game is over. Remember that not every action results in a reward, so this value may be zero.
If the game is over, it knows exactly what the value of taking an action in the previous state was: the value is the reward.
If the game is not over, then it sets the expected value of the action in the previous state to the reward, plus the highest value among the actions in the new state.
The values built up by this function eventually become a sort of “map” for the AI, where it uses this information about the value of different actions to follow a “trail” of states with the highest value.
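If you’re curious what that map looks like, you can print the table after training (not part of Nim.py):

for sticks, values in enumerate(learner.q, start=1):
    print(sticks, values)

Each row shows the learned value of removing one, two, or three sticks when that many sticks are left.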
The cool thing about this is that it doesn’t have built-in knowledge of the game. It doesn’t do any planning at the start of the game, it doesn’t know its opponent’s strategy, and it doesn’t even know what it needs to do to win. This means it’s general: you could use this exact same code for something else, like tic-tac-toe.
A couple of things have been left out of the learning code for simplicity. The first is a learning rate, which would look like this:
self.q[state][action] = self.q[state][action] + learning_rate * (reward + max(self.q[new_state]) - self.q[state][action])
This would mean that the Q-Table would be updated in smaller increments, which is useful for things that have some randomness to them. Since Nim is a deterministic game, Nim.py uses a learning rate of 1.
Also left out is a discount factor:
self.q[state][action] = self.q[state][action] + learning_rate * (reward + discount_factor * max(self.q[new_state]) - self.q[state][action])
This is useful if you want a Q-Learner to value earlier rewards more highly. For example, if you had a Q-Learner navigating a map, you could use a discount factor to train it to head for the nearest goals first.
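Putting both together, a version of learn with these two knobs might look something like this (a sketch, not part of Nim.py; learning_rate and discount_factor are hypothetical parameters whose defaults reproduce the original behavior):

class QLearner:
    ...
    def learn(self, state, action, new_state, reward, is_over,
              learning_rate=1.0, discount_factor=1.0):
        if is_over:
            # Terminal state: no future value beyond the reward itself
            target = reward
        else:
            # Reward plus the discounted best value reachable from the new state
            target = reward + discount_factor * max(self.q[new_state])
        # Move the current estimate part of the way toward the target
        self.q[state][action] += learning_rate * (target - self.q[state][action])

With both left at 1, this collapses back to the simpler update above.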
The code section above ends up looking quite similar to a Bellman equation for Q-Learning, which describes how the Q table is updated by new information in math terms.
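For reference, the standard Q-Learning update rule looks like this:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where s is the previous state, a is the action taken, r is the reward, s' is the new state, α is the learning rate, and γ is the discount factor. With α = 1 and γ = 1 it reduces to the update in learn above.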
class RandomOpponent:
    '''
    Opponent that picks a random move from the action space each turn.
    '''
    def __init__(self):
        return

    def getMove(self, state):
        return random.choice(Game().getActionSpace())

    def learn(self, state, action, new_state, reward, is_over):
        return


class HumanOpponent:
    '''
    Opponent that asks a person how many sticks to remove.
    '''
    def __init__(self):
        return

    def getMove(self, state):
        return int(input("Sticks to remove? "))

    def learn(self, state, action, new_state, reward, is_over):
        return
These are a couple of interchangeable opponents for the AI to play against. One makes random moves; the other lets you play against the AI yourself.
def runTrial(learner, number_of_trials, opponent_learner_type, verbose, make_graph, print_q_values):
    q_wins = []
    opponent = opponent_learner_type()
    for _ in range(number_of_trials):
        want_to_exit = False
        game = Game()
        state = game.getState()
        while True:
            reward = 0
            action = learner.getMove(state)
            move = game.getActionSpace()[action]
            game.remove(move)
            if game.isOver():
                reward = -1
                q_wins.append(0)
            else:
                opponent_move = opponent.getMove(game.getState())
                if (opponent_move == -1):
                    want_to_exit = True
                    break
                game.remove(opponent_move)
                if game.isOver():
                    reward = 1
                    q_wins.append(1)
            new_state = game.getState()
            learner.learn(state, action, new_state, reward, game.isOver())
            state = new_state
            if game.isOver():
                break
        if want_to_exit:
            break
This code runs the games. First it creates q_wins, which will have 1’s or 0’s, depending on whether the Q-learner won each game. Then it instantiates the opponent learner, which can be human or random.
Then there is a loop that runs for each game (also called a trial in the code). The game state is reset, and then another loop is started.
Inside that loop is the code that calls the Q-Learner’s decision and learning functions. The Q-Learner moves first, deciding how many sticks to pick up. If it picks up the final stick, it loses: the reward is negative and the opponent’s move is skipped. Otherwise the opponent takes their turn, and if the opponent picks up the last stick, the Q-Learner’s reward is positive.
After both players take their turns, the learner is given: the number of sticks it had before the turns, the action it took, the number of sticks it had after the turns, the reward, and whether the game is over.
That repeats until one of the players wins.
After the given number of games is played, a chart is generated that shows how frequently the Q-Learner won.
I’ve omitted some code that prints out what’s going on, and some code at the end of the function that generates a graph of the win rate.
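For completeness, here’s a rough sketch of what that omitted graphing code might look like, using GRAPH_SMOOTHING as the rolling-average window (the real code in the repository may differ, and plotWinRate is a hypothetical helper name):

def plotWinRate(q_wins):
    # Rolling average of wins over a window of GRAPH_SMOOTHING games
    win_rate = []
    for i in range(GRAPH_SMOOTHING, len(q_wins) + 1):
        window = q_wins[i - GRAPH_SMOOTHING:i]
        win_rate.append(sum(window) / GRAPH_SMOOTHING)
    plt.plot(win_rate)
    plt.xlabel("Game")
    plt.ylabel("Win rate")
    plt.show()

# Example: train against the random opponent for 1000 games with graphing on
runTrial(QLearner(), 1000, RandomOpponent, False, True, False)

In the full version, something like plotWinRate would be called at the end of runTrial when make_graph is True.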
The win rate is the proportion of wins in a rolling window through all the games played. Here’s an example chart showing how the Q-Learner starts out poorly, but eventually reaches a 100% win rate as it learns the game.
[Chart: rolling win rate of the Q-Learner over training games]
The Code
The complete source code for the Q Nim Learner can be found here.