Tic Tac Toe

Goals

With this, my main goal was to mess around with using straight numpy to build a basic tic tac toe agent, and to learn more about Q-learning along the way. This article helped me get started with the basics.

This was about a day-long project to get started, and the next steps are to make it user friendly and deploy it outside of a notebook environment! Cause let's be honest, notebooks aren't super user friendly.

In [ ]:
from copy import deepcopy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Agent Class

The main event of all this! Each agent is set up to track the moves it takes over the course of a game, and at the end it uses the game's outcome as a reward to update the value of every state it visited along the way.

Using Bellman Equation:

$\epsilon$: exploration rate - the chance that a random move is taken instead of the current best move.
$\alpha$: learning rate - how far each update moves a state's value toward the new target.
$\gamma$: discount factor - weights how much the future reward counts toward earlier states.

$$Q(S, A) = Q(S, A) + \alpha * (\gamma * reward - Q(S, A))$$

or in code:

self.states[move] += self.lr * (self.discount_factor * reward - self.states[move])

which is definitely a bit easier to read :)
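
To make this concrete, here's a quick numeric sketch (illustrative values only, using the default lr=0.2 and discount_factor=0.1, and a made-up state key) of how a single state's value gets nudged after a win:

states = {"some_board_hash": 0}             # unseen states start at 0
lr, discount_factor, reward = 0.2, 0.1, 1   # the agent just won this game
states["some_board_hash"] += lr * (discount_factor * reward - states["some_board_hash"])
# states["some_board_hash"] is now 0.02 -- nudged toward the discounted reward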

In [ ]:
class Agent:
  def __init__(self, symbol, lr=0.2, exp_rate=0.4, discount_factor=0.1):
    """
    initialize a player and associated symbol.
    learning rate,
    exploration rate,
    discount factor
    """
    self.symbol = symbol
    self.lr = lr
    self.exp_rate = exp_rate
    self.discount_factor = discount_factor
    self.states = {}
    self.moves_taken = []

  def _get_values(self, board_hash):
    """
    look up the stored value for a board state, defaulting to 0 for unseen states.
    """
    return self.states.get(board_hash, 0)

  def move(self, board):
    """
    choose next move
    """
    explore = np.random.uniform(0, 1) < self.exp_rate
    avail_idx = np.argwhere(board.board == 0)
    next_move = None

    if explore: # exploration: take a random available move
      pos_idx = np.random.randint(len(avail_idx))
      next_move = avail_idx[pos_idx]
    else: # exploitation: look through all available positions and take the best-valued move
      max_value = -999
      for pos in avail_idx:
        next_board = deepcopy(board)
        next_board.update_board(pos[0], pos[1], self.symbol)
        board_hash = next_board.get_hash()
        value = self._get_values(board_hash)

        if value > max_value:
          max_value = value
          next_move = pos

    return next_move

  def update_move_history(self, board_hash):
    """
    add move to front of the list.
    """
    self.moves_taken.insert(0, board_hash)

  def reward(self, reward):
    """
    update rewards at the end of the game for each state visited.

    Q(S,A) = Q(S,A) + alpha * (gamma * max_a Q(S',a) - Q(S,A))

    here the backed-up `reward` stands in for max_a Q(S',a): it starts as the
    game outcome and becomes the updated value of the following state as we
    walk back through the move history.
    """
    # move history is stored most-recent-first, so reward propagates backwards from the last move
    for move in self.moves_taken:
      if move not in self.states:
        self.states[move] = 0
      self.states[move] += self.lr * (self.discount_factor * reward - self.states[move])
      reward = self.states[move]

  def reset(self):
    """
    """
    self.moves_taken = []

Player class

A super basic class to hold the human player's symbol and make moves, with a win check after each one.

In [ ]:
class Player:
  def __init__(self, symbol):
    self.symbol = symbol

  def make_move(self, row, col, board):
    board.update_board(row, col, self.symbol)

    board.str_rep()
    board.check_win(3)

Gameboard

You can't play tic-tac-toe without a gameboard! And we definitely want to make sure the board has some basic representation so you don't have to keep track of a numpy array while playing.

Some basic functions we want:

  1. make the board! and track player symbols
  2. track what moves each player makes and update the board with that symbol appropriately.
  3. get the gameboard hash. the way the agent tracks what moves it has made is by storing a flattened, string representation of the gameboard at that time. we'll have the gameboard class itself take care of this (there's a quick sketch of what that looks like right after this list).
  4. who won? after every move we want to check if anyone won or if there was a tie. the agent won't really learn anything if it never knows what the outcome was.
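
For a sense of what that hash looks like, here's a quick standalone sketch using plain numpy (not the class itself, just the same flatten-and-stringify idea):

board = np.zeros((3, 3))
board[2, 0] = 1                # an X in the bottom-left corner
print(str(board.flatten()))    # '[0. 0. 0. 0. 0. 0. 1. 0. 0.]'
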
In [ ]:
class Gameboard:
  def __init__(self, n_rows=3, n_cols=3):
    self.n_rows = n_rows
    self.n_cols = n_cols
    self.board = np.zeros((self.n_rows, self.n_cols))
    self.symbols = {0: " ", 1: "X", -1: "O"}
    self.symbols_rev = {" ": 0, "X": 1, "O": -1}
    self.winner = None
    self.draw = None 
    self.hash = None

  def str_rep(self):
    """
    string representation of array, with symbols
    """
    for i in range(self.n_rows):
      row = "|"
      header = "|"
      for j in range(self.n_cols):
        header += "---|"
        row += f" {self.symbols[self.board[i, j]]} |"
      print(header)
      print(row)
    print(header)

  def get_hash(self):
    """
    return a string of the flattened board, used as the agent's state key
    """
    self.hash = str(self.board.flatten())
    return self.hash

  def update_board(self, row, col, sym):
    """
    update array with int representation of symbol if position is valid.
    """
    if not isinstance(sym, int):
      sym = self.symbols_rev[sym]

    if self.board[row, col] == 0:
      self.board[row, col] = sym
    else:
      print("move not valid, space not empty")

  def check_win(self, win_score):
    """
    check diagonals, rows, and columns for win_score (+ or -) or draw.
    """
    left_diag_score = np.sum(np.diag(self.board))
    right_diag_score = np.sum(np.diag(np.fliplr(self.board)))

    row_sums = np.sum(self.board, axis=1)
    col_sums = np.sum(self.board, axis=0)

    x_win =  win_score
    o_win = win_score * -1

    # check if x wins
    if x_win in row_sums or x_win in col_sums or x_win in [left_diag_score, right_diag_score]:
      self.winner = 1
      print("game over, X wins")
      return True

    # check if o wins
    if o_win in row_sums or o_win in col_sums or o_win in [left_diag_score, right_diag_score]:
      self.winner = -1
      print("game over, O wins")
      return True

    # check if draw
    if 0 not in self.board:
      self.draw = True
      self.winner = 0
      print("game over, draw")
      return False
    
    return False

  def reset(self):
    """
    back to square one
    """
    self.board = np.zeros((self.n_rows, self.n_cols))
    self.winner = None
    self.draw = None 
    self.hash = None

Helper functions

A couple of functions to help with training, as well as having the agent make a move when a person is playing against it.

To train, two agents simply play against each other for x rounds, and after each game the outcome is checked and each agent's states are updated and rewarded appropriately.

After this, it's time to play :)

In [ ]:
def train_agent(player1, player2, rounds):
  player1_wins = np.zeros(rounds)
  player2_wins = np.zeros(rounds)

  win_score = 3

  for i in range(rounds):
    board = Gameboard()
    symbol_map = board.symbols
    print(f"starting round {i+1}")
    while board.winner is None:
      for player in [player1, player2]:
        move = player.move(board)
        board.update_board(move[0], move[1], symbol_map[player.symbol])
        player.update_move_history(board.get_hash())
        winner = board.check_win(win_score)

        # in case player one fills last spot
        if board.winner is not None:
          break

    if winner and board.winner == player1.symbol:
      player1.reward(1)
      player2.reward(-1)
      player1_wins[i] = 1
    elif winner and board.winner == player2.symbol:
      player1.reward(-1)
      player2.reward(1)
      player2_wins[i] = 1
    elif not winner and board.winner == 0:
      player1.reward(-1)
      player2.reward(-1)

    player1.reset()
    player2.reset()

  return player1_wins, player2_wins

def agent_move(player, board):
  if board.winner is None:
    move = player.move(board)
    board.update_board(move[0], move[1], player.symbol)
    player.update_move_history(board.get_hash())

  board.str_rep()
  win = board.check_win(3)

  if win and board.winner == player.symbol:
    player.reward(1)
    player.reset()
  else:
    player.reward(-1)
    player.reset()

Let's Play!

Training

Starting out by training two agents to play against each other. For the initial training, they go for 100 rounds before I play against the first agent.

In [ ]:
player1 = Agent(symbol=1)
player2 = Agent(symbol=-1)

p1_wins, p2_wins = train_agent(player1, player2, 100)
In [ ]:
np.sum(p1_wins), np.sum(p2_wins)
Out[ ]:
(3148.0, 1375.0)
In [ ]:
np.sum(p1_wins) / (np.sum(p1_wins) + np.sum(p2_wins))
Out[ ]:
0.6959982312624364
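
matplotlib is imported above but hasn't been used yet; here's a quick (unexecuted) sketch for eyeballing how the wins accumulate over training:

plt.plot(np.cumsum(p1_wins), label="player 1 (X)")
plt.plot(np.cumsum(p2_wins), label="player 2 (O)")
plt.xlabel("round")
plt.ylabel("cumulative wins")
plt.legend()
plt.show()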

Me vs Agent

Currently a pretty manual process. I'm planning to automate this and set something up so a player can actually play against the agent, with more than just a string-represented gameboard.
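
Below I'll step through a game manually, but a rough, unexecuted sketch of the automated loop could look something like this, reusing the trained player1 from above (the row/col input format is just an assumption):

game = Gameboard()
me = Player(symbol=-1)

while game.winner is None:
  agent_move(player1, game)  # the trained agent (X) moves first
  if game.winner is not None:
    break
  row, col = (int(x) for x in input("your move (row col): ").split())
  me.make_move(row, col, game)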

In [ ]:
me = Player(symbol=-1)
game = Gameboard()
In [ ]:
agent_move(player1, game)
|---|---|---|
|   |   |   |
|---|---|---|
|   |   |   |
|---|---|---|
| X |   |   |
|---|---|---|
In [ ]:
me.make_move(1, 0, game)
|---|---|---|
|   |   |   |
|---|---|---|
| O |   |   |
|---|---|---|
| X |   |   |
|---|---|---|
In [ ]:
agent_move(player1, game)
|---|---|---|
|   |   |   |
|---|---|---|
| O |   |   |
|---|---|---|
| X |   | X |
|---|---|---|
In [ ]:
me.make_move(1, 1, game)
|---|---|---|
|   |   |   |
|---|---|---|
| O | O |   |
|---|---|---|
| X |   | X |
|---|---|---|
In [ ]:
agent_move(player1, game)
|---|---|---|
|   |   |   |
|---|---|---|
| O | O |   |
|---|---|---|
| X | X | X |
|---|---|---|
game over, X wins
In [ ]: