12 Reinforcement Learning
12.1 Demonstration: Tic-Tac-Toe
In this demonstration, we’ll develop a reinforcement learning agent that learns to play Tic-Tac-Toe using the Q-learning algorithm. We’ll start with an overview of the work plan and then present the code step by step, explaining each part in detail.
12.1.1 Work Plan Overview
1. Import Required Libraries: Import necessary Python libraries for the implementation.
2. Define the Default Q-Value Function: Create a function to initialize default Q-values for unseen states.
3. Implement the TicTacToe Game Class: Define the game environment, including the board, moves, and win conditions.
4. Implement the QLearningAgent Class: Develop the agent that will learn optimal strategies using Q-learning.
5. Define the Game Playing Function: Write a function to simulate games between the agent and an opponent.
6. Define the Training Function: Create a function to train the agent over multiple episodes.
7. Define the Evaluation Function: Assess the agent’s performance after training.
8. Enable Human Interaction: Allow a human player to play against the trained agent.
9. Main Function: Tie all components together and provide a user interface.
12.1.2 Code Implementation
Let’s go through the code step by step.
12.1.2.1 Import Required Libraries
We start by importing the necessary libraries.
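import numpy as np                     # numerical arrays for the Q-values
import random                          # random move selection for exploration and the opponent
from collections import defaultdict    # Q-table that creates default entries for unseen states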
12.1.2.2 Define the Default Q-Value Function
We define a function that returns a NumPy array of zeros, which initializes the Q-values for new states.
def default_q_value():
    return np.zeros(9)
This function ensures that every new state encountered by the agent has an initial Q-value of zero for each of the nine possible actions. Defining it as a named, module-level function (rather than a lambda) also keeps the defaultdict used for the Q-table picklable, which matters when we save the trained agent later.
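As a small illustration, looking up a state the agent has never seen simply returns a fresh array of nine zeros:
q_table = defaultdict(default_q_value)
empty_board = (' ',) * 9           # the empty board as a state tuple
print(q_table[empty_board])        # [0. 0. 0. 0. 0. 0. 0. 0. 0.]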
12.1.2.3 Implement the TicTacToe Game Class
We create a class to represent the Tic-Tac-Toe game environment.
class TicTacToe:
    def __init__(self):
        self.board = [' '] * 9
        self.current_winner = None

    def reset(self):
        self.board = [' '] * 9
        self.current_winner = None
        return self.get_state()

    def available_actions(self):
        return [i for i, spot in enumerate(self.board) if spot == ' ']

    def get_state(self):
        return tuple(self.board)

    def make_move(self, square, letter):
        if self.board[square] == ' ':
            self.board[square] = letter
            if self.winner(square, letter):
                self.current_winner = letter
            return True
        return False

    def winner(self, square, letter):
        # Check rows, columns, and diagonals for a win
        row_ind = square // 3
        row = self.board[row_ind*3:(row_ind+1)*3]
        if all(s == letter for s in row):
            return True
        col_ind = square % 3
        col = [self.board[col_ind+i*3] for i in range(3)]
        if all(s == letter for s in col):
            return True
        # Check diagonals
        if square % 2 == 0:
            diag1 = [self.board[i] for i in [0,4,8]]
            if all(s == letter for s in diag1):
                return True
            diag2 = [self.board[i] for i in [2,4,6]]
            if all(s == letter for s in diag2):
                return True
        return False

    def is_full(self):
        return ' ' not in self.board

    def print_board(self):
        # Helper function to print the board
        for row in [self.board[i*3:(i+1)*3] for i in range(3)]:
            print('| ' + ' | '.join(row) + ' |')

    def print_board_nums(self):
        # Helper function to show the number mapping to board positions
        number_board = [str(i) for i in range(9)]
        for row in [number_board[i*3:(i+1)*3] for i in range(3)]:
            print('| ' + ' | '.join(row) + ' |')
Explanation:
- __init__: Initializes the game board and sets the current winner to None.
- reset: Resets the board for a new game and returns the initial state.
- available_actions: Returns a list of indices where moves can be made.
- get_state: Returns a tuple representing the current state of the board.
- make_move: Places a letter (‘X’ or ‘O’) on the board if the move is valid.
- winner: Checks if the last move resulted in a win.
- is_full: Checks if the board is full, indicating a draw.
- print_board and print_board_nums: Helper methods to display the board and the numbering for positions.
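As a quick illustration (not part of the program above), the environment can be exercised directly:
env = TicTacToe()
env.make_move(4, 'X')            # X takes the center square
env.make_move(0, 'O')            # O takes the top-left corner
env.print_board()                # show the current 3x3 board
print(env.available_actions())   # indices of the remaining empty squares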
12.1.2.4 Implement the QLearningAgent Class
We define a class for the agent that will learn using Q-learning.
class QLearningAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=1.0, epsilon_decay=0.9995):
        self.q_table = defaultdict(default_q_value)
        self.alpha = alpha              # Learning rate
        self.gamma = gamma              # Discount factor
        self.epsilon = epsilon          # Exploration rate
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = 0.01         # Minimum exploration rate

    def choose_action(self, state, available_actions):
        # ε-greedy action selection
        if np.random.rand() < self.epsilon:
            return random.choice(available_actions)
        else:
            state_values = self.q_table[state]
            # Select action with highest Q-value among available actions
            q_values = [(action, state_values[action]) for action in available_actions]
            max_value = max(q_values, key=lambda x: x[1])[1]
            max_actions = [action for action, value in q_values if value == max_value]
            return random.choice(max_actions)

    def learn(self, state, action, reward, next_state, done):
        old_value = self.q_table[state][action]
        next_max = np.max(self.q_table[next_state]) if not done else 0
        # Q-learning update rule
        new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.gamma * next_max)
        self.q_table[state][action] = new_value
        # Decay the exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
Explanation:
- __init__: Initializes the Q-table and sets the hyperparameters for learning.
- choose_action: Implements the ε-greedy policy for choosing actions.
  - With probability \(\epsilon\), the agent explores by selecting a random action.
  - Otherwise, it exploits by choosing the action with the highest estimated Q-value (ties are broken at random).
- learn: Updates the Q-values based on the reward received and the maximum Q-value of the next state.
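For reference, the update implemented in learn is the standard Q-learning rule
\[
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right),
\]
where \(\alpha\) is the learning rate, \(\gamma\) the discount factor, \(r\) the reward, and \(s'\) the next state. When the episode ends (done is True), the \(\max_{a'} Q(s', a')\) term is taken to be zero.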
12.1.2.5 Define the Game Playing Function
We create a function to simulate a game between the agent and an opponent.
def play_game(agent, env, human_vs_agent=False):
    state = env.reset()
    done = False
    if human_vs_agent:
        print("Positions are as follows:")
        env.print_board_nums()
    current_player = 'X'  # Agent always plays 'X'

    while not done:
        if current_player == 'X':
            available_actions = env.available_actions()
            action = agent.choose_action(state, available_actions)
            env.make_move(action, 'X')
            next_state = env.get_state()
            if human_vs_agent:
                print("\nAgent's Move:")
                env.print_board()
            if env.current_winner == 'X':
                agent.learn(state, action, 1, next_state, True)
                if human_vs_agent:
                    print("Agent wins!")
                return 1  # Agent wins
            elif env.is_full():
                agent.learn(state, action, 0.5, next_state, True)
                if human_vs_agent:
                    print("It's a draw.")
                return 0.5  # Draw
            else:
                agent.learn(state, action, 0, next_state, False)
                state = next_state
                current_player = 'O'
        else:
            available_actions = env.available_actions()
            if human_vs_agent:
                valid_square = False
                while not valid_square:
                    user_input = input("Your move (0-8): ")
                    try:
                        action = int(user_input)
                        if action not in available_actions:
                            raise ValueError
                        valid_square = True
                    except ValueError:
                        print("Invalid move. Try again.")
                env.make_move(action, 'O')
                state = env.get_state()
            else:
                action = random.choice(available_actions)
                env.make_move(action, 'O')
            if env.current_winner == 'O':
                agent.learn(state, action, -1, env.get_state(), True)
                if human_vs_agent:
                    env.print_board()
                    print("You win!")
                return -1  # Agent loses
            elif env.is_full():
                agent.learn(state, action, 0.5, env.get_state(), True)
                if human_vs_agent:
                    print("It's a draw.")
                return 0.5  # Draw
            else:
                current_player = 'X'
Explanation:
- Game Loop: Alternates turns between the agent and the opponent (or human player).
- Agent’s Turn:
  - Chooses an action using the ε-greedy policy.
  - Updates the Q-table based on the outcome: +1 for a win, +0.5 for a draw, and 0 for a move that does not end the game.
- Opponent’s/Human’s Turn:
  - If human_vs_agent is True, prompts the human for input; otherwise, the opponent makes a random move.
  - The agent then updates its Q-table based on the outcome: -1 if the opponent wins, +0.5 for a draw.
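As an illustration, a single training game against the random opponent can be run directly; the return value encodes the outcome from the agent’s perspective:
agent = QLearningAgent()
env = TicTacToe()
result = play_game(agent, env)   # 1 = agent win, 0.5 = draw, -1 = agent loss
print(result)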
12.1.2.6 Define the Training Function
We define a function to train the agent over multiple episodes.
def train_agent(episodes=50000):
    agent = QLearningAgent()
    env = TicTacToe()
    for episode in range(episodes):
        play_game(agent, env)
        if (episode + 1) % 10000 == 0:
            print(f"Episode {episode + 1}/{episodes} completed.")
    return agent
Explanation:
- Initialization: Creates a new agent and game environment.
- Training Loop: The agent plays the game repeatedly to learn from experience.
- Progress Updates: Prints a message every 10,000 episodes to track training progress.
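For a quick sanity check before committing to a long run, one can, for example, train for a smaller number of episodes and inspect the agent’s state:
agent = train_agent(episodes=1000)
print(f"Exploration rate after training: {agent.epsilon:.3f}")
print(f"Number of states visited: {len(agent.q_table)}")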
12.1.2.7 Define the Evaluation Function
We create a function to evaluate the agent’s performance after training.
def evaluate_agent(agent, games=1000):
    env = TicTacToe()
    wins = 0
    draws = 0
    losses = 0
    for _ in range(games):
        result = play_game(agent, env)
        if result == 1:
            wins += 1
        elif result == 0.5:
            draws += 1
        else:
            losses += 1
    print(f"Out of {games} games: {wins} wins, {draws} draws, {losses} losses.")
Explanation:
- Evaluation Loop: The agent plays a specified number of games against the random opponent. (Note that play_game still calls agent.learn, so the Q-table continues to be updated and a small amount of residual exploration remains during evaluation.)
- Outcome Tracking: Records the number of wins, draws, and losses.
- Performance Display: Prints the results after evaluation.
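If a purely greedy evaluation is preferred, one simple variation (not shown in the code above) is to switch off exploration before calling the evaluation function:
agent.epsilon = 0.0   # act greedily with respect to the learned Q-values
evaluate_agent(agent, games=1000)
Even then, play_game will keep updating the Q-table during these games, since the learning step is built into the game loop.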
12.1.2.8 Enable Human Interaction
We create a function to allow a human to play against the agent.
def play_against_agent(agent):
    env = TicTacToe()
    play_game(agent, env, human_vs_agent=True)
12.1.2.9 Main Function
We define the main function to provide a user interface.
def main():
    print("Tic-Tac-Toe with Reinforcement Learning Agent")
    print("1. Train Agent")
    print("2. Evaluate Agent")
    print("3. Play Against Agent")
    choice = input("Select an option (1-3): ")

    if choice == '1':
        episodes = int(input("Enter number of training episodes: "))
        agent = train_agent(episodes)
        # Save the trained agent
        import pickle
        with open('trained_agent.pkl', 'wb') as f:
            pickle.dump(agent, f)
        print("Agent trained and saved as 'trained_agent.pkl'.")
    elif choice == '2':
        # Load the trained agent
        import pickle
        try:
            with open('trained_agent.pkl', 'rb') as f:
                agent = pickle.load(f)
            evaluate_agent(agent)
        except FileNotFoundError:
            print("No trained agent found. Please train the agent first.")
    elif choice == '3':
        # Load the trained agent
        import pickle
        try:
            with open('trained_agent.pkl', 'rb') as f:
                agent = pickle.load(f)
            play_against_agent(agent)
        except FileNotFoundError:
            print("No trained agent found. Please train the agent first.")
    else:
        print("Invalid option selected.")
12.1.2.10 Example Session
Training and Evaluating the Agent:
agent = train_agent(50000)
evaluate_agent(agent, games=1000)
Playing Against the Agent:
# play_against_agent(agent)