12  Advanced Topics

12.1 Web Scraping

This section was written by Melanie Desroches, a senior majoring in statistics and minoring in computer science. The goal of this section is to introduce web-scraping so that it can be utilized for data science. This will include what web-scraping is, how to web-scrape with Python using examples, and how to web-scrape ethically.

12.1.1 What is Web-Scraping

As data scientists, we often want to collect data from a variety of sources. In the age of the internet, much of the data we may want to collect is available on a website. However, this data is often not available in an easily downloadable format. This is where web-scraping becomes valuable. Web-scraping is an automated process used to gather data from websites, allowing us to access and collect large amounts of data directly from web pages when the information is not available for download.

Websites are primarily structured with HTML (Hypertext Markup Language), which organizes and displays content. Web scrapers parse this HTML code to identify and extract the relevant information. Therefore, it is important to have a basic understanding of HTML in order to identify which part of the website you are trying to scrape. The contents of a web page are broken up and identified by elements. Here are some examples of common elements that are important for web-scraping:

  • <body> : identifies the website body
  • <table> : identifies a table
  • <tbody> : identifies the body of the table
  • <tr> : identifies a row of a table

12.1.2 How to Web-Scrape with Python

There are many ways to web-scrape with Python. We will cover the two main packages, Beautiful Soup and Selenium.

12.1.2.1 Beautiful Soup

The Beautiful Soup Python Library simplifies the process of parsing and navigating HTML and XML documents, making it easier to extract data from websites. Beautiful Soup is ideal for scraping data from static websites. Static websites do not change based on user actions or require server-side interactions to update content dynamically. Basically, what you see is what you get. Static websites tend to be pretty simple so scraping from them is relatively easy.

Beautiful Soup can be installed by running

pip install beautifulsoup4

in your terminal.
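
As a quick illustration of how Beautiful Soup relates to the HTML elements listed in Section 12.1.1, the following minimal sketch parses a small HTML string defined inline (a made-up snippet, not a real page) and extracts the cells of each table row.

from bs4 import BeautifulSoup

# A tiny, made-up HTML document using the elements described above
html = """
<body>
  <table>
    <tbody>
      <tr><td>1st Precinct</td><td>16 Ericsson Place</td></tr>
      <tr><td>5th Precinct</td><td>19 Elizabeth Street</td></tr>
    </tbody>
  </table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find("table").find_all("tr"):  # each <tr> is one table row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)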

12.1.2.2 Selenium

Selenium is used for web browser automation and for dynamic websites. Dynamic sites often use backend programming to pull data from a database, customize it, and render it in real time based on user requests. This makes Selenium well suited to web-scraping tasks that involve navigating multiple pages or performing actions within those pages. Because dynamic websites tend to be more complex, you need a package like Selenium that is better equipped for that complexity.

Selenium can be installed by running

pip install selenium

in your terminal.

12.1.2.3 Web Driver

To control a web browser, Selenium also requires a WebDriver. We recommend Chrome Driver because it is cross-platform; follow the instructions for developers to set up your Chrome Driver.

After setting up Chrome Driver, you can check its availability from a terminal:

chromedriver --version

A commonly seen error on Mac is

Error: “chromedriver” cannot be opened because the developer cannot be verified. Unable to launch the chrome browser

This can be fixed by running:

xattr -d com.apple.quarantine $(which chromedriver)

See explanation from StackOverflow.
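
With chromedriver available on your PATH, a minimal Selenium sketch looks like the following; it simply loads a placeholder page headlessly and prints its title. Depending on your Selenium version, you may instead need to pass the driver path explicitly through a Service object, as in the example later in this section.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without opening a visible window
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder URL
print(driver.title)
driver.quit()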

12.1.2.4 Beautiful Soup vs Selenium

Both Beautiful Soup and Selenium are helpful tools in web-scraping. But they both have their strengths and weaknesses. Beautiful Soup is lightweight, easy to learn, and perfect for working with static HTML content. However, Beautiful Soup is more limited when it comes to dynamic websites, which are much more common nowadays. Selenium is better for interacting with dynamic web content that loads JavaScript or requires actions like clicking, scrolling, or filling forms. That said, Selenium can be slower and more resource-intensive since it opens a browser window to simulate real user actions.

12.1.2.5 A Step-by-Step Guide to Web-Scraping

  1. Find the website URL with the information you want to collect.
  2. Send an HTTP request to the URL and confirm you have access to the page. Generally, a status code in the 200-299 range means the request succeeded, while a code in the 400-499 range means the request was refused or invalid.
  3. Use the “Inspect” tool in your browser to identify the tags, classes, or elements associated with the data you want to extract. This can be done by right-clicking on the web page and choosing Inspect. If you hover your cursor over the different sections of HTML, the corresponding parts of the page become highlighted. Use this to find the element associated with the data that you want to scrape.
  4. Use a parsing library like Beautiful Soup or Selenium to process the HTML response. Beautiful Soup requires the requests package to send the request; Selenium sends it through the WebDriver.
  5. Clean and store the relevant information. A minimal sketch of steps 2 through 4 follows.
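
For a generic page, the basic pattern for steps 2 through 4 with requests and Beautiful Soup looks like the sketch below; the URL is a placeholder, and the real examples in the next section use actual pages.

import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-page"  # placeholder URL

response = requests.get(url)  # step 2: send the request
if 200 <= response.status_code < 300:  # request succeeded
    soup = BeautifulSoup(response.text, "html.parser")  # step 4: parse the HTML
    table = soup.find("table")  # step 3: element identified with Inspect
else:
    print(f"Failed to retrieve page: status code {response.status_code}")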

12.1.3 Examples using NYC Open Data

Since this class has used the NYC Open Data, let’s build on this data set in order to get some additional information that is not already available.

12.1.3.1 Beautiful Soup and NYPD Precincts

Say you want to get the addresses of all of the NYPD precincts in New York City. This information is available in table format on the NYPD website. Since the NYPD precincts rarely change, the website is static, making Beautiful Soup the best package for scraping it.

Start by making sure you have Beautiful Soup and Requests installed. The requests package can be installed using

pip install requests

Import the requests package, BeautifulSoup from bs4, and pandas (to create a new data frame). We have already identified the URL that we will be scraping data from. In the code below, there is a dictionary called headers. This is optional. Headers can help make your requests look more like a browser. If you choose to use a header, include it when you send your request to the URL. Otherwise, the request can be sent to the URL using requests.get().

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the NYPD precincts page
url = "https://www.nyc.gov/site/nypd/bureaus/patrol/precincts-landing.page"

# Send a GET request to the page
headers = {
    "User-Agent": "Mozilla/5.0 Chrome/87.0"
}
response = requests.get(url, headers=headers)

# Check if the request was successful
print(response.status_code)
if response.status_code != 200:
    print(f"Failed to retrieve page: Status code {response.status_code}")
200

Since the response status code was 200, meaning the request was successful, we can move on to the next step: parsing the table.

To start parsing, you have to call Beautiful Soup. When you pass response.text into Beautiful Soup, it takes the raw HTML of the webpage as a string. html.parser specifies the parsing engine used by Beautiful Soup to process the HTML. It is a built-in Python HTML parser that is fast and works well for most cases.

To identify which parts of the website you want to scrape, you can right-click on the website and click Inspect. This will show you the HTML of the page. The table can be found under the <table> element with the class rt. Using this information, have Beautiful Soup find the table using .find(). Within the table, the rows are identified by <tr> in the HTML. In each row, the name of the precinct and the address are found in <td> elements with the data labels Precinct and Address, respectively. From this, use .find_all('tr') to find all the rows in the table and then, within each row, extract the precinct and address.

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find the table with class "rt" which holds the precinct data
table = soup.find("table", {"class": "rt"})
    
# Lists to hold the extracted data
precinct_names = []
addresses = []
    
# Extract each row of the table (each row corresponds to one precinct)
for row in table.find_all("tr"):
  # Find the "Precinct" and "Address" columns by data-label attribute
  precinct_cell = row.find("td", {"data-label": "Precinct"})
  address_cell = row.find("td", {"data-label": "Address"})
        
  # If both cells are found, store their text content
  if precinct_cell and address_cell:
    precinct_names.append(precinct_cell.get_text(strip=True))
    addresses.append(address_cell.get_text(strip=True))

The extracted text is stripped of surrounding whitespace so that only the relevant content is included, and then appended to its list. Now that the data has been collected and cleaned, a new data frame can be created.

# Create a DataFrame with the extracted data
precincts_df = pd.DataFrame({
  "Precinct": precinct_names,
  "Address": addresses
})

# Display the DataFrame
print(precincts_df)
          Precinct                   Address
0     1st Precinct         16 Ericsson Place
1     5th Precinct       19 Elizabeth Street
2     6th Precinct        233 West 10 Street
3     7th Precinct        19 1/2 Pitt Street
4     9th Precinct         321 East 5 Street
..             ...                       ...
72  115th Precinct  92-15 Northern Boulevard
73  120th Precinct       78 Richmond Terrace
74  121st Precinct       970 Richmond Avenue
75  122nd Precinct      2320 Hylan Boulevard
76  123rd Precinct           116 Main Street

[77 rows x 2 columns]
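
To finish step 5 of the guide (storing the cleaned data), the data frame can be written out, for example to a CSV file; the filename here is only an illustration.

# Save the scraped table for later use (illustrative filename)
precincts_df.to_csv("nypd_precincts.csv", index=False)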

12.1.3.2 Selenium and Weather Data

Say you want to see whether the weather has an impact on the number or severity of crashes in New York City. Weather data for New York City can be found on Wunderground. Since information on weather is always being monitored and collected, the data that we want for a specific time period is held in the website's database. Therefore, the website is dynamic and Selenium can be used for web scraping.

The first step is to set up Selenium and the WebDriver. In this example, I use Chrome Driver. Options can be initialized with chrome_options = Options() for the Chrome browser. The options I used were --headless (which allows the browser to run without a visible window) and --disable-gpu (which can improve performance in headless mode).

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Set up ChromeDriver
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in headless mode (no browser UI)
chrome_options.add_argument("--disable-gpu")  # Disable GPU acceleration
chrome_options.add_argument("--no-sandbox")  # Required for some environments

Next, we need to find the path of the Chrome Driver. The following code is a cross-platform solution.

# Path to your ChromeDriver executable
# config_file_path = "config.txt"
# with open(config_file_path, 'r') as file:
#         chrome_driver_path = file.read().strip()

import os

def find_application_path(app_name):
    for path in os.environ["PATH"].split(os.pathsep):
        full_path = os.path.join(path, app_name)
        if os.path.isfile(full_path) and os.access(full_path, os.X_OK):
            return full_path
    return None

chrome_driver_path = find_application_path("chromedriver")
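
# Note: shutil.which() from the standard library performs an equivalent
# PATH lookup, e.g.:
#   import shutil
#   chrome_driver_path = shutil.which("chromedriver")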

The driver then can be initialized with the path and driver options.

service = Service(chrome_driver_path)

# Initialize the ChromeDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

Once Selenium and the WebDriver are set up, go to the page and find the target data. Just as in the Beautiful Soup example, go to the URL and identify the table that you want to scrape. In this case, I want the table at the bottom of the page that lists the daily observations of temperature, dew point, humidity, wind speed, pressure, and precipitation. The table is identified as <table> with the class days. In Selenium, driver.get(url) opens the webpage in the Chrome browser. Once the table has loaded, (By.CSS_SELECTOR, "table.days") selects the main data table by its CSS selector “table.days”, ensuring we’re targeting the right element.

# Define the target URL
url = f"https://www.wunderground.com/history/weekly/us/ny/new-york-city/KLGA/date/2024-6-30"

# Load the page
driver.get(url)

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for table to load
wait = WebDriverWait(driver, 15)
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.days")))

Within the table, the rows are identified by tr inside tbody, and the columns are in td.

# Initialize lists for each data type
dates = []
max_temps = []
min_temps = []
humidity_values = []
wind_speeds = []
pressure_values = []
precip_values = []

# Get all rows
rows = table.find_elements(By.CSS_SELECTOR, "tbody tr")
for row in rows:
    # Get all 'td' elements in the row
    columns = row.find_elements(By.TAG_NAME, "td")  
    # Extract text from each column
    row_data = [col.text.strip() for col in columns]  
    # Print the content of the row
    print("Row Data:", row_data)  # This will print the content of each row
Row Data: ['Jun\n30\n1\n2\n3\n4\n5\n6', 'Jun', '30', '1', '2', '3', '4', '5', '6', 'Max Avg Min\n101 79.7 73\n79 72.7 65\n83 75.1 67\n83 75.8 68\n85 77.3 72\n90 81.0 74\n92 78.7 72', 'Max', 'Avg', 'Min', '101', '79.7', '73', '79', '72.7', '65', '83', '75.1', '67', '83', '75.8', '68', '85', '77.3', '72', '90', '81.0', '74', '92', '78.7', '72', 'Max Avg Min\n74 70.3 59\n56 53.1 46\n56 49.5 45\n58 51.5 47\n69 63.0 52\n73 70.8 69\n73 70.9 69', 'Max', 'Avg', 'Min', '74', '70.3', '59', '56', '53.1', '46', '56', '49.5', '45', '58', '51.5', '47', '69', '63.0', '52', '73', '70.8', '69', '73', '70.9', '69', 'Max Avg Min\n91 74.9 36\n65 50.9 38\n55 41.3 29\n70 43.9 28\n85 61.8 48\n87 72.1 55\n91 78.0 54', 'Max', 'Avg', 'Min', '91', '74.9', '36', '65', '50.9', '38', '55', '41.3', '29', '70', '43.9', '28', '85', '61.8', '48', '87', '72.1', '55', '91', '78.0', '54', 'Max Avg Min\n20 10.2 5\n23 14.5 8\n12 8.7 6\n18 10.3 0\n15 9.3 3\n14 7.4 0\n17 7.6 0', 'Max', 'Avg', 'Min', '20', '10.2', '5', '23', '14.5', '8', '12', '8.7', '6', '18', '10.3', '0', '15', '9.3', '3', '14', '7.4', '0', '17', '7.6', '0', 'Max Avg Min\n30.0 29.9 29.8\n30.1 30.0 29.9\n30.2 30.2 30.1\n30.2 30.1 30.0\n30.0 29.9 29.8\n29.8 29.8 29.7\n29.9 29.8 29.8', 'Max', 'Avg', 'Min', '30.0', '29.9', '29.8', '30.1', '30.0', '29.9', '30.2', '30.2', '30.1', '30.2', '30.1', '30.0', '30.0', '29.9', '29.8', '29.8', '29.8', '29.7', '29.9', '29.8', '29.8', 'Total\n0.06\n0.11\n0.00\n0.00\n0.00\n0.03\n0.11', 'Total', '0.06', '0.11', '0.00', '0.00', '0.00', '0.03', '0.11']
Row Data: ['Jun']
Row Data: ['30']
Row Data: ['1']
Row Data: ['2']
Row Data: ['3']
Row Data: ['4']
Row Data: ['5']
Row Data: ['6']
Row Data: ['Max', 'Avg', 'Min']
Row Data: ['101', '79.7', '73']
Row Data: ['79', '72.7', '65']
Row Data: ['83', '75.1', '67']
Row Data: ['83', '75.8', '68']
Row Data: ['85', '77.3', '72']
Row Data: ['90', '81.0', '74']
Row Data: ['92', '78.7', '72']
Row Data: ['Max', 'Avg', 'Min']
Row Data: ['74', '70.3', '59']
Row Data: ['56', '53.1', '46']
Row Data: ['56', '49.5', '45']
Row Data: ['58', '51.5', '47']
Row Data: ['69', '63.0', '52']
Row Data: ['73', '70.8', '69']
Row Data: ['73', '70.9', '69']
Row Data: ['Max', 'Avg', 'Min']
Row Data: ['91', '74.9', '36']
Row Data: ['65', '50.9', '38']
Row Data: ['55', '41.3', '29']
Row Data: ['70', '43.9', '28']
Row Data: ['85', '61.8', '48']
Row Data: ['87', '72.1', '55']
Row Data: ['91', '78.0', '54']
Row Data: ['Max', 'Avg', 'Min']
Row Data: ['20', '10.2', '5']
Row Data: ['23', '14.5', '8']
Row Data: ['12', '8.7', '6']
Row Data: ['18', '10.3', '0']
Row Data: ['15', '9.3', '3']
Row Data: ['14', '7.4', '0']
Row Data: ['17', '7.6', '0']
Row Data: ['Max', 'Avg', 'Min']
Row Data: ['30.0', '29.9', '29.8']
Row Data: ['30.1', '30.0', '29.9']
Row Data: ['30.2', '30.2', '30.1']
Row Data: ['30.2', '30.1', '30.0']
Row Data: ['30.0', '29.9', '29.8']
Row Data: ['29.8', '29.8', '29.7']
Row Data: ['29.9', '29.8', '29.8']
Row Data: ['Total']
Row Data: ['0.06']
Row Data: ['0.11']
Row Data: ['0.00']
Row Data: ['0.00']
Row Data: ['0.00']
Row Data: ['0.03']
Row Data: ['0.11']

As you can see, the output is pretty messy. The next step is to pick out the important parts and extract their text. This can be done by identifying the indices of the rows that we want, using find_elements to find the corresponding tag, and then stripping the text and adding it to the relevant list.

# Process the first row which contains all the dates
date_row = rows[0].text.split('\n')
dates = [date for date in date_row if date.isdigit()][:7]  # Get first 7 dates

# Find temperature values (rows 10-16 contain the actual temperature data)
temp_rows = rows[10:17]  # Get rows 10-16
for row in temp_rows:
  cells = row.find_elements(By.TAG_NAME, "td")
  if len(cells) >= 3:
    max_temps.append(cells[0].text.strip())
    min_temps.append(cells[2].text.strip())

# Find humidity values (rows 18-24)
humidity_rows = rows[18:25]
for row in humidity_rows:
  cells = row.find_elements(By.TAG_NAME, "td")
  if len(cells) >= 2:
    humidity_values.append(cells[1].text.strip())

# Find wind speed values (rows 26-32)
wind_rows = rows[26:33]
for row in wind_rows:
  cells = row.find_elements(By.TAG_NAME, "td")
  if len(cells) >= 1:
    wind_speeds.append(cells[0].text.strip())

# Find pressure values (rows 42-48)
pressure_rows = rows[42:49]
for row in pressure_rows:
  cells = row.find_elements(By.TAG_NAME, "td")
  if len(cells) >= 1:
    pressure_values.append(cells[0].text.strip())

# Find precipitation values (rows 50-56)
precip_rows = rows[50:57]
for row in precip_rows:
  cells = row.find_elements(By.TAG_NAME, "td")
  if len(cells) >= 1:
    precip_values.append(cells[0].text.strip())

Once all the relevant data has been collected and cleaned, it can be added to a new dataframe.

import pandas as pd

# Create DataFrame
weather_data = pd.DataFrame({
    'Date': dates,
    'Max Temperature (°F)': max_temps,
    'Min Temperature (°F)': min_temps,
    'Humidity (%)': humidity_values,
    'Wind Speed (mph)': wind_speeds,
    'Pressure (in)': pressure_values,
    'Precipitation (in)': precip_values
})

print(weather_data)
driver.quit()
  Date Max Temperature (°F) Min Temperature (°F) Humidity (%)  \
0   30                  101                   73         70.3   
1    1                   79                   65         53.1   
2    2                   83                   67         49.5   
3    3                   83                   68         51.5   
4    4                   85                   72         63.0   
5    5                   90                   74         70.8   
6    6                   92                   72         70.9   

  Wind Speed (mph) Pressure (in) Precipitation (in)  
0               91          30.0               0.06  
1               65          30.1               0.11  
2               55          30.2               0.00  
3               70          30.2               0.00  
4               85          30.0               0.00  
5               87          29.8               0.03  
6               91          29.9               0.11  

Lastly, driver.quit() closes the browser.
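
Note that the scraped values are stored as strings. Before any analysis, such as relating the weather to crash counts, you would likely want to convert the numeric columns; a minimal sketch using the column names defined above:

# Convert the scraped text columns to numeric types
numeric_cols = ['Max Temperature (°F)', 'Min Temperature (°F)', 'Humidity (%)',
                'Wind Speed (mph)', 'Pressure (in)', 'Precipitation (in)']
weather_data[numeric_cols] = weather_data[numeric_cols].apply(pd.to_numeric, errors='coerce')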

12.1.4 A Note on Data Ethics

While web scraping is not explicitly illegal, it can get you in hot water if you are not careful. Web scraping is a powerful tool and it should be treated as such. Just because you can web scrape doesn’t always mean you should.

12.1.4.1 Why Web-Scraping Can Be Unethical

There are several reasons that web scraping may be deemed unethical.

  • The website you are trying to web scrape may not allow it.
  • The information being scraped is considered private information or intellectual property.
  • Sending too many requests at once can overwhelm the server and crash the website.

12.1.4.2 Some Tips to Help You Scrape Ethically

  • You can check whether a website allows web scraping in the terms of use section of the website or by checking the website’s robots.txt file to see who is allowed to access the site and what parts are available for scraping.
  • Always be mindful of what kind of information you are trying to collect and whether it is private information or intellectual property.
  • Never scrape from a website that requires login or payment.
  • Spread out your requests over time to prevent the website from crashing. If using Selenium, use WebDriverWait from selenium.webdriver.support.ui to wait for the page to load. Otherwise, use the time package to space out the requests. A minimal sketch of both ideas follows this list.
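
The sketch below uses the standard library’s urllib.robotparser to check whether a page may be fetched and time.sleep() to space out requests; the URL and delay are illustrative assumptions.

import time
import urllib.robotparser

# Check the site's robots.txt before scraping (illustrative URL)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

url = "https://www.example.com/some-page"
if rp.can_fetch("*", url):
    # ... send the request and parse the response here ...
    time.sleep(2)  # pause between requests to avoid overloading the server
else:
    print("Scraping this page is disallowed by robots.txt")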

12.2 Reinforcement Learning

This section was written by Qianruo Tan.

12.2.1 Introduction

  • Reinforcement learning (RL) is a type of machine learning.
  • Agents learn by interacting with the environment.
  • The goal is to maximize cumulative reward over time.
  • A well-known deep RL algorithm is the Deep Q-Network (DQN).
  • Reinforcement Learning (RL) vs. Supervised Learning (SL)

Supervised Learning (SL) and Reinforcement Learning (RL) share some similarities in their frameworks, but they differ fundamentally in how they learn and improve over time. In a supervised learning setting, training a neural network, such as learning to play a game, requires a substantial dataset with labeled examples. This involves recording the gameplay process, including inputs like key presses and even eye movements.

The network learns by observing how certain actions (inputs) lead to specific outcomes (labels), essentially mimicking human actions to predict what to do next. However, it only imitates what it’s been taught and does not learn to improve on its own beyond the data it’s given.

In contrast, reinforcement learning does not rely on a predefined dataset with target labels. Instead, it uses a policy network that transforms input frames from the game into actions. The training process in RL starts with a network that has no prior knowledge. It receives input frames directly from the game environment and must figure out what actions to take to maximize rewards over time. The simplest way to train this policy network is through policy gradients, where the model learns by interacting with the environment and receiving feedback.

However, RL has a downside: if the model encounters a failure, it might incorrectly generalize that a certain strategy is bad, leading to what is known as the credit assignment problem, where it struggles to properly attribute which actions led to success or failure. This means the network may become overly cautious, reducing its exploration of potentially good strategies that initially resulted in failure.

12.2.2 Actual usages & Scopes & Limitations

12.2.2.1 Actual usages

  • Gaming and Simulations (AlphaGo)
  • Combined use with Bayesian optimization
  • Robotics and Automation
  • Self-driving Cars
  • Finance and Trading
  • Personalization and Recommendations

12.2.2.2 Scopes & Limitations

  • Reinforcement Learning vs. Evolutionary Methods

Evolutionary methods cannot utilize real-time feedback from actions, making them less efficient in dynamic environments where immediate learning is advantageous.

  • Policy Gradient Methods

Unlike evolutionary methods, policy gradients interact with the environment to improve performance, allowing more efficient use of detailed feedback from individual interactions.

  • Misunderstanding of Optimization

Optimization in RL is about improving performance incrementally, not guaranteeing the best possible outcome in every scenario.

12.2.3 Q-Learning

12.2.3.1 Q-learning Overview

  • Aims to learn the optimal action-value function \(q_*(s, a)\), regardless of the policy being followed during learning.
  • The main objective is to approximate this function through a learning process where the agent interacts with its environment, receiving rewards and updating its knowledge of state-action pairs.

12.2.3.2 Q-learning Update Rule

The core formula of Q-learning is based on the Bellman equation for the action-value function: \[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right], \] where

  • \(Q(s, a)\): The current Q-value for taking action \(a\) in state \(s\).

  • \(\alpha\): The learning rate, set to 0.1 in the example below, which controls how much new information overrides the old information.

  • \(R\): The reward received after taking action \(a\) in state \(s\).

  • \(\gamma\): The discount factor, set to 0.9 in the example below, which determines the importance of future rewards.

  • \(\max_{a'} Q(s', a')\): The maximum Q-value for the next state \(s'\) across all possible actions \(a'\). This term represents the best possible future reward from the next state.

  • \(Q(s, a)\) (the initial term on the right side): The old Q-value for the current state-action pair, which is being updated.
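
As a quick numerical check with made-up values, suppose \(\alpha = 0.1\), \(\gamma = 0.9\), the current estimate is \(Q(s, a) = 0\), the step reward is \(R = -1\), and the best Q-value in the next state is \(\max_{a'} Q(s', a') = 10\). The update then gives \[ Q(s, a) \leftarrow 0 + 0.1 \left[ -1 + 0.9 \times 10 - 0 \right] = 0.8. \]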

12.2.3.3 Key Components of Q-learning

Off-policy Learning:

  • Q-learning is an off-policy method: the agent learns the optimal policy independently of the behavior policy used to select actions during training (e.g., an \(\epsilon\)-greedy strategy).

12.2.3.4 Common supporting tools

Environment and State Management:

Functions get_next_state() and get_reward()

Q-Table Initialization:

q_table = np.zeros((grid_size, grid_size, 4)) 

12.2.3.5 Key Concepts

Neural Network: A computational model inspired by the human brain, consisting of layers of interconnected nodes (neurons). It learns to recognize patterns in data by adjusting weights through multiple training iterations.

Deep Q-Learning (DQN): An algorithm that combines Q-Learning with deep neural networks to approximate the Q-value function, allowing the agent to choose optimal actions in complex environments with high-dimensional state spaces.

Policy Network: A neural network designed to output the best action to take in a given state. It maps input states directly to actions, enabling the agent to decide what to do without relying on a value function.

Policy Gradient: An optimization technique in reinforcement learning where the agent learns the best policy by directly adjusting its parameters to maximize the cumulative reward, rather than estimating value functions.

12.2.3.6 Policy Concepts

Behavior Policy: The strategy that the agent uses to explore the environment and collect experiences. It often includes some randomness to encourage exploration of different actions.

Target Policy: The policy that the agent is trying to optimize. In some algorithms, like Q-learning, it is used to determine the best action to take based on learned values.

Epsilon-Greedy Policy: A strategy used to balance exploration and exploitation. With a small probability (epsilon), the agent chooses a random action to explore, and with the remaining probability (1 - epsilon), it chooses the best-known action based on current knowledge.

12.2.4 An Example

12.2.4.1 Grid

The environment is a 4×4 grid. The agent starts in (0,0) and aims to reach (3,3).

12.2.4.2 Environment Setup

The following chunk sets up the environment that the agent is in. I also set a random seed for reproducibility.

## Import packages
import numpy as np
import random

## Set the random seed for reproducibility
random.seed(3255)
np.random.seed(3255)

## Define the environment
grid_size = 4
start_state = (0, 0)
goal_state = (3, 3)
obstacles = []

12.2.4.3 Hyperparameters

The following chunk is used to set key reinforcement learning hyperparameters, including the learning rate, discount factor, exploration rate, and the number of training episodes.

## Define the hyperparameters for our value function
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 0.2  # Exploration rate
num_episodes = 1000  # Try 1000 times

12.2.4.4 Q-table Initialization

The following chunk initializes a Q-table with zeros to store the value estimates for each state-action pair in a grid, and defines the available actions (up, down, left, right) for the agent to choose from.

## Initialize the Q-table
q_table = np.zeros((grid_size, grid_size, 4))  
## 4 possible actions: up, down, left, right
## The output Q-table follows this order of 4 directions

## Define the action space
actions = ["up", "down", "left", "right"]

12.2.4.5 State Transitions and Reward Calculation

The following chunk defines functions to determine the agent’s next state based on an action (preventing it from leaving the grid boundaries) and to assign a numerical reward depending on whether the state is a goal, an obstacle, or a general location.

def get_next_state(state, action):
    i, j = state
    if action == "up":
        return (max(i - 1, 0), j)
    elif action == "down":
        return (min(i + 1, grid_size - 1), j)
    elif action == "left":
        return (i, max(j - 1, 0))
    elif action == "right":
        return (i, min(j + 1, grid_size - 1))
    
def get_reward(state):
    if state == goal_state:
        return 10
    elif state in obstacles:
        return -5
    else:
        return -1

  • Move up: decrease row index (i) by 1, but don’t go out of bounds (minimum row index is 0).
  • Move down: increase row index (i) by 1, but don’t go out of bounds (maximum row index is grid_size - 1).
  • Move left: decrease column index (j) by 1, but don’t go out of bounds (minimum column index is 0).
  • Move right: increase column index (j) by 1, but don’t go out of bounds (maximum column index is grid_size - 1).

12.2.4.6 Action Selection (Epsilon-Greedy Policy)

The following chunk implements an epsilon-greedy strategy for action selection, randomly exploring with probability epsilon or exploiting the action with the highest Q-value otherwise.

def choose_action(state):
    ## Epsilon-greedy action selection
    if random.uniform(0, 1) < epsilon:
        return random.choice(actions)  # Explore
    else:
        ## Exploit (choose the action with the highest Q-value)
        state_q_values = q_table[state[0], state[1], :]
        return actions[np.argmax(state_q_values)]

state represents the agent’s current position on the grid (given as a tuple (i, j)), so there are 4 x 4 = 16 possible states. The function uses this state to decide whether to explore (choose a random action) or exploit (choose the best-known action based on the Q-table). The Q-values for that particular state are retrieved using q_table[state[0], state[1], :].

12.2.4.7 Q-L Algorithm (Main Loop)

The following chunk runs the Q-learning algorithm over multiple episodes, choosing actions based on an epsilon-greedy policy, observing the resulting reward and next state, then updating the Q-values using the Q-learning equation until it reaches the goal state.

## Q-learning Algorithm
for episode in range(num_episodes):
    state = start_state
    while state != goal_state:
        action = choose_action(state)
        next_state = get_next_state(state, action)
        
        ## Get the reward for moving to the next state
        reward = get_reward(next_state)
        
        ## Update Q-value using the Q-learning formula
        current_q_value = q_table[
            state[0], state[1], actions.index(action)
            ]
        max_future_q_value = np.max(
            q_table[next_state[0], next_state[1], :]
            )
        new_q_value = current_q_value + alpha * (
            reward + gamma * max_future_q_value - current_q_value
            )
        q_table[state[0], state[1], actions.index(action)] = new_q_value
        
        state = next_state

state is initially set to start_state (e.g., (0, 0)). Within the loop, the choose_action(state) function is called to select an action based on the agent’s current position. The agent then moves to a new next_state using the chosen action, and the current state is updated to next_state. Throughout this loop, state always represents the agent’s current position on the grid as it progresses toward the goal_state.

12.2.4.8 Result

The following chunk prints out the final learned Q-values, showing the agent’s estimated values for each state-action pair after training.

## Display the learned Q-values
print("Learned Q-Table:")
print(q_table)
Learned Q-Table:
[[[ 0.61922635  1.7634678   0.61163903  1.8098    ]
  [ 1.7988677   3.122       0.60561032  3.11191456]
  [ 1.21191045  4.57986592  0.81220571  0.9359679 ]
  [ 0.60944922  5.3724181  -0.53983336  0.25476897]]

 [[-0.44402797 -0.38478777  0.66294145  3.12143333]
  [ 1.78608503  4.53127148  1.7430146   4.58      ]
  [ 3.07526195  6.2         3.0995568   6.18414026]
  [ 1.73344552  7.99997146  3.22568878  2.73360563]]

 [[-0.45222003 -0.77836467 -0.67934652  3.05464906]
  [ 1.92390397  0.26491207  0.18598014  6.19978945]
  [ 4.55721914  7.8668745   4.54619388  8.        ]
  [ 6.16475403 10.          6.18052412  7.95907134]]

 [[-0.42453346 -0.3940399  -0.3940399  -0.04672549]
  [ 0.50001115 -0.12967628 -0.2248309   4.06371215]
  [-0.1         3.00770286  0.          9.98689979]
  [ 0.          0.          0.          0.        ]]]

This is the direct output of this example. There are three layers of brackets, and each layer has a different meaning:

  • First layer of brackets: represents the rows of the grid. Each block corresponds to a row in the Q-table (i.e., a row position in the grid environment).
  • Second layer of brackets: represents the columns of the grid. Each subarray represents a specific state in that row (i.e., a specific position in the grid, such as (0,0), (1,1), etc.).
  • Third layer of brackets: represents the Q-values for each action in that state. Each element is the Q-value of a specific action in that state (i.e., the four actions: up, down, left, right).
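
For example, to read off the greedy action for a single state directly from the table (here the start state (0, 0), using the actions list defined earlier):

## Best action in the start state according to the learned Q-table
state = (0, 0)
best_action = actions[np.argmax(q_table[state[0], state[1], :])]
print(f"Best action in {state}: {best_action}")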

12.2.4.9 Route visualization

The following chunk visualizes the learned policy on a grid by drawing arrows in each cell to indicate the best action according to the Q-table, while highlighting the start, goal, and obstacle positions.

import matplotlib.pyplot as plt

# Define directional arrows for actions: up, down, left, right
directions = {0: '↑', 1: '↓', 2: '←', 3: '→'}

# Visualization of the best actions on the grid
fig, ax = plt.subplots(figsize=(6, 6))
ax.set_xticks(np.arange(0.5, grid_size, 1))
ax.set_yticks(np.arange(0.5, grid_size, 1))
ax.grid(True, which='both')
ax.set_xticklabels([])
ax.set_yticklabels([])

# Highlight Start, Goal, and Obstacles
ax.add_patch(
    plt.Rectangle(
        start_state, 1, 1, fill=True, color='yellow', alpha=0.3
        )
    )
ax.add_patch(
    plt.Rectangle(
        goal_state, 1, 1, fill=True, color='green', alpha=0.3
        )
    )

# Highlight obstacles in red
for obstacle in obstacles:
    ax.add_patch(
        plt.Rectangle(
            obstacle, 1, 1, fill=True, color='red', alpha=0.5
            )
        )

# Add arrows (text-based) for the best actions
for i in range(grid_size):
    for j in range(grid_size):
        if (i, j) in obstacles:
            continue  # Skip drawing arrows for obstacle locations
        q_values = q_table[i, j, :]
        best_action_idx = np.argmax(q_values)
        best_action = directions[best_action_idx]  
        # Text-based direction arrows (↑, ↓, ←, →)

        # Adding text-based arrows at the grid cells
        ax.text(
            j + 0.5, i + 0.5, best_action, ha='center', va='center', fontsize=20
            )

# Draw grid lines and adjust axis
plt.xlim(0, grid_size)
plt.ylim(0, grid_size)
plt.gca().invert_yaxis()  
# Invert Y-axis to display it in the correct orientation
plt.show()

12.2.4.10 Grid with obstacles

Because the earlier setup reuses the same states and random seed, rerunning it produces exactly the same output. To show something different, I rewrote the example with obstacles.

import random
import numpy as np
import matplotlib.pyplot as plt

## Set random seed for reproducibility
random.seed(3255)
np.random.seed(3255)

## Define the environment
grid_size = 4
start_state = (0, 0)
goal_state = (3, 3)
obstacles = [(1, 1), (2, 2)]  # Ensure obstacles are unique

## Define directions for visualization
directions = {0: '↑', 1: '↓', 2: '←', 3: '→'}

12.2.4.11 Run the grid with obstacles

The following defines reward and next-state functions that incorporate the obstacles and the goal, and then visualizes the learned Q-values on a grid by marking start, goal, and obstacle cells, as well as displaying the agent’s best action in each non-obstacle cell using directional arrows. Note that at this point q_table still holds the values learned without obstacles; the training loop in the next section updates it using the new obstacle-aware functions.

## Function to get reward
def get_reward(state):
    if state == goal_state:
        return 10
    elif state in obstacles:
        return -5
    else:
        return -1

## Function to get the next state based on the action
def get_next_state(state, action):
    i, j = state
    if action == "up":
        next_state = (max(i - 1, 0), j)
    elif action == "down":
        next_state = (min(i + 1, grid_size - 1), j)
    elif action == "left":
        next_state = (i, max(j - 1, 0))
    elif action == "right":
        next_state = (i, min(j + 1, grid_size - 1))
    
    ## Prevent moving into obstacles
    if next_state in obstacles:
        return state  
        # Stay in the same position if the next state is an obstacle
    else:
        return next_state

## Visualization function
def plot_grid_with_obstacles():
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.clear()  # Clear the plot to avoid overlap
    ax.set_xticks(np.arange(0.5, grid_size, 1))
    ax.set_yticks(np.arange(0.5, grid_size, 1))
    ax.grid(True, which='both')
    ax.set_xticklabels([])
    ax.set_yticklabels([])

    ## Highlight Start, Goal, and Obstacles
    ax.add_patch(
        plt.Rectangle(
            start_state, 1, 1, fill=True, color='yellow', alpha=0.3
            )
        )
    ax.add_patch(
        plt.Rectangle(
            goal_state, 1, 1, fill=True, color='green', alpha=0.3
            )
        )

    ## Highlight obstacles in red
    for obstacle in set(obstacles):  # Use a set to ensure uniqueness
        ax.add_patch(
            plt.Rectangle(
                obstacle, 1, 1, fill=True, color='red', alpha=0.5
                )
            )

    ## Add arrows for the best actions from Q-table
    for i in range(grid_size):
        for j in range(grid_size):
            if (i, j) in obstacles:
                continue  # Skip arrows for obstacle locations
            q_values = q_table[i, j, :]
            best_action_idx = np.argmax(q_values)
            best_action = directions[best_action_idx]  
            # Directional arrows (↑, ↓, ←, →)
            ax.text(
                j + 0.5, i + 0.5, best_action, ha='center', va='center', fontsize=20
                )

    ## Draw grid lines and adjust axis
    plt.xlim(0, grid_size)
    plt.ylim(0, grid_size)
    plt.gca().invert_yaxis()  
    # Invert Y-axis for correct orientation
    plt.show()

## Call the visualization function
plot_grid_with_obstacles()

12.2.5 Rewards and Penalties

The following chunk trains a Q-learning agent over multiple episodes, updating Q-values as it navigates from a start state to a goal state, stores the cumulative reward obtained in each episode, and finally plots how the total earned reward per episode evolves over time.

## Initialize list to store cumulative rewards for each episode
cumulative_rewards = []

for episode in range(num_episodes):
    state = start_state
    episode_reward = 0  # Track total reward for the current episode
    
    while state != goal_state:
        action = choose_action(state)
        next_state = get_next_state(state, action)
        
        reward = get_reward(next_state)
        episode_reward += reward  # Accumulate reward for this episode
        
        # Update Q-value
        current_q_value = q_table[
            state[0], state[1], actions.index(action)
            ]
        max_future_q_value = np.max(
            q_table[next_state[0], next_state[1], :]
            )
        new_q_value = current_q_value + alpha * (
            reward + gamma * max_future_q_value - current_q_value
            )
        q_table[state[0], state[1], actions.index(action)] = new_q_value
        
        state = next_state  # Move to the next state
    
    cumulative_rewards.append(episode_reward)  
    # Store cumulative reward for this episode

## Visualization of Cumulative Rewards
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(range(num_episodes), cumulative_rewards, 
            label='Cumulative Reward per Episode')
plt.xlabel("Episode")
plt.ylabel("Cumulative Reward")
plt.title("Cumulative Rewards and Penalties Over Episodes")
plt.legend()
plt.show()
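
The raw per-episode rewards can be noisy. As an optional sketch building on the cumulative_rewards list above, a simple moving average makes the trend easier to see; the window size of 50 is an arbitrary choice.

## Optional: smooth the reward curve with a simple moving average
window = 50
smoothed = np.convolve(cumulative_rewards, np.ones(window) / window, mode='valid')

plt.figure(figsize=(10, 5))
plt.plot(range(num_episodes), cumulative_rewards, alpha=0.4,
         label='Cumulative Reward per Episode')
plt.plot(range(window - 1, num_episodes), smoothed,
         label=f'{window}-Episode Moving Average')
plt.xlabel("Episode")
plt.ylabel("Cumulative Reward")
plt.legend()
plt.show()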
