12.1 Filling Missing Data With Multiple Imputation
This section was prepared by Jacob Schlessel, a junior double majoring in Statistical Data Science and Economics.
12.1.1 Introduction
It is rare to work with a dataset that does not have missing values. This section will introduce multiple imputation as a method of filling these values and justify its use over other methods of imputation. The following topics will be covered:
Assessing Missingness
Benefits of Imputation
Limitations of Imputation
Methods of Single Imputation
Multiple Imputation
12.1.2 Assessing Missingness
Before filling in missing values, we must determine whether imputation makes sense given the context of the data. As a starting point, we can classify the missingness of the data into one of three different types. To illustrate the differences between these cases, we will use the following example:
A lab technician conducts a study on the relationship between age and blood pressure. Measurements are taken from a randomly selected group of 1000 volunteers, with all age groups being included in the study.
Missing completely at random (MCAR): the ideal case - missingness is independent of both observed and unobserved data
A computer glitch causes 10 patients’ blood pressure values to go unrecorded
Missingness is random noise - estimates are not biased
Missing at random (MAR): the second best case - missingness depends only on observed variables
Older patients are more likely to skip the blood pressure test because the cuff feels uncomfortable
Results are biased if you ignore the missing values
Missing not at random (MNAR): the worst case - missingness depends on the unobserved (missing) value itself
Patients with very high blood pressure intentionally refuse the test for fear of bad news
Standard imputation is not valid - MNAR requires advanced techniques like sensitivity analyses or pattern-mixture models
Once we understand what type of missingness exists in the data, we have two options:
Complete Case Analysis: Drop all rows with missing data and only include complete rows in analysis
Imputation: Replace missing values with specific values, include filled values in analysis
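In pandas terms, the two options are each a one-liner. A minimal sketch with a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 51, np.nan, 46],
                   "bp": [120, np.nan, 135, 128]})

complete = df.dropna()                           # complete case analysis
imputed = df.fillna(df.mean(numeric_only=True))  # simple (mean) imputation
print(complete, imputed, sep="\n\n")
```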
12.1.2.1 Application in Python:
We will use the airquality dataset from statsmodels as an example of working with missing data.
```python
import statsmodels.api as sm
import pandas as pd

# Load the dataset
df = sm.datasets.get_rdataset("airquality").data

# Summarize missing data
missing_count = df.isna().sum()
missing_percent = (df.isna().mean() * 100).round(2)
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing %': missing_percent
}).sort_values('Missing %', ascending=False)
print(missing_summary)
```
We cannot perform standard imputation if the data are MNAR. Since a dataset’s missingness falls into only one of the three types, a good starting point for ruling out MNAR is to look for evidence of MCAR or MAR. We do not actually need to know which of those two cases applies, since imputation works with both; we just need to be able to confidently claim that the missingness is not associated with the unobserved values and thus will not bias inference.
To find evidence of MCAR or MAR, we can use the following framework to analyze patterns of missingness:
Define the missingness target (the variable with missing values)
Perform statistical tests (t-tests for numeric, chi-squared tests for categorical) - note: this is not always conclusive
Find correlations between the target and other variables. If the correlation is above 0.2, this could be evidence of MAR and warrants further investigation
Use domain knowledge to perform in-depth analysis to rule out the possibility of MNAR missingness. Note: it is impossible to definitively prove or disprove MNAR based on the data because by definition, we do not have data to analyze.
We begin with t-tests comparing each numeric variable across rows where Ozone is missing versus observed:
Null hypothesis: The mean of each variable is the same in the missing and nonmissing Ozone groups
Alternative hypothesis: The mean of each variable differs between the missing and nonmissing Ozone groups
```python
from scipy.stats import ttest_ind

# Create indicator: 1 if Ozone is missing, else 0
df['Ozone_missing'] = df['Ozone'].isna().astype(int)

for col in ['Solar.R', 'Wind', 'Temp']:
    group1 = df.loc[df['Ozone_missing'] == 0, col].dropna()
    group2 = df.loc[df['Ozone_missing'] == 1, col].dropna()
    stat, p = ttest_ind(group1, group2, equal_var=False)
    mean1 = group1.mean()
    mean2 = group2.mean()
    print(f"{col}:")
    print(f"  Mean (Ozone observed) = {mean1:.2f}")
    print(f"  Mean (Ozone missing) = {mean2:.2f}")
    print(f"  p-value = {p:.4f}\n")
```
Solar.R:
Mean (Ozone observed) = 184.80
Mean (Ozone missing) = 189.51
p-value = 0.7846
Wind:
Mean (Ozone observed) = 9.86
Mean (Ozone missing) = 10.26
p-value = 0.5446
Temp:
Mean (Ozone observed) = 77.87
Mean (Ozone missing) = 77.92
p-value = 0.9787
The p-values are all large, so we fail to reject the null hypothesis: there is no evidence that the means differ depending on whether Ozone is missing.
Since Month is a categorical variable, we can perform a chi-squared test to check whether the number of missing values depends on the month:
Null hypothesis: The number of missing Ozone values does not depend on the month
Alternative hypothesis: The number of missing Ozone values depends on the month in some way
```python
from scipy.stats import chi2_contingency

# Create contingency table: Month × Missingness
contingency = pd.crosstab(df["Month"], df["Ozone_missing"])
print("Contingency Table (Month vs. Ozone Missingness):\n")
print(contingency)

# Perform chi-square test of independence
chi2, p, dof, expected = chi2_contingency(contingency)
print("\nChi-square test results:")
print(f"Chi-square statistic = {chi2:.3f}")
print(f"Degrees of freedom = {dof}")
print(f"p-value = {p:.4f}")
```
June has noticeably more missing Ozone values than the other months, and the chi-squared test confirms that this difference is statistically significant, with a p-value near 0. We can confirm this visually as well:
```python
import matplotlib.pyplot as plt

# Plot number of missing ozone observations by month
missing_counts = (
    df[df["Ozone_missing"] == 1]["Month"]
    .value_counts()
    .sort_index()
)
missing_counts = missing_counts.reindex([5, 6, 7, 8, 9], fill_value=0)

plt.figure(figsize=(6, 4))
plt.bar(
    missing_counts.index.astype(str),
    missing_counts.values,
    color="lightsteelblue",
    edgecolor="black"
)
plt.xlabel("Month (May = 5, ..., September = 9)")
plt.ylabel("Number of Missing Ozone Values")
plt.title("Missing Ozone Values by Month")
plt.grid(axis="y", linestyle="--", alpha=0.6)
plt.tight_layout()
plt.show()
```
Since the number of missing values of Ozone depends on observed data, we have determined that the missingness is MAR and we can continue with imputation.
We can follow the same process for analyzing the missingness of Solar.R.
```python
df['Solar_missing'] = df['Solar.R'].isna().astype(int)  # 1 = missing

# t-tests for numeric variables
for col in ['Ozone', 'Wind', 'Temp']:
    g1 = df.loc[df['Solar_missing'] == 0, col].dropna()
    g2 = df.loc[df['Solar_missing'] == 1, col].dropna()
    stat, p = ttest_ind(g1, g2, equal_var=False)
    print(col)
    print(f"Mean (Solar.R observed) = {g1.mean():.2f}")
    print(f"Mean (Solar.R missing) = {g2.mean():.2f}")
    print(f"p-value = {p:.4f}\n")

# chi-squared test for the categorical variable (Month)
cont = pd.crosstab(df['Month'], df['Solar_missing'])
chi2, p, dof, exp = chi2_contingency(cont)
print(f"Month vs Solar.R missingness: chi2={chi2:.2f}, p={p:.4f}")

# Bar chart of counts by month
missing_counts = (
    df[df["Solar_missing"] == 1]["Month"]
    .value_counts()
    .sort_index()
)
missing_counts = missing_counts.reindex([5, 6, 7, 8, 9], fill_value=0)

plt.figure(figsize=(6, 4))
plt.bar(
    missing_counts.index.astype(str),
    missing_counts.values,
    color="lightsteelblue",
    edgecolor="black"
)
plt.xlabel("Month (May = 5, ..., September = 9)")
plt.ylabel("Number of Missing Solar.R Values")
plt.title("Missing Solar.R Values by Month")
plt.grid(axis="y", linestyle="--", alpha=0.6)
plt.tight_layout()
plt.show()
```
Similarly to Ozone, it appears that Solar.R is MAR since the missing values are related to the month during which the data was measured.
12.1.2.2 Important Notes
Consider MNAR if domain knowledge suggests it
Make sure to evaluate case-specific context to ensure imputation is appropriate. For example, using statistical imputation methods for the missing zip codes in the NYC Crash Data would not have made sense
The proportion of missing data alone should not dictate whether imputation is used; use your best judgment
12.1.3 Benefits of Imputation
Imputation replaces missing values with estimated ones to create a complete dataset. This process offers a number of benefits for modeling and analysis.
12.1.3.1 Preserving information
If we were to perform complete case analysis, we would lose the valuable information provided by the rows that may be missing values for one or two variables. By making an educated guess as to what the value would have been, we can
Maintain real-world relationships observed among variables
Preserve sample size to build models with higher precision
Conduct robust analysis with messy data
Enable the use of statistical methods and machine learning models
12.1.3.2 Reducing bias
If the data are MAR, deletion will bias the results of any statistical modeling because the missing cases are systematically different from observed cases. To ensure that our data is representative of what we are studying, we can fill missing values with estimates based on relationships that exist in our observed data. Imputation preserves covariances and joint distributions, so the result is lower bias in model coefficients.
Note: not all imputation methods have the same effect on reducing bias. Multiple imputation tends to minimize bias more than single imputation methods when data is MAR.
12.1.4 Limitations of Imputation
Imputation does not completely solve issues caused by missing data; it only alleviates them. It is important to note the following limitations of imputation to better understand when it is appropriate to use:
Simple imputation methods distort distributions and can impact visualizations
Need to acknowledge that imputing missing data adds some uncertainty
Cannot correct cases where data are MNAR
Can be very computationally expensive depending on method of imputation and complexity of data
12.1.5 Methods of Single Imputation
Single Imputation: Replacing each missing value with a single, fixed estimate
Types of single imputation:
Mean Imputation: Replacing missing values with the mean of the observed data
Median Imputation: Replacing missing values with the median of observed data
Mode Imputation: Replace missing values with the mode (for filling categorical data)
Hot-Deck Imputation: Replace a missing value with an observed value (a “donor” from a similar case)
For each method, we will fit a regression model with Temp as the response variable and Ozone, Solar.R, Wind, and Month as predictors. We will find the value of the coefficients and their standard errors after applying each method of imputation to compare the results.
12.1.5.1 Mean Imputation
We will replace all missing Ozone and Solar.R values with the respective means from the observed data:
```python
# Encode Month as categorical and create dummy variables
df["Month"] = df["Month"].astype("category")
df = pd.get_dummies(df, columns=["Month"], drop_first=True)
df = df.astype(float)

# Assign predictor variables
x_var = ["Ozone", "Solar.R", "Wind", "Month_6", "Month_7", "Month_8", "Month_9"]

# Apply mean imputation and fit the regression model
df_mean = df.copy()
for col in ["Ozone", "Solar.R"]:
    df_mean[col] = df_mean[col].fillna(df_mean[col].mean())

y = df_mean["Temp"]
X = sm.add_constant(df_mean[x_var])
model = sm.OLS(y, X).fit()
results = pd.DataFrame({
    "Coefficient": model.params.round(4),
    "Std_Error": model.bse.round(4)
})
print(results)
```
Mean imputation may be sufficient if less than 5% of a variable’s values are missing
Limitations:
Underestimates variances -> does not add any spread to the data, so standard errors and confidence intervals are too small
Reduces natural variability
Performs poorly if data is not MCAR
Shrinks covariance -> coefficients are biased towards zero
12.1.5.2 Median Imputation
We can fit the same model, this time using median imputation:
```python
df_median = df.copy()

# Apply median imputation
for col in ["Ozone", "Solar.R"]:
    df_median[col] = df_median[col].fillna(df_median[col].median())

# Define predictors and response
y = df_median["Temp"]
X = sm.add_constant(df_median[x_var])

# Fit OLS regression
model = sm.OLS(y, X).fit()

# Display coefficients and standard errors only
results = pd.DataFrame({
    "Coefficient": model.params.round(4),
    "Std_Error": model.bse.round(4)
})
print(results)
```
Flattens the distribution by clustering values at the center
12.1.5.3 Hot-Deck Imputation
Hot-deck imputation fills each missing value with the value from a similar observation in the current dataset. The “donor” can be chosen randomly or conditionally. Here, we will fill missing values with similar observations from the same month, since earlier we found that Month was related to the number of missing values of both Ozone and Solar.R.
```python
import numpy as np

df_hot = df.copy()
rng = np.random.default_rng(42)

def hot_deck_impute(df, target_col, month_dummies):
    # Reconstruct Month number from dummies (5-9)
    df = df.copy()
    df["Month_num"] = 5
    for i, col in enumerate(month_dummies, start=6):
        df.loc[df[col] == 1, "Month_num"] = i

    # Impute missing values within the same month
    for i in df[df[target_col].isna()].index:
        month = df.at[i, "Month_num"]
        donor_pool = df.loc[
            (df["Month_num"] == month) & (~df[target_col].isna()),
            target_col
        ]
        if donor_pool.empty:
            donor_pool = df.loc[~df[target_col].isna(), target_col]
        df.at[i, target_col] = rng.choice(donor_pool.values)
    return df.drop(columns="Month_num")

# Apply to missing targets and fit the model
month_dummies = ["Month_6", "Month_7", "Month_8", "Month_9"]
for col in ["Ozone", "Solar.R"]:
    df_hot = hot_deck_impute(df_hot, col, month_dummies)

y = df_hot["Temp"]
X = sm.add_constant(df_hot[x_var])
model = sm.OLS(y, X).fit()
results = pd.DataFrame({
    "Coefficient": model.params.round(4),
    "Std_Error": model.bse.round(4)
})
print(results)
```
Requires enough donor candidates to avoid highly biased results
Can propagate outliers
12.1.6 Multiple Imputation
Instead of replacing the missing cases with a single guess, multiple imputation creates multiple guesses to incorporate some randomness that mimics real world uncertainty.
Multiple Imputation by Chained Equations (MICE): the most widely used implementation of multiple imputation
Define one target variable and start with simple guesses (mean/median imputation) for the missing cases of the other variables
Fit a regression model for the target using all other variables as predictors
Impute the missing values of the target variable with the estimate generated by the regression
Repeat for all variables that have missing data, using your newly imputed values as training data
Repeat the process m times using different random draws to create m complete datasets. In each cycle, the regression coefficients are sampled from their posterior distribution - essentially, the coefficients are randomly drawn from the distribution of their plausible values
Fit the final model m times using the complete datasets and pool results using Rubin’s Rules to combine estimates and obtain valid inference
Rubin’s Rules: formulas for combining the m regression results into one final set of estimates, correctly accounting for within-imputation variance and between-imputation variance.
Pooled Estimate (Mean of Coefficients)
\[
\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i
\]
\(Q_i\): estimate (regression coefficient) from imputed dataset i
\(\bar{Q}\): overall pooled estimate (average of all estimates)
Within-Imputation Variance: captures the uncertainty within each model
\[
\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i
\]
\(U_i\): estimated variance from dataset i
Between-Imputation Variance: captures uncertainty across imputations
\[
B = \frac{1}{m-1} \sum_{i=1}^{m} (Q_i - \bar{Q})^2
\]
Total Variance: combines both sources of uncertainty; the pooled standard error is \(\sqrt{T}\)
\[
T = \bar{U} + \left(1 + \frac{1}{m}\right) B
\]
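The code for this procedure does not appear in the excerpt; here is a minimal sketch using statsmodels’ MICE implementation, which refits the same Temp regression on each imputed dataset and pools the results with Rubin’s Rules (the column rename and the burn-in/imputation counts are illustrative choices):

```python
from statsmodels.imputation import mice

# Reload airquality and rename Solar.R so it is a valid formula name
df_mi = sm.datasets.get_rdataset("airquality").data.rename(
    columns={"Solar.R": "Solar_R"})

imp = mice.MICEData(df_mi)  # chained-equations imputer for all missing columns
fml = "Temp ~ Ozone + Solar_R + Wind + C(Month)"
model = mice.MICE(fml, sm.OLS, imp)
results = model.fit(n_burnin=10, n_imputations=20)  # pooled via Rubin's Rules
print(results.summary())
```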
Notice that the standard errors for this method are larger than those from the single imputation methods. This is good -> we are correctly accounting for the increased uncertainty caused by missing data
12.1.7 Conclusion
Imputation is an excellent tool for producing valid inference from datasets with missing data
Before performing any type of imputation, you should get very familiar with your data and assess what type of missingness it exhibits
The model results generated from using MICE have less bias but increased variance
12.2 Web Scraping
This section was prepared by Sahil Patel, an undergraduate junior majoring in Computer Science, Statistical Data Science, and Economics.
12.2.1 An Introduction to Web Scraping
12.2.1.1 What is Web Scraping
Web scraping is a way to automatically gather information from websites. Instead of spending hours copying and pasting data, you can use scripts to collect large amounts of structured information quickly and consistently. Once you have the data, it can be analyzed, stored, or used for research and decision making.
Web scraping is particularly useful because most web data is created for humans to read, not for machines to process. By scraping, we can turn that information into a format that computers can work with, allowing us to explore trends, track changes, or gain insights that would be really hard to collect manually.
12.2.1.2 Use Cases of Web Scraping
Here are some of the use cases of web scraping:
Market Analysis
Companies use web scraping to track products, prices, and customer reviews, helping them understand market trends and stay competitive.
Finance and Investment
Analysts gather stock prices, financial news, and reports to make informed investment decisions or to analyze market sentiment.
Academic Research
Researchers collect articles, datasets, and public records to support studies and uncover insights across a variety of fields.
Social Media and Marketing
Marketers and analysts monitor trends, hashtags, and audience engagement to understand consumer behavior and improve campaigns.
12.2.1.3 Static versus Dynamic Web Pages
Before scraping, it is important to understand the kind of web pages you are working with:
Static Web Pages
The content is already in the HTML when the page loads. Everything you need is there, so extracting it is relatively straightforward.
Dynamic Web Pages
The content is generated by JavaScript or updated after the page loads. Fetching the HTML alone may not give you what you want. You may need to interact with the page to see the full content.
12.2.1.4 Legal and Ethical Considerations
Even though web scraping can be extremely powerful, it comes with important responsibilities and best practices that should not be ignored.
Respect robots.txt: Before scraping a site, check its robots.txt file to understand which pages or sections the website owner allows you to access, and make sure to follow these guidelines.
Avoid overloading servers: Sending too many requests too quickly can strain a website’s server. Introduce pauses between requests, use random delays, and keep your scraping activity at a reasonable pace to avoid causing problems for the website (see the sketch after this list).
Do not collect sensitive information: Avoid scraping personal, confidential, or protected information unless you have explicit permission to do so, as this could violate privacy laws or ethical standards.
Check copyright and licensing: Just because you can access the data does not mean you have the right to use or redistribute it freely. Always review the site’s terms of use and any applicable copyright rules.
Be transparent: When using scraped data in reports, analyses, or projects, clearly cite your sources and acknowledge where the information came from. Transparency helps maintain credibility and respect for the original content creators.
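As a minimal sketch of the pacing advice above (the URLs are placeholders):

```python
import random
import time

import requests

urls = ["https://example.com/page1", "https://example.com/page2"]
for url in urls:
    response = requests.get(url)
    # ... parse the response here ...
    time.sleep(random.uniform(1, 3))  # pause 1-3 seconds between requests
```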
12.2.2 Basics of Web Requests
Before you can extract any data from a website, you need to first fetch the content of the page. In Python, this is typically done using the requests library, which allows you to send HTTP requests and receive responses. Understanding how these requests work is key to effective web scraping.
If you haven’t done so already, you will need to install requests from your terminal.
```
pip install requests
```
12.2.2.1 HTML Basics
To scrape web pages effectively, it is important to understand the basic structure of HTML, which is the language used to create web pages. Web pages are made up of elements enclosed in tags. Some common tags include:
<body>: this tag identifies the main body of the website, which contains the content that is visible to users.
<table>: This tag identifies a table element, which is used to organize data in rows and columns.
<tbody>: This tag identifies the body of the table, which contains all the rows of the table.
<tr>: This tag identifies a table row, which contains individual cells (<td>) with data.
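To make the nesting concrete, here is a toy table written as a Python string (so it can later be parsed with the tools introduced below):

```python
# A toy example of the tag hierarchy described above
html = """
<body>
  <table>
    <tbody>
      <tr><td>Row 1, Cell 1</td><td>Row 1, Cell 2</td></tr>
      <tr><td>Row 2, Cell 1</td><td>Row 2, Cell 2</td></tr>
    </tbody>
  </table>
</body>
"""
```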
12.2.2.2 Response Status Codes
Every HTTP request returns a status code that indicates whether it was successful.
200 means the request was successful, and the server returned the requested content.
404 means the page was not found, which usually indicates the URL is incorrect.
403 means access is forbidden, which often happens when a server blocks automated requests.
500 indicates a server error on the website.
Checking these codes allows you to handle errors gracefully and avoid scraping pages that are unavailable or blocked.
12.2.2.3 HTTP Methods
The two most common HTTP methods used in web scraping are GET and POST.
GET requests are used to retrieve data from a web server. This is the method you will use most often because you are usually trying to download the content of a page.
POST requests are used to send data to a server, often when submitting forms or interacting with a website. Some websites require POST requests to access certain content or search results.
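As a quick sketch, each method is a single call with the requests library (the endpoints and form field below are placeholders):

```python
import requests

# GET: retrieve the content of a page
page = requests.get("https://example.com/products")

# POST: send form data to the server, e.g., a search query
result = requests.post("https://example.com/search",
                       data={"query": "web scraping"})
print(page.status_code, result.status_code)
```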
12.2.2.4 Using Headers and User-Agent Strings
Web servers sometimes treat automated requests differently from requests made by real browsers. By adding headers, you can make your requests look more like they are coming from a normal user.
A common header to include is the User-Agent, which identifies the browser and device making the request. Here is an example of what a header can look like:
```python
import requests

url = 'https://uconn.edu'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.status_code)  # prints the HTTP status code
print(response.text[:500])   # prints the first 500 characters of the HTML
```
12.2.3 BeautifulSoup
BeautifulSoup is one of the most popular Python libraries for web scraping. It helps you collect and organize information from web pages written in HTML or XML. When you look at a web page’s source code, it often looks like a confusing block of text. BeautifulSoup turns that chaos into a clean, organized structure that you can easily search and navigate. It is a reliable tool for collecting data from static websites such as news articles, course lists, or research archives.
12.2.3.1 Installing and Importing BeautifulSoup
To use BeautifulSoup, you first need to install it.
```
pip install beautifulsoup4
```
Once installed, import it into your Python script:
```python
from bs4 import BeautifulSoup
```
BeautifulSoup works well alongside the requests library, creating a simple and efficient way to download and process web pages.
12.2.3.2 Loading HTML into BeautifulSoup
Once you have downloaded the HTML content of a web page, the next step is to load it into BeautifulSoup.
url ="https://decisiondrivers.com/nyc/zip-code/"headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ""(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"}response = requests.get(url, headers=headers)print("Status code:", response.status_code)soup = BeautifulSoup(response.text, "html.parser")
Status code: 200
The requests.get() function fetches the page, and BeautifulSoup() parses it into a structure that can be easily searched. The "html.parser" argument tells BeautifulSoup how to interpret the HTML. The resulting soup object becomes your workspace for exploring and extracting information. You can now search for tags like <table>, <tr>, or <td> to extract the data you need.
Since the status code prints 200, that means the request worked successfully and the page’s content is ready to scrape.
12.2.3.3 Navigating and Searching the Parse Tree
BeautifulSoup organizes the HTML into a tree-like structure that you can move through. Each tag, paragraph, or link becomes something you can directly access.
The most common ways to search are find(), find_all(), and select().
find() – finds the first instance of a tag
find_all() – finds all instances of a tag
select() – finds elements using CSS selectors
```python
# Find the first heading on the page
heading = soup.find('h1')

# Find all links on the page
links = soup.find_all('a')

# Use a CSS selector to find specific items
nav_links = soup.select('nav a')
```
12.2.3.4 Extracting Data
Before writing any scraping code, it’s helpful to understand where the data actually sits within the webpage. You can right-click anywhere on the site and choose Inspect to open your browser’s developer tools. This reveals the HTML structure of the page, where information is organized using tags such as <table>, <tr>, and <td>. These tags define how the data is arranged, helping you identify exactly which elements you’ll need to target with BeautifulSoup.
For example, on the [Decision Drivers NYC ZIP Codes](https://decisiondrivers.com/nyc/zip-code/) page, the ZIP code information is not stored inside a table but listed under each borough heading. Every borough name appears as a heading tag, such as <h2>, followed by a list of ZIP codes written in plain text or inside <ul> and <li> tags. Instead of looping through table rows, you can search for each heading, record the borough name, and then capture the ZIP codes that appear beneath it. This structure works well with BeautifulSoup because it allows you to identify the borough from the heading and pair it with its corresponding ZIP codes by reading the text that follows.
```python
import pandas as pd

boroughs = []
zips = []

# Split the entire text into lines, then look for boroughs and ZIPs
lines = soup.get_text("\n", strip=True).splitlines()
current_borough = None

for line in lines:
    text = line.strip()

    # If the line is a borough name, update the current borough
    if text.upper() in ["MANHATTAN", "BROOKLYN", "QUEENS", "BRONX", "STATEN ISLAND"]:
        current_borough = text.title()
        continue

    # If the line is a 5-digit number, it's a ZIP code
    if current_borough and text.isdigit() and len(text) == 5:
        boroughs.append(current_borough)
        zips.append(text)

# Create a clean DataFrame
df = pd.DataFrame({"Borough": boroughs, "ZIP": zips})
df = df.drop_duplicates().sort_values(["Borough", "ZIP"]).reset_index(drop=True)

# Display the first few rows
print(df.head(10))
print("\nTotal ZIP codes found:", len(df))
```
This is an example that will be useful for our midterm assignment. Here, we scrape the addresses of all the precincts in NYC.
url ="https://www.nyc.gov/site/nypd/bureaus/patrol/precincts-landing.page"headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ""(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"}response = requests.get(url, headers=headers)print("Status code:", response.status_code)soup = BeautifulSoup(response.text, "html.parser")# Step 1: Locate the table containing the precinct informationtable = soup.find("table", class_="rt")# Step 2: If the table is found, extract all precinct rowsif table: rows = table.find_all("tr")else:print("No table found on this page.") rows = []# Step 3: Prepare a list to store precinct data as dictionariesprecinct_data = []# Step 4: Loop through each row and collect precinct name and addressfor row in rows: precinct_cell = row.find("td", attrs={"data-label": "Precinct"}) address_cell = row.find("td", attrs={"data-label": "Address"})if precinct_cell and address_cell: precinct_name = precinct_cell.text.strip() address = address_cell.text.strip() precinct_data.append({"Precinct": precinct_name,"Address": address })precincts_df = pd.DataFrame(precinct_data)# Step 6: Display a quick summaryprint(f"Extracted {len(precincts_df)} precincts successfully.")print(precincts_df.head())
Status code: 200
Extracted 78 precincts successfully.
Precinct Address
0 1st Precinct 16 Ericsson Place
1 5th Precinct 19 Elizabeth Street
2 6th Precinct 233 West 10 Street
3 7th Precinct 19 1/2 Pitt Street
4 9th Precinct 321 East 5 Street
12.2.4 An Introduction to Selenium
While BeautifulSoup works best for static websites, not all web pages are created equal. Many modern sites use JavaScript to load content after the initial HTML has already been sent to the browser. This means that if you use requests and BeautifulSoup, you might get an empty page or partial data. This is where you use Selenium.
Selenium is a powerful browser automation tool that lets you control an actual web browser, like Chrome or Firefox, through Python. It can click buttons, fill out forms, scroll through pages, and load JavaScript rendered content before scraping it. You can then pass the fully loaded page to BeautifulSoup for parsing.
12.2.4.1 Installing and Setting Up Selenium
To use Selenium, you first need to install it.
```
pip install selenium
```
Selenium also requires a WebDriver, which acts as a bridge between Python and your browser. If you are using Google Chrome, download ChromeDriver and make sure it is installed correctly. You can verify this by typing the following command in your terminal:
```
chromedriver --version
```
If you are on macOS and see the error “chromedriver cannot be opened because the developer cannot be verified,” you can fix it by removing macOS’s quarantine attribute from the chromedriver binary, for example:
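```
# Removes macOS's quarantine flag; adjust the path if chromedriver
# lives elsewhere (this path matches the one found later in this section)
xattr -d com.apple.quarantine /usr/local/bin/chromedriver
```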
Now we can work on the code itself. This is how we set up Selenium:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import os
```
Before launching Chrome, it’s common to set a few options that control how it behaves. For scraping, “headless” mode is useful because it runs Chrome without opening a window. It is faster and less distracting.
```python
# Create Chrome options to control browser behavior
options = Options()

# Run Chrome invisibly (no window)
options.add_argument("--headless")

# Disable GPU acceleration for smoother headless operation
options.add_argument("--disable-gpu")

# Helps avoid permission issues in some environments
options.add_argument("--no-sandbox")
```
Instead of typing out the full path manually, this function looks through your system’s PATH to find where chromedriver is installed.
```python
def find_chromedriver():
    # Loop through all directories listed in the system PATH
    for path in os.environ["PATH"].split(os.pathsep):
        # Build the potential path to chromedriver
        driver_path = os.path.join(path, "chromedriver")
        # Check if it exists and is executable
        if os.path.isfile(driver_path) and os.access(driver_path, os.X_OK):
            return driver_path  # Return the path if found
    return None  # Return None if not found

# Run the function to locate chromedriver
chromedriver_path = find_chromedriver()

# If found, print its path; if not, raise an error
if chromedriver_path:
    print("Found ChromeDriver at:", chromedriver_path)
else:
    raise FileNotFoundError("ChromeDriver not found.")
```
Found ChromeDriver at: /usr/local/bin/chromedriver
Once you have the driver path and your browser settings, you can start Chrome.
```python
# Import the Service class and connect the driver
service = Service(chromedriver_path)

# Start the Chrome WebDriver with the defined options
driver = webdriver.Chrome(service=service, options=options)
```
At this point, Selenium is ready. You can visit a webpage by calling:
```python
driver.get("https://example.com")
```
12.2.4.2 Using Selenium to Load Dynamic Pages
Here is a simple example that shows how Selenium can open a website, wait for it to load, and then pass the content to BeautifulSoup.
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

def scrape_schedule(url, league):  # league = "NFL" or "MLB"
    league = league.upper()

    # Identify the correct team link pattern based on the league
    team_href = "/nfl/team/_/name" if league == "NFL" else "/mlb/team/_/name"

    # Configure and launch Chrome
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1920,1080")
    driver = webdriver.Chrome(options=options)

    # Load the ESPN schedule page
    driver.get(url)
    wait = WebDriverWait(driver, 25)

    # Wait until key elements are visible (date headers and team links)
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "div.Table__Title")))
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, f"a.AnchorLink[href*='{team_href}']")))

    rows_out = []

    # Extract data by date section
    # Each schedule day starts with a <div class="Table__Title"> element
    for title in driver.find_elements(By.CSS_SELECTOR, "div.Table__Title"):
        date_text = title.text.strip()
        if not date_text:
            continue  # Skip if the date header is empty

        # The table directly after the title contains the matchups
        try:
            table = title.find_element(By.XPATH, "following::table[1]")
        except Exception:
            continue  # Skip if structure differs

        # Extract teams for each game
        for row in table.find_elements(By.XPATH, ".//tr[contains(@class,'Table__TR')]"):
            teams = [
                a.text.strip()
                for a in row.find_elements(
                    By.CSS_SELECTOR, f"td a.AnchorLink[href*='{team_href}']")
                if a.text.strip()
            ]
            if len(teams) >= 2:
                # Format as "Away @ Home"
                rows_out.append({"Date": date_text,
                                 "Matchup": f"{teams[0]} @ {teams[1]}"})

    # Clean up and close the browser
    driver.quit()

    # Build a clean DataFrame
    df = pd.DataFrame(rows_out, columns=["Date", "Matchup"])
    df["League"] = league
    return df

# Run for both leagues
nfl_df = scrape_schedule("https://www.espn.com/nfl/schedule", "NFL")
mlb_df = scrape_schedule("https://www.espn.com/mlb/schedule", "MLB")

print("===== NFL =====")
print(nfl_df)
print("\n===== MLB =====")
print(mlb_df)
```
===== NFL =====
Date Matchup League
0 Thursday, November 20, 2025 Buffalo @ Houston NFL
1 Sunday, November 23, 2025 Pittsburgh @ Chicago NFL
2 Sunday, November 23, 2025 New England @ Cincinnati NFL
3 Sunday, November 23, 2025 New York @ Detroit NFL
4 Sunday, November 23, 2025 Minnesota @ Green Bay NFL
5 Sunday, November 23, 2025 Seattle @ Tennessee NFL
6 Sunday, November 23, 2025 Indianapolis @ Kansas City NFL
7 Sunday, November 23, 2025 New York @ Baltimore NFL
8 Sunday, November 23, 2025 Cleveland @ Las Vegas NFL
9 Sunday, November 23, 2025 Jacksonville @ Arizona NFL
10 Sunday, November 23, 2025 Philadelphia @ Dallas NFL
11 Sunday, November 23, 2025 Atlanta @ New Orleans NFL
12 Sunday, November 23, 2025 Tampa Bay @ Los Angeles NFL
13 Monday, November 24, 2025 Carolina @ San Francisco NFL
===== MLB =====
Date Matchup League
0 Saturday, November 1, 2025 Los Angeles @ Toronto MLB
Here is how the code works:
Setting Up Selenium
The first part initializes the Chrome browser in headless mode so it runs quietly in the background.
Waiting for the Page to Load
Selenium doesn’t automatically know when a website’s content is ready. Using WebDriverWait and expected_conditions, the script pauses until:
A date header (div.Table__Title) appears
A team link with a specific pattern (/nfl/team/_/name or /mlb/team/_/name) exists
Looping Through Each Day’s Games
Each date on the ESPN page appears as a header (Table__Title), followed by a table of games. The script finds each table using the XPath rule "following::table[1]", meaning the first table that comes after the header.
Extracting Teams from Each Row
Inside each table, every row (tr) represents one game. The team names are stored in <a> tags with URLs containing the league’s team name pattern. Selenium grabs these links and combines them into "Away @ Home" format.
Creating a DataFrame
Once all data is collected, it’s turned into a Pandas DataFrame with columns:
Date: the game date from the header
Matchup: formatted team pairing
League: either NFL or MLB
Closing the Browser
Always close the browser using driver.quit() to free up system resources.
12.3 Uniform Manifold Approximation and Projection
This section was prepared by Mark Zhang, an undergraduate at UCONN.
12.3.1 Introduction
Uniform Manifold Approximation and Projection (UMAP) was introduced by Leland McInnes, John Healy, and James Melville in 2018 (McInnes, Healy, and Melville 2018). UMAP is a powerful dimensionality reduction technique designed to visualize high-dimensional data in a low-dimensional space, such as 2D or 3D. Unlike linear methods like PCA, UMAP captures nonlinear structures, preserving local clusters while also maintaining the overall global patterns of the data. UMAP is utilized for clustering, data visualization, and data exploration.
12.3.2 Other Dimensionality Reduction Techniques
12.3.2.1 Principal Component Analysis (PCA)
Linear technique; projects data along directions of maximum variance.
Preserves global structure fairly well but may lose local neighborhood details.
Good for visualizing 2-3 PCs but cannot capture nonlinear structures.
12.3.2.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)
Computes pairwise similarities between points and maps them onto a t-distribution in the low-dimensional space.
Nonlinear technique; preserves local structure very well.
Can distort global relationships.
Often slower than UMAP for large datasets.
12.3.3 Algorithm Approach and Intuition
UMAP is a graph-based dimensionality reduction algorithm rooted in algebraic topology. It represents data as a fuzzy graph and finds a low-dimensional embedding that preserves both local neighborhoods and global structure.
12.3.3.1 Mathematical Foundations
Riemannian Manifold - UMAP assumes the data lies on a Riemannian manifold, a smoothly curved surface embedded in high-dimensional space.
Simplicial Sets - From algebraic topology, a simplex is the simplest geometric building block (a point, line, triangle, etc.), and a simplicial complex describes how these pieces connect to form a space. Geometrically a k-simplex is a very simple way to build a k-dimensional object.
12.3.3.2 Learning the manifold structure
UMAP builds a graph as a discrete approximation of the manifold’s structure. To aid in visualizing the concepts, I will show how they apply to a noisy sine wave data set.
Test data set of a noisy sine wave
How do we represent this data as a manifold? UMAP takes the concepts of simplicial sets and topology and adapts them to fuzzy simplicial sets and metric spaces.
12.3.3.2.1 Finding nearest neighbors
For each point we find the nearest neighbors. The n_neighbors hyperparameter controls how local or global we want our structure to be. A large number of neighbors represents the manifold more accurately as a whole, while a smaller number captures fine details of the structure.
Fuzzy circles to 4th nearest neighbor
12.3.3.2.2 Constructing a neighborhood graph
UMAP represents high-dimensional data as a fuzzy simplicial set, capturing the local connectivity of points. For each point, a fuzzy membership weight is assigned to its neighbors: \[
w_{ij} = \exp\Bigg(-\frac{d_{ij} - \rho_i}{\sigma_i}\Bigg),
\]
where:
\(d_{ij}\) is the distance between points \(i\) and \(j\),
\(\rho_i\) is the distance to the closest neighbor of \(i\),
\(\sigma_i\) is a local scaling factor.
In other words, each point adjusts its influence based on the local density of its neighborhood. Dense regions, where points are very close together, have smaller \(\sigma_i\), which reduces the weight of individual connections. Sparse regions, with points farther apart, have larger \(\sigma_i\), increasing the weight of each connection.
Since \(w_{i \to j}\) and \(w_{j \to i}\) can differ, UMAP combines them using a probabilistic union:
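\[
w_{ij} = w_{i \to j} + w_{j \to i} - w_{i \to j}\, w_{j \to i}
\]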
This produces a symmetric weight between 0 and 1 that preserves local topology, balances contributions across dense and sparse regions, and yields a uniform manifold representation.
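As a toy numeric sketch (made-up distances and \(\sigma\), not UMAP’s internal solver), the directed weights and their probabilistic union look like this:

```python
import numpy as np

d = np.array([0.5, 0.8, 1.2, 2.0])    # distances from point i to its neighbors
rho = d.min()                          # distance to the closest neighbor
sigma = 0.7                            # local scale (UMAP solves for this per point)
w_i_to_j = np.exp(-(d - rho) / sigma)  # directed fuzzy weights; nearest gets 1

w_j_to_i = np.array([0.9, 0.4, 0.3, 0.1])          # weights in the other direction
w_sym = w_i_to_j + w_j_to_i - w_i_to_j * w_j_to_i  # probabilistic union
print(w_sym)                           # symmetric weights in (0, 1]
```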
Fuzzy weighted graph with UMAP-style combined weights
12.3.3.3 Finding a low-dimensional representation
After building the fuzzy simplicial set (the weighted neighborhood graph), UMAP tries to embed the points in a low-dimensional space (e.g., 2D or 3D) while preserving the relationships encoded in the graph. This is done by optimizing a cost function.
12.3.3.3.1 Minimum Distance
The min_dist parameter controls how close points can be in the low-dimensional embedding:
Small min_dist - points can be closer together, resulting in dense clusters that preserve fine local structure.
Large min_dist - points are forced farther apart, producing spread-out clusters that highlight global relationships.
In other words, min_dist adjusts the tightness of clusters in the embedding without changing the underlying topology of the data.
12.3.3.3.2 Minimizing the cost function (Cross-Entropy)
After constructing the fuzzy simplicial set and initializing the low-dimensional embedding, UMAP positions the points by minimizing a cross-entropy-like cost function, defined below.
To compute the similarity between points in the low-dimensional embedding, UMAP defines a corresponding low-dimensional weight for each pair of points:
\[
w_{ij}^{(\text{low})} = \frac{1}{1 + a \|y_i - y_j\|^{2b}},
\]
where:
\(\|y_i - y_j\|\) is the Euclidean distance between points \(y_i\) and \(y_j\) in the low-dimensional space (2D or 3D),
\(a\) and \(b\) are hyperparameters that control the shape of the attraction/repulsion curve.
The cross-entropy cost function is then defined as: \[
C = \sum_{i \neq j} \Big[ w_{ij} \log \frac{1}{w_{ij}^{(\text{low})}} +
(1 - w_{ij}) \log \frac{1}{1 - w_{ij}^{(\text{low})}} \Big],
\]
where:
\(w_{ij}\): high-dimensional fuzzy weight between points \(i\) and \(j\),
\(w_{ij}^{(\text{low})}\): corresponding low-dimensional embedding weight.
Intuition:
If two points are strongly connected in high dimensions (large \(w_{ij}\)), the embedding tries to place them close together.
If two points are weakly connected (small \(w_{ij}\)), the embedding tries to place them far apart.
UMAP uses stochastic gradient descent to minimize this cost function, adjusting the coordinates of points in low dimensions so that the embedding preserves both local neighborhoods and global manifold structure.
12.3.4 Practical Example: UMAP on 28x28 digit data
First, install the necessary dependency:
```
pip install umap-learn
```
Before applying UMAP, let’s look at the raw data. The MNIST dataset from scikit-learn contains 70,000 images of handwritten digits, each represented as a 28×28 grayscale image (784 features per sample).
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml

# Load MNIST
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data[:2000]
y = mnist.target[:2000].astype(int)

# Reshape to 28×28 images
images = X.reshape(-1, 28, 28)

# Show first 12 digits in a 3×4 grid
fig, axes = plt.subplots(3, 4, figsize=(6, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(images[i], cmap='gray')
    ax.set_title(y[i])
    ax.axis('off')
plt.tight_layout()
plt.show()
```
Handwritten digit images (28×28) from MNIST dataset
We now project the 784-dimensional data into a 2D embedding using UMAP. The parameters below control the embedding:
n_neighbors: how many nearby points are used to estimate local structure.
min_dist: how tightly UMAP packs points in the low-dimensional space.
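The embedding step itself is not shown in the excerpt; here is a minimal sketch reusing the X and y arrays loaded above (the n_neighbors, min_dist, and random_state values are illustrative choices):

```python
import umap
import matplotlib.pyplot as plt

# Project the 784-dimensional digits into 2D
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X)   # shape: (n_samples, 2)

# Scatter plot of the embedding, colored by digit label
plt.figure(figsize=(8, 6))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(scatter, label="Digit")
plt.title("2D UMAP Embedding of MNIST Digits")
plt.show()
```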
In the 2D UMAP embedding of the digits dataset, each point represents a digit image, colored by its label. The plot shows that UMAP successfully captures the structure of high-dimensional data: points of the same digit cluster tightly, while visually similar digits appear near each other, preserving meaningful relationships.
Same digits - form tight clusters, showing that UMAP preserves local neighborhood structure and keeps similar points together.
Similar digits (e.g., 3 & 8) - positioned near each other, reflecting their visual similarity in shape and style.
Different clusters - mostly separate, indicating clear class separation and that UMAP maintains distinctions between digit types.
Global layout - reflects the relative positions of all clusters, preserving the overall structure of the dataset.
Takeaway: UMAP reveals both local and global patterns, making high-dimensional data interpretable in 2D.
12.3.5 Conclusion
UMAP is a nonlinear dimensionality reduction technique that effectively captures both local and global structure in high-dimensional data. By constructing a fuzzy neighborhood graph and optimizing a low-dimensional embedding, it preserves cluster relationships, highlights visually similar points, and maintains overall manifold structure. This makes UMAP a powerful, interpretable tool for visualization and exploratory data analysis.
12.4 Using R and Python Together
This section was prepared by Wilson Tang, an undergraduate junior as of fall 2025 pursuing a single degree in statistical data science.
12.4.1 Motivations
R and Python are the two languages we have been exposed to most at UConn. I’ve had moments using one where I think about how the other might be useful, so one might want to incorporate both languages. I want to show how we can properly incorporate R and Python together and when each language may be better.
Data science workflow:
Understanding the data science workflow will help us decide if we want to use R and Python together, or just individually.
Identify problem
Gather data
Clean & preprocess data
Data exploration
Feature engineering / external data
Statistical testing
Modeling & machine learning
Evaluation
Visualization
12.4.2 Corresponding packages
For most tasks, we can find corresponding packages in both Python and R that can generally accomplish the same things. Of course, keep in mind that exceptions occur for very specific tasks, and you should make decisions based on the scope of your work.
Cleaning / Preprocessing Data
R: dplyr
Python: pandas
Exploratory Data Analysis
R: built-in functions, dplyr, ggplot2
Python: pandas, matplotlib, seaborn, plotnine
Feature Engineering
R: recipes (tidymodels), forcats, lubridate
Python: sklearn, pandas
Evaluation
R: yardstick, caret
Python: sklearn, yellowbrick
Visualization
R: ggplot2
Python: plotnine, matplotlib, seaborn
12.4.3 R / Python advantages
R:
Statistical testing: t-tests, ANOVA, chi-squared, and Shapiro-Wilk are common tests built into R. R also offers the Durbin test and a Fisher’s exact test that works for any r x c contingency table.
Econometrics, time-series and biostatistics.
Python:
Being able to handle bigger data
Machine learning and AI
sklearn for machine learning
TensorFlow / Keras / PyTorch for deep learning
XGBoost / LightGBM / CatBoost for gradient boosting
Pipelines
Versatile data handling
APIs - Better for large scale / complex integration (‘requests’, ‘httpx’, ‘aiohttp’)
Web Scraping - Better for large scale / dynamic web scraping (‘BeautifulSoup’, ‘Scrapy’, ‘Selenium’)
Databases - SQL, NoSQL and big platforms
Cloud data - R has limited options for this and sometimes relies on Python libraries
Real-time data - R has very limited support for this
12.4.4 Setup both R and Python
Make sure you have your own Python:
Windows comes with its own Python stub; however, we want to install our own distribution because it will prevent errors in the following steps.
Find the Extensions tab on the left toolbar, search for “R language”, and install the R extension for VS Code. Then you can press Ctrl+Shift+P and type R: Create R terminal.
Calling R in Python:
We will use the package rpy2 to call R from Python; there are also other packages that can call R, check Additional resources.
Calling R from Python is a Python-first method and is generally recommended if the user is going to be writing mostly Python. However, one can also call Python from R. Check Additional resources for packages in R that allow us to call Python from R.
Reproducibility:
Since we are using two languages at once, there will be packages from both R and Python. Of course, we never want to compromise on reproducibility, so what should we do? The best method is to use Conda.
A single Conda environment can specify both versions of Python as well as R, while also allowing us to keep the packages of both in one place.
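For instance, a single environment holding both stacks could be created like this (the environment name and package list are illustrative):

```
conda create -n ids-env -c conda-forge python=3.12 r-base r-gamlss rpy2 pandas seaborn matplotlib
conda activate ids-env
```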
12.4.5 Example
Prerequisites:
In order to render these class notes, you will need a Python virtual environment with the packages from requirements.txt, and the gamlss package installed in your R global library.
GAMLSS stands for Generalized Additive Models for Location, Scale, and Shape. It allows all parameters of the response distribution to depend on explanatory variables, so each parameter can be modeled as a function of predictors. This model is useful when variability, skewness, or kurtosis changes with the predictors.
Load Python libraries:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```
Load rpy2 to call R interface:
```python
from rpy2 import robjects
from rpy2.robjects import pandas2ri, Formula
import rpy2.robjects.packages as rpackages
from rpy2.robjects.conversion import localconverter
```
Call gamlss through rpy2; by default, rpy2 looks for the package in the R global library:
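```python
# A minimal sketch: load the gamlss R package through rpy2's importr
gamlss = rpackages.importr("gamlss")
```

The steps below also use a penguins data frame that the excerpt never loads. A minimal assumption is seaborn’s built-in penguins dataset, renamed to the short column names (mass, flipper, bill) used later:

```python
# Assumed setup: build the penguins data frame used in the rest of
# this example from seaborn's built-in dataset
penguins = (
    sns.load_dataset("penguins")
    .dropna()
    .rename(columns={"body_mass_g": "mass",
                     "flipper_length_mm": "flipper",
                     "bill_length_mm": "bill"})
)
```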
Send a data frame from pandas to R and have it converted to an R data frame using rpy2:
```python
with localconverter(robjects.default_converter + pandas2ri.converter):
    r_df = robjects.conversion.py2rpy(penguins)
```
Here we are saying that the mean of mass depends on flipper, bill, species, and sex, while the variability of mass depends on flipper. Run gamlss on the R data frame and extract whatever we need, in this case the fitted \(\sigma\):
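The gamlss call itself does not appear in the excerpt; here is a minimal sketch under the assumptions above (rpy2’s importr maps R’s sigma.formula argument to sigma_formula, and family="NO" for a normal response is an assumption):

```python
# A sketch of the model described above; fitted_sigma feeds the
# conversion step below
mu_formula = Formula("mass ~ flipper + bill + species + sex")
fit = gamlss.gamlss(mu_formula,
                    sigma_formula=Formula("~ flipper"),
                    family="NO",
                    data=r_df)
fitted_sigma = robjects.r["fitted"](fit, "sigma")
```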
GAMLSS-RS iteration 1: Global Deviance = 4718.449
GAMLSS-RS iteration 2: Global Deviance = 4718.415
GAMLSS-RS iteration 3: Global Deviance = 4718.415
Send the fitted values (or anything else extracted in the previous step) back to pandas:
```python
with localconverter(robjects.default_converter + pandas2ri.converter):
    penguins["fitted_sigma"] = robjects.conversion.rpy2py(fitted_sigma)
```
Use the extracted values for anything of interest in Python. Here, let’s say we wanted to use matplotlib and seaborn to plot our fitted \(\sigma\):
```python
plt.figure(figsize=(8, 6))
sns.scatterplot(x="flipper", y="fitted_sigma", hue="species",
                data=penguins, alpha=0.6)
sns.lineplot(x="flipper", y="fitted_sigma", hue="species",
             data=penguins, lw=2)
plt.title("GAMLSS: Predicted Variability vs Flipper Length")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Predicted Body Mass Variability (grams)")
plt.legend(title="Species")
plt.show()
```
We see here that as flipper length increases, predicted body mass variability tends to decrease.