12  Advanced Topics

12.1 Filling Missing Data With Multiple Imputation

This section was prepared by Jacob Schlessel, a junior double majoring in Statistical Data Science and Economics.

12.1.1 Introduction

It is rare to work with a dataset that does not have missing values. This section will introduce multiple imputation as a method of filling these values and justify its use over other methods of imputation. The following topics will be covered:

  1. Assessing Missingness

  2. Benefits of Imputation

  3. Limitations of Imputation

  4. Methods of Single Imputation

  5. Multiple Imputation

12.1.2 Assessing Missingness

Before filling in missing values, we must determine whether imputation makes sense given the context of the data. As a starting point, we can classify the missingness of the data into one of three different types. To illustrate the differences between these cases, we will use the following example:

A lab technician conducts a study on the relationship between age and blood pressure. Measurements are taken from a randomly selected group of 1000 volunteers, with all age groups being included in the study.

  1. Missing completely at random (MCAR): the ideal case - missingness is independent of both observed and unobserved data
    • A lab technician accidentally forgets to record 10 patients’ blood pressure values because of a computer glitch
    • Missingness is random noise - estimates are not biased
  2. Missing at random (MAR): the second best case - missingness depends only observed variables
    • Older patients are more likely to skip the blood pressure test because the cuff feels uncomfortable
    • Results are biased if you ignore the missing values
  3. Missing not at random (MNAR): the missing values depend on the unobserved(missing) value itself
    • Patients with very high blood pressure intentionally refuse the test for fear of bad news
    • Imputation is not possible - requires advanced techniques like sensitivity analyses or pattern-mixture modes

Once we understand what type of missingness exists in the data, we have two options:

  • Complete Case Analysis: Drop all rows with missing data and only include complete rows in analysis
  • Imputation: Replace missing values with specific values, include filled values in analysis

12.1.2.1 Application in Python:

We wll use the airquality dataset from statsmodels as an example of working with missing data.

import statsmodels.api as sm
import pandas as pd

# Load the dataset
df = sm.datasets.get_rdataset("airquality").data

# Summarize missing data
missing_count = df.isna().sum()

missing_percent = (df.isna().mean() * 100).round(2)

missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing %': missing_percent
}).sort_values('Missing %', ascending=False)

print(missing_summary)
         Missing Count  Missing %
Ozone               37      24.18
Solar.R              7       4.58
Wind                 0       0.00
Temp                 0       0.00
Month                0       0.00
Day                  0       0.00

We cannot perform imputation if the data is MNAR. Therefore, to disprove that a dataset is MNAR, a good starting point is to look for evidence for MCAR or MAR, as a dataset cannot exhibit two different types of missingness. We do not actually need to know which of these two cases apply since imputation works with both, we just need to be able to confidently claim that the missing values are not associated with unobserved values and thus will not affect inference.

To find evidence of MCAR or MAR, we can use the following framework to analyze patterns of missingness:

  1. Define the missingness target (the variable with missing values)

  2. Perform statistical tests (t-tests for numeric, chi-squared tests for categorical) - note: this is not always conclusive

  3. Find correlations between the target and other variables. If the correlation is above 0.2, this could be evidence of MAR and warrants further investigation

  4. Use domain knowledge to perform in-depth analysis to rule out the possibility of MNAR missingness. Note: it is impossible to definitively prove or disprove MNAR based on the data because by definition, we do not have data to analyze.

Null hypothesis: The means for each variable are the same for the in the missing/nonmissing Ozone groups

Alternative hypothesis: The means for each variable are different for the in the missing/nonmissing Ozone groups

from scipy.stats import ttest_ind

# Create indicator: 1 if Ozone is missing, else 0
df['Ozone_missing'] = df['Ozone'].isna().astype(int)

for col in ['Solar.R', 'Wind', 'Temp']:
    group1 = df.loc[df['Ozone_missing'] == 0, col].dropna() 
    group2 = df.loc[df['Ozone_missing'] == 1, col].dropna()  
    stat, p = ttest_ind(group1, group2, equal_var=False)
    
    mean1 = group1.mean()
    mean2 = group2.mean()
    
    print(f"{col}:")
    print(f"  Mean (Ozone observed) = {mean1:.2f}")
    print(f"  Mean (Ozone missing)  = {mean2:.2f}")
    print(f"  p-value = {p:.4f}\n")
Solar.R:
  Mean (Ozone observed) = 184.80
  Mean (Ozone missing)  = 189.51
  p-value = 0.7846

Wind:
  Mean (Ozone observed) = 9.86
  Mean (Ozone missing)  = 10.26
  p-value = 0.5446

Temp:
  Mean (Ozone observed) = 77.87
  Mean (Ozone missing)  = 77.92
  p-value = 0.9787

The pvalues are all large, meaning the means are the same for each variable regardless of whether ozone is missing.

Since Month is a categorical variable, we can perform a chi-squared test to check if the number of missing values depend on the month:

Null hypothesis: The number of missing Ozone values does not depend on the month Alternative hypothesis: The number of missing Ozone values does depend on the month in some way

from scipy.stats import chi2_contingency

# Create contingency table: Month × Missingness
contingency = pd.crosstab(df["Month"], df["Ozone_missing"])
print("Contingency Table (Month vs. Ozone Missingness):\n")
print(contingency)

# Perform Chi-square test of independence
chi2, p, dof, expected = chi2_contingency(contingency)

print("\nChi-square test results:")
print(f"Chi-square statistic = {chi2:.3f}")
print(f"Degrees of freedom    = {dof}")
print(f"p-value               = {p:.4f}")
Contingency Table (Month vs. Ozone Missingness):

Ozone_missing   0   1
Month                
5              26   5
6               9  21
7              26   5
8              26   5
9              29   1

Chi-square test results:
Chi-square statistic = 44.751
Degrees of freedom    = 4
p-value               = 0.0000

June has noticeably more missing Ozone values than the other months, so we confirm that this difference is statistically significant by observing a pvalue of 0. We can confirm this visually as well:

import matplotlib.pyplot as plt

# Plot number of missing ozone observations by month
missing_counts = (
    df[df["Ozone_missing"] == 1]["Month"]
    .value_counts()
    .sort_index()
)
missing_counts = (
    missing_counts.reindex([5, 6, 7, 8, 9], fill_value=0)
)
plt.figure(figsize=(6, 4))
plt.bar(
    missing_counts.index.astype(str),
    missing_counts.values,
    color="lightsteelblue",
    edgecolor="black"
)
plt.xlabel("Month (May = 5, ..., September = 9)")
plt.ylabel("Number of Missing Ozone Values")
plt.title("Missing Ozone Values by Month")
plt.grid(axis="y", linestyle="--", alpha=0.6)
plt.tight_layout()
plt.show()

Since the number of missing values of Ozone depends on observed data, we have determined that the missingness is MAR and we can continue with imputation.

We can follow the same process for analyzing the missingness of Solar.R.

df['Solar_missing'] = df['Solar.R'].isna().astype(int) #1 = Missing

#t test for numeric variables
for col in ['Ozone', 'Wind', 'Temp']:
    g1 = df.loc[df['Solar_missing']==0, col].dropna()
    g2 = df.loc[df['Solar_missing']==1, col].dropna()
    stat, p = ttest_ind(g1, g2, equal_var=False)
    print(col)
    print(f"Mean (Solar.R observed) = {g1.mean():.2f}")
    print(f"Mean: (Solar.R missing) = {g2.mean():.2f}")
    print(f"p-value = {p:.4f}\n")

#chi squared test for categorical variable (month)
cont = pd.crosstab(df['Month'], df['Solar_missing'])
chi2, p, dof, exp = chi2_contingency(cont)
print(f"Month vs Solar.R missingness: chi2={chi2:.2f}, p={p:.4f}")


# Bar chart of counts by month
missing_counts = (
    df[df["Solar_missing"] == 1]["Month"]
    .value_counts()
    .sort_index()
)
missing_counts = (
    missing_counts.reindex([5, 6, 7, 8, 9], fill_value=0)
)
plt.figure(figsize=(6, 4))
plt.bar(
    missing_counts.index.astype(str),
    missing_counts.values,
    color="lightsteelblue",
    edgecolor="black"
)
plt.xlabel("Month (May = 5, ..., September = 9)")
plt.ylabel("Number of Missing Solar.R Values")
plt.title("Missing Solar.R Values by Month")
plt.grid(axis="y", linestyle="--", alpha=0.6)
plt.tight_layout()
plt.show()
Ozone
Mean (Solar.R observed) = 42.10
Mean: (Solar.R missing) = 42.80
p-value = 0.9602

Wind
Mean (Solar.R observed) = 10.00
Mean: (Solar.R missing) = 9.00
p-value = 0.5343

Temp
Mean (Solar.R observed) = 78.12
Mean: (Solar.R missing) = 73.00
p-value = 0.3602

Month vs Solar.R missingness: chi2=11.14, p=0.0251

Similarly to Ozone, it appears that Solar.R is MAR since the missing values are related to the month during which the data was measured.

12.1.2.2 Important Notes

  • Consider MNAR if domain knowledge suggests it

  • Make sure to evaluate case-specific context to ensure imputation is appropriate. For example, using statistical imputation methods for the missing zip codes in the NYC Crash Data would not have made sense

  • Proportion of missing data alone should not dictate whether imputation is used or not, need to use best judgement

12.1.3 Benefits of Imputation

Impututation replaces missing values with estimated ones to create a complete dataset. This process offers a number of benefits for modeling and analysis.

12.1.3.1 Preserving information

If we were to perform complete case analysis, we would lose the valuable information provided by the rows that may be missing values for one or two variables. By making an educated guess as to what the value would have been, we can

  • Maintain real-world relationships observed among variables
  • Preserve sample size to build models with higher precision
  • Conduct robust analysis with messy data
  • Enables use of statistical methods and machine learning models

12.1.3.2 Reducing bias

If the data are MAR, deletion will bias the results of any statistical modeling because the missing cases are systematically different from observed cases. To ensure that our data is representative of what we are studying, we can use fill missing values using estimates based on relationships that exist in our observed data. Imputation preserves covariances and joint distributions, so the result is lower bias in model coefficients.

Note: not all imputation methods have the same effect on reducing bias. Multiple imputation tends to minimize bias more than single imputation methods when data is MAR.

12.1.4 Limitations of Imputation

Imputation does not completely solve issues caused by missing data, it only alleviates them. It is important to note the following limitations of imputation to better understand when it is appropriate to use:

  • Simple imputation methods distort distributions and can impact visualizations
  • Need to acknowledge that imputing missing data adds some uncertainty
  • Cannot correct cases where data are MNAR
  • Can be very computationally expensive depending on method of imputation and complexity of data

12.1.5 Methods of Single Imputation

Single Imputation: Replacing each missing value with a single, fixed estimate

Types of single imputation:

  1. Mean Imputation: Replacing missing values with the mean of the observed data
  2. Median Imputation: Replacing missing values with the median of observed data
  3. Mode Imputation: Replace missing values with the mode (for filling categorical data)
  4. Hot-Deck Imputation: Replace a missing value with an observed value (a “donor” from a similar case)

For each method, we will fit a regression model with Temp as the response variable and Ozone, Solar.R, Wind, and Month as predictors. We will find the value of the coefficients and their standard errors after applying each method of imputation to compare the results.

12.1.5.1 Mean Imputation

We will replace all missing Ozone and Solar.R values with the respective means from the observed data:

# Encode Month as categorical and create dummy variables
df["Month"] = df["Month"].astype("category")
df= pd.get_dummies(df, columns=["Month"], drop_first=True)
df = df.astype(float)

# Assign predictor and response variables
x_var = [
    "Ozone", "Solar.R", "Wind", "Month_6", "Month_7", "Month_8", "Month_9"]

# Apply mean imputation and fit the regression model
df_mean = df.copy()

for col in ["Ozone", "Solar.R"]:
    df_mean[col] = df_mean[col].fillna(df_mean[col].mean())

y = df_mean["Temp"]
X = sm.add_constant(df_mean[x_var])
model = sm.OLS(y, X).fit()
results = pd.DataFrame({
    "Coefficient": model.params.round(4),
    "Std_Error": model.bse.round(4)
})

print(results)
         Coefficient  Std_Error
const        64.6504     2.3474
Ozone         0.0953     0.0207
Solar.R       0.0153     0.0054
Wind         -0.3807     0.1512
Month_6      11.7914     1.4211
Month_7      13.9666     1.4895
Month_8      14.5699     1.5075
Month_9      10.5281     1.4189

Benefits:

  • Is sufficient if less than 5% of a particular variable is missing

Limitations:

  • Underestimates variances -> does not add any spread to the data, standard errors and confidence intervals are too small
  • Reduces natural variability
  • Performs poorly if data is not MCAR
  • Shrinks covariance -> coefficients are biased towards zero

12.1.5.2 Median Imputation

We can fit the same model, this time using median imputation:

df_median = df.copy()

# Apply median imputation and fit the model
for col in ["Ozone", "Solar.R"]:
    df_median[col] = df_median[col].fillna(df_median[col].median())

# 4) Define predictors and response
y = df_median["Temp"]
X = sm.add_constant(df_median[x_var])

# 5) Fit OLS regression
model = sm.OLS(y, X).fit()

# 6) Display coefficients and standard errors only
results = pd.DataFrame({
    "Coefficient": model.params.round(4),
    "Std_Error": model.bse.round(4)
})

print(results)
         Coefficient  Std_Error
const        64.6857     2.3324
Ozone         0.0965     0.0205
Solar.R       0.0150     0.0054
Wind         -0.3708     0.1509
Month_6      12.3827     1.4122
Month_7      14.0042     1.4848
Month_8      14.5675     1.5032
Month_9      10.4375     1.4178

Benefits:

  • More robust to outliers than mean imputation

Limitations:

  • Underestimates uncertainty
  • Flattens the distribution by clustering values at the center

12.1.5.3 Hot-Deck Imputation

Hot-Deck imputation chooses the value from a similar observation in the current dataset to fill the missing values with. The “donor” can be chosen randomly or conditionally. Here. we will fill missing values with similar observations from the same month, since earlier we found that Month was related to the number of missing values of both Ozone and Solar.R.

import numpy as np

df_hot = df.copy()
rng = np.random.default_rng(42)

def hot_deck_impute(df, target_col, month_dummies):
    # Reconstruct Month number from dummies (5–9)
    df = df.copy()
    df["Month_num"] = 5
    for i, col in enumerate(month_dummies, start=6):
        df.loc[df[col] == 1, "Month_num"] = i

    # Impute missing values within same month
    for i in df[df[target_col].isna()].index:
        month = df.at[i, "Month_num"]
        donor_pool = df.loc[
            (df["Month_num"] == month) & (~df[target_col].isna()), target_col
        ]
        if donor_pool.empty:
            donor_pool = df.loc[~df[target_col].isna(), target_col]
        df.at[i, target_col] = rng.choice(donor_pool.values)
    return df.drop(columns="Month_num")

# Apply to missing targets and fit model
month_dummies = ["Month_6", "Month_7", "Month_8", "Month_9"]
for col in ["Ozone", "Solar.R"]:
    df_hot = hot_deck_impute(df_hot, col, month_dummies)

y = df_hot["Temp"]
X = sm.add_constant(df_hot[x_var])
model = sm.OLS(y, X).fit()
results = pd.DataFrame({
    "Coefficient": model.params.round(4),
    "Std_Error": model.bse.round(4)
})
print(results)
         Coefficient  Std_Error
const        66.0892     2.2664
Ozone         0.0929     0.0194
Solar.R       0.0128     0.0053
Wind         -0.4199     0.1475
Month_6      12.5071     1.4301
Month_7      13.6540     1.5196
Month_8      13.9432     1.5414
Month_9      10.0670     1.4339

Benefits:

  • Preserves real data values -> borrowed from observed “donors”
  • Retains covariance structure, keeping relationships realistic
  • Works well in MAR cases

Limitations:

  • Requires enough donor candidates to avoid highly biased results
  • Can propagate outliers

12.1.6 Multiple Imputation

Instead of replacing the missing cases with a single guess, multiple imputation creates multiple guesses to incorporate some randomness that mimics real world uncertainty.

Multiple Imputation by Chained Equations (MICE): the most widely used implementation of multiple imputation

  1. Define one target variable and start with simple guesses (mean/median imputation) for missing cases of the non-target variable

  2. Fit a regression model for the target using all other variables as predictors

  3. Impute the missing values of the target variable with the estimate generated by the regression

  4. Repeat for all variables that have missing data, using your newly found imputations as training data

  5. Repeat the process m times using different random draws to create m complete datasets. For each cycle, the coefficients for the regression models are sampled from the posterior distribution - essentially, the coefficients are randomly selected from their respective distributions of their plausible values

  6. Fit the final model m times using the complete datasets and pool results using Rubin’s Rules to combine estimates and obtain valid inference

Rubin’s Rules: formulas for combining m regression results into one final set of estimates, correctly accounts for within-imputation variance and between-imputation variance.

Pooled Estimate (Mean of Coefficients)

\[ \bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i \]

  • \(Q_i\): estimate (regression coefficient) from imputed dataset i
  • \(\bar{Q}\): overall pooled estimate (average of all estimates)

Within-Imputation Variance: captures the uncertainty within each model

\[ \bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i \]

  • \(U_i\): estimated variance from dataset i

Between-Imputation Variance: captures uncertainty across imputations

\[ B = \frac{1}{m - 1} \sum_{i=1}^{m} (Q_i - \bar{Q})^2 \]


Total Variance (Rubin’s Combined Uncertainty): combines both within- and
between-imputation uncertainty

\[ T = \bar{U} + \left(1 + \frac{1}{m}\right) B \]


Pooled Standard Error: the final standard error that accounts for both
model uncertainty and missing-data uncertainty, useful for inference

\[ SE(\bar{Q}) = \sqrt{T} \]

12.1.6.1 Summary Table

Symbol Meaning Interpretation
\(Q_i\) Estimate from imputation i e.g., regression coefficient
\(U_i\) Variance from imputation i \(SE^2\) from that model
\(\bar{Q}\) Mean of all \(Q_i\) Final pooled estimate
\(\bar{U}\) Mean of all \(U_i\) Within-imputation variance
\(B\) Between-imputation variance Variability across datasets
\(T\) Total variance Combined uncertainty
\(SE(\bar{Q})\) Pooled standard error \(\sqrt{T}\), used in inference

Scikit-learn’s IterativeImputer class simplifies this process.

Example using the airquality data:

from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer

# Variables for imputation regression models
month_dummies = [c for c in df.columns if c.startswith("Month_")]
num_cols = ["Ozone", "Solar.R", "Wind", "Temp"] + month_dummies

# Variables for final regression omdel
x_vars = ["Ozone", "Solar.R", "Wind"] + month_dummies
y_var = "Temp"


# Number of imputations
M = 10

# To store results
coef_list, var_list = [], []

# Perform MICE 10 times
for m in range(M):
    imp = IterativeImputer(  # Performs chained equations 
        random_state=42 + m, # Different random seed every time
        sample_posterior=True, # Adds random noise
        max_iter=20 # Iterates thru models 20 times or until convergence
    )
    imputed = imp.fit_transform(df[num_cols])

    df_imp = df.copy()
    df_imp[num_cols] = imputed

    X = sm.add_constant(df_imp[x_vars])
    y = df_imp[y_var]
    model = sm.OLS(y, X).fit()

    coef_list.append(model.params)
    var_list.append(model.bse ** 2)

# Rubin’s Rules
coefs_df = pd.DataFrame(coef_list)
vars_df  = pd.DataFrame(var_list)

Q_bar = coefs_df.mean()
U_bar = vars_df.mean()
B     = coefs_df.var(ddof=1)
T_var = U_bar + (1 + 1/len(coef_list)) * B
SE    = np.sqrt(T_var)

pooled = pd.DataFrame({"MI Coefficient": Q_bar.round(4),
                       "MI Std_Error": SE.round(4)})
print(pooled)
         MI Coefficient  MI Std_Error
const           64.9348        2.2675
Ozone            0.0946        0.0211
Solar.R          0.0138        0.0054
Wind            -0.3213        0.1545
Month_6         11.2000        1.5068
Month_7         13.2032        1.5408
Month_8         13.8689        1.5867
Month_9          9.9381        1.4006

Notice that the standard errors for this method are larger than those from the single imputation methods. This is good -> we are correctly accounting for the increased uncertainty caused by missing data

12.1.7 Conclusion

  • Imputation is an excellent tool for producing valid inference from datasets with missing data

  • Before performing any type of imputation, you should get very familiar with your data and assess what type of missingness it exhibits

  • The model results generated from using MICE have less bias but increased variance

12.1.8 Further Readings

12.2 Web Scraping

This section was prepared by Sahil Patel, an undergraduate junior majoring in Computer Science, Statistical Data Science, and Economics.

12.2.1 An Introduction to Web Scraping

12.2.1.1 What is Web Scraping

Web scraping is a way to automatically gather information from websites. Instead of spending hours copying and pasting data, you can use scripts to collect large amounts of structured information quickly and consistently. Once you have the data, it can be analyzed, stored, or used for research and decision making.

Web scraping is particularly useful because most web data is created for humans to read, not for machines to process. By scraping, we can turn that information into a format that computers can work with, allowing us to explore trends, track changes, or gain insights that would be really hard to collect manually.

12.2.1.2 Use Cases of Web Scraping

Here are some of the use cases of web scarping:

  • Market Analysis
    • Companies use web scraping to track products, prices, and customer reviews, helping them understand market trends and stay competitive.
  • Finance and Investment
    • Analysts gather stock prices, financial news, and reports to make informed investment decisions or to analyze market sentiment.
  • Academic Research
    • Researchers collect articles, datasets, and public records to support studies and uncover insights across a variety of fields.
  • Social Media and Marketing
    • Marketers and analysts monitor trends, hashtags, and audience engagement to understand consumer behavior and improve campaigns.

12.2.1.3 Static versus Dynamic Web Pages

Before scraping, it is important to understand the kind of web pages you are working with:

  • Static Web Pages
    • The content is already in the HTML when the page loads. Everything you need is there, so extracting it is relatively straightforward.
  • Dynamic Web Pages
    • The content is generated by JavaScript or updated after the page loads. Fetching the HTML alone may not give you what you want. You may need to interact with the page to see the full content.

12.2.2 Basics of Web Requests

Before you can extract any data from a website, you need to first fetch the content of the page. In Python, this is typically done using the requests library, which allows you to send HTTP requests and receive responses. Understanding how these requests work is key to effective web scraping.

If you haven’t done so already, you will need install requests to your terminal.

pip install requests

12.2.2.1 HTML Basics

To scrape web pages effectively, it is important to understand the basic structure of HTML, which is the language used to create web pages. Web pages are made up of elements enclosed in tags. Some common tags include:

  • <body>: this tag identifies the main body of the website, which contains the content that is visible to users.

  • <table>: This tag identifies a table element, which is used to organize data in rows and columns.

  • <tbody>: This tag identifies the body of the table, which contains all the rows of the table.

  • <tr>: This tag identifies a table row, which contains individual cells (<td>) with data.

12.2.2.2 Response Status Codes

Every HTTP request returns a status code that indicates whether it was successful.

  • 200 means the request was successful, and the server returned the requested content.
  • 404 means the page was not found, which usually indicates the URL is incorrect.
  • 403 means access is forbidden, which often happens when a server blocks automated requests.
  • 500 indicates a server error on the website.

Checking these codes allows you to handle errors gracefully and avoid scraping pages that are unavailable or blocked.

12.2.2.3 HTTP Methods

The two most common HTTP methods used in web scraping are GET and POST.

  • GET requests are used to retrieve data from a web server. This is the method you will use most often because you are usually trying to download the content of a page.
  • POST requests are used to send data to a server, often when submitting forms or interacting with a website. Some websites require POST requests to access certain content or search results.

12.2.2.4 Using Headers and User-Agent Strings

Web servers sometimes treat automated requests differently from requests made by real browsers. By adding headers, you can make your requests look more like they are coming from a normal user.

A common header to include is the User-Agent, which identifies the browser and device making the request. Here is an example of what a header can look like:

import requests

url = 'https://uconn.edu'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
}

response = requests.get(url, headers=headers)
print(response.status_code)  # prints the HTTP status code
print(response.text[:500])   # prints the first 500 characters of the HTML
200
<!DOCTYPE html>
<html lang="en-US" class="no-js">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <link rel="profile" href="http://gmpg.org/xfn/11" />
    <link rel="pingback" href="https://uconn.edu/xmlrpc.php" />
    <title>University of Connecticut</title>
        <script type="application/ld+json">
        {
            "@context": "http://schema.org",
            "@type

12.2.3 An Introduction to BeautifulSoup

BeautifulSoup is one of the most popular Python libraries for web scraping. It helps you collect and organize information from web pages written in HTML or XML. When you look at a web page’s source code, it often looks like a confusing block of text. BeautifulSoup turns that chaos into a clean, organized structure that you can easily search and navigate. It is a reliable tool for collecting data from static websites such as news articles, course lists, or research archives.

12.2.3.1 Installing and Importing BeautifulSoup

To use BeautifulSoup, you first need to install it.

pip install beautifulsoup4

Once installed, import it into your Python script:

from bs4 import BeautifulSoup

BeautifulSoup works well alongside the requests library, creating a simple and efficient way to download and process web pages.

12.2.3.2 Loading HTML into BeautifulSoup

Once you have downloaded the HTML content of a web page, the next step is to load it into BeautifulSoup.

url = "https://decisiondrivers.com/nyc/zip-code/"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers)
print("Status code:", response.status_code)

soup = BeautifulSoup(response.text, "html.parser")
Status code: 200

The requests.get() function fetches the page, and BeautifulSoup() parses it into a structure that can be easily searched. The "html.parser" argument tells BeautifulSoup how to interpret the HTML. The resulting soup object becomes your workspace for exploring and extracting information. You can now search for tags like <table>, <tr>, or <td> to extract the data you need.

Since the status code prints 200, that means the request worked successfully and the page’s content is ready to scrape.

12.2.3.4 Extracting Data

Before writing any scraping code, it’s helpful to understand where the data actually sits within the webpage. You can right-click anywhere on the site and choose Inspect to open your browser’s developer tools. This reveals the HTML structure of the page, where information is organized using tags such as <table>, <tr>, and <td>. These tags define how the data is arranged, helping you identify exactly which elements you’ll need to target with BeautifulSoup.

For example, on the [Decision Drivers NYC ZIP Codes] (https://decisiondrivers.com/nyc/zip-code/) page, the ZIP code information is not stored inside a table but listed under each borough heading. Every borough name appears as a heading tag, such as <h2>, followed by a list of ZIP codes written in plain text or inside <ul> and <li> tags. Instead of looping through table rows, you can search for each heading, record the borough name, and then capture the ZIP codes that appear beneath it. This structure works well with BeautifulSoup because it allows you to identify the borough from the heading and pair it with its corresponding ZIP codes by reading the text that follows.

import pandas as pd

boroughs = []
zips = []

# Split the entire text into lines, then look for boroughs and ZIPs
lines = soup.get_text("\n", strip=True).splitlines()
current_borough = None

for line in lines:
    text = line.strip()
    # If the line is a borough name, update the current borough
    if text.upper() in ["MANHATTAN", "BROOKLYN", "QUEENS", "BRONX", 
    "STATEN ISLAND"]:
        current_borough = text.title()
        continue

    # If the line is a 5-digit number, it’s a ZIP code
    if current_borough and text.isdigit() and len(text) == 5:
        boroughs.append(current_borough)
        zips.append(text)

# Create a clean DataFrame
df = pd.DataFrame({"Borough": boroughs, "ZIP": zips})
df = df.drop_duplicates().sort_values(["Borough", "ZIP"]).reset_index(drop=True)

# Display the first few rows
print(df.head(10))
print("\nTotal ZIP codes found:", len(df))
  Borough    ZIP
0   Bronx  10451
1   Bronx  10452
2   Bronx  10453
3   Bronx  10454
4   Bronx  10455
5   Bronx  10456
6   Bronx  10457
7   Bronx  10458
8   Bronx  10459
9   Bronx  10460

Total ZIP codes found: 179

12.2.3.5 Extended Example

This is an example that will be useful for our midterm assignment. Here, we scrape the addresses of all the precincts in NYC.

url = "https://www.nyc.gov/site/nypd/bureaus/patrol/precincts-landing.page"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers)
print("Status code:", response.status_code)

soup = BeautifulSoup(response.text, "html.parser")

# Step 1: Locate the table containing the precinct information
table = soup.find("table", class_="rt")

# Step 2: If the table is found, extract all precinct rows
if table:
    rows = table.find_all("tr")
else:
    print("No table found on this page.")
    rows = []

# Step 3: Prepare a list to store precinct data as dictionaries
precinct_data = []

# Step 4: Loop through each row and collect precinct name and address
for row in rows:
    precinct_cell = row.find("td", attrs={"data-label": "Precinct"})
    address_cell = row.find("td", attrs={"data-label": "Address"})

    if precinct_cell and address_cell:
        precinct_name = precinct_cell.text.strip()
        address = address_cell.text.strip()
        precinct_data.append({
            "Precinct": precinct_name,
            "Address": address
        })

precincts_df = pd.DataFrame(precinct_data)

# Step 6: Display a quick summary
print(f"Extracted {len(precincts_df)} precincts successfully.")
print(precincts_df.head())
Status code: 200
Extracted 78 precincts successfully.
       Precinct              Address
0  1st Precinct    16 Ericsson Place
1  5th Precinct  19 Elizabeth Street
2  6th Precinct   233 West 10 Street
3  7th Precinct   19 1/2 Pitt Street
4  9th Precinct    321 East 5 Street

12.2.4 An Introduction to Selenium

While BeautifulSoup works best for static websites, not all web pages are created equal. Many modern sites use JavaScript to load content after the initial HTML has already been sent to the browser. This means that if you use requests and BeautifulSoup, you might get an empty page or partial data. This is where you use Selenium.

Selenium is a powerful browser automation tool that lets you control an actual web browser, like Chrome or Firefox, through Python. It can click buttons, fill out forms, scroll through pages, and load JavaScript rendered content before scraping it. You can then pass the fully loaded page to BeautifulSoup for parsing.

12.2.4.1 Installing and Setting Up Selenium

To use Selenium, you first need to install it.

pip install selenium

Selenium also requires a WebDriver, which acts as a bridge between Python and your browser. If you are using Google Chrome, download ChromeDriver and make sure it is installed correctly. You can verify this by typing the following command in your terminal:

chromedriver --version

If you are on macOS and see the error “chromedriver cannot be opened because the developer cannot be verified,” you can fix it by running this command:

xattr -d com.apple.quarantine $(which chromedriver)

After this, your WebDriver should work normally.

Now we can work on the code itself. This is how we set up Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import os

Before launching Chrome, it’s common to set a few options that control how it behaves. For scraping, “headless” mode is useful because it runs Chrome without opening a window. It is faster and less distracting.

# Create Chrome options to control browser behavior
options = Options()

# Run Chrome invisibly (no window)
options.add_argument("--headless")

# Disable GPU acceleration for smoother headless operation
options.add_argument("--disable-gpu")

# Helps avoid permission issues in some environments
options.add_argument("--no-sandbox")

Instead of typing out the full path manually, this function looks through your system’s PATH to find where chromedriver is installed.

def find_chromedriver():

    # Loop through all directories listed in the system PATH
    for path in os.environ["PATH"].split(os.pathsep):

        # Build the potential path to chromedriver
        driver_path = os.path.join(path, "chromedriver")

        # Check if it exists and is executable
        if os.path.isfile(driver_path) and os.access(driver_path, os.X_OK):

            return driver_path  # Return the path if found

    return None  # Return None if not found

# Run the function to locate chromedriver
chromedriver_path = find_chromedriver()

# If found, print its path
if chromedriver_path:
    print("Found ChromeDriver at:", chromedriver_path)

# If not found, raise an error with instructions
else:
    raise FileNotFoundError("ChromeDriver not found.")
Found ChromeDriver at: /usr/local/bin/chromedriver

Once you have the driver path and your browser settings, you can start Chrome.

# Import the Service class and connect the driver
service = Service(chromedriver_path)

# Start the Chrome WebDriver with the defined options
driver = webdriver.Chrome(service=service, options=options)

At this point, Selenium is ready. You can visit a webpage by calling:

driver.get("https://example.com")

12.2.4.2 Using Selenium to Load Dynamic Pages

Here is a simple example that shows how Selenium can open a website, wait for it to load, and then pass the content to BeautifulSoup.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

def scrape_schedule(url, league):  # league = "NFL" or "MLB"
    league = league.upper()
    # Identify the correct team link pattern based on the league
    team_href = "/nfl/team/_/name" if league == "NFL" else "/mlb/team/_/name"

    # Configure and launch Chrome
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1920,1080")
    driver = webdriver.Chrome(options=options)

    # Load the ESPN schedule page
    driver.get(url)
    wait = WebDriverWait(driver, 25)

    # Wait until key elements are visible (date headers and team links)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 
    "div.Table__Title")))
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 
    f"a.AnchorLink[href*='{team_href}']")))

    rows_out = []

    # Extract data by date section
    # Each schedule day starts with a <div class="Table__Title"> element
    for title in driver.find_elements(By.CSS_SELECTOR, "div.Table__Title"):
        date_text = title.text.strip()
        if not date_text:
            continue  # Skip if the date header is empty

        # The table directly after the title contains the matchups
        try:
            table = title.find_element(By.XPATH, "following::table[1]")
        except:
            continue  # Skip if structure differs

        # Extract teams for each game
        for row in table.find_elements(By.XPATH, 
        ".//tr[contains(@class,'Table__TR')]"):
            teams = [
                a.text.strip()
                for a in row.find_elements(By.CSS_SELECTOR, 
                f"td a.AnchorLink[href*='{team_href}']")
                if a.text.strip()
            ]
            if len(teams) >= 2:
                # Format as "Away @ Home"
                rows_out.append({"Date": date_text, "Matchup": f"{teams[0]} @ {
                    teams[1]}"})

    # --- Step 5: Clean up and close browser ---
    driver.quit()

    # Build a clean DataFrame
    df = pd.DataFrame(rows_out, columns=["Date", "Matchup"])
    df["League"] = league
    return df


# --- Step 6: Run for both leagues ---
nfl_df = scrape_schedule("https://www.espn.com/nfl/schedule", "NFL")
mlb_df = scrape_schedule("https://www.espn.com/mlb/schedule", "MLB")

print("===== NFL =====")
print(nfl_df)
print("\n===== MLB =====")
print(mlb_df)
===== NFL =====
                           Date                     Matchup League
0   Thursday, November 20, 2025           Buffalo @ Houston    NFL
1     Sunday, November 23, 2025        Pittsburgh @ Chicago    NFL
2     Sunday, November 23, 2025    New England @ Cincinnati    NFL
3     Sunday, November 23, 2025          New York @ Detroit    NFL
4     Sunday, November 23, 2025       Minnesota @ Green Bay    NFL
5     Sunday, November 23, 2025         Seattle @ Tennessee    NFL
6     Sunday, November 23, 2025  Indianapolis @ Kansas City    NFL
7     Sunday, November 23, 2025        New York @ Baltimore    NFL
8     Sunday, November 23, 2025       Cleveland @ Las Vegas    NFL
9     Sunday, November 23, 2025      Jacksonville @ Arizona    NFL
10    Sunday, November 23, 2025       Philadelphia @ Dallas    NFL
11    Sunday, November 23, 2025       Atlanta @ New Orleans    NFL
12    Sunday, November 23, 2025     Tampa Bay @ Los Angeles    NFL
13    Monday, November 24, 2025    Carolina @ San Francisco    NFL

===== MLB =====
                         Date                Matchup League
0  Saturday, November 1, 2025  Los Angeles @ Toronto    MLB

Here is how the code works:

  1. Setting Up Selenium
  • The first part initializes the Chrome browser in headless mode so it runs quietly in the background.
  1. Waiting for the Page to Load
  • Selenium doesn’t automatically know when a website’s content is ready. Using WebDriverWait and expected_conditions, the script pauses until:
    • A date header (div.Table__Title) appears
    • A team link with a specific pattern (/nfl/team/_/name or /mlb/team/_/name) exists
  1. Looping Through Each Day’s Games
  • Each date on the ESPN page appears as a header (Table__Title), followed by a table of games. The script finds each table using the XPath rule "following::table[1]", meaning the first table that comes after the header.
  1. Extracting Teams from Each Row
  • Inside each table, every row (tr) represents one game. The team names are stored in <a> tags with URLs containing the league’s team name pattern. Selenium grabs these links and combines them into "Away @ Home" format.
  1. Creating a DataFrame
  • Once all data is collected, it’s turned into a Pandas DataFrame with columns:
    • Date: the game date from the header
    • Matchup: formatted team pairing
    • League: either NFL or MLB
  1. Closing the Browser
  • Always close the browser using driver.quit() to free up system resources.

12.2.5 Further Reading

  1. Python Web Scraping Tutorials
  2. Python Web Scraping: Full Tutorial With Examples
  3. Ultimate Guide to Web Scraping with Python
  4. BeautifulSoup Documentation
  5. Webscraping With Selenium

12.3 Uniform Manifold Approximation and Projection

This section was prepared by Mark Zhang, an undergraduate at UCONN.

12.3.1 Introduction

Uniform Manifold Approximation and Projection (UMAP) was introduced by Leland McInnes, John Healy, and James Melville in 2018 (mcinnes2020umapuniformmanifoldapproximation?). UMAP is a powerful dimensionality reduction technique designed to visualize high-dimensional data in a low-dimensional space, such as 2D or 3D. Unlike linear methods like PCA, UMAP captures nonlinear structures, preserving local clusters while also maintaining the overall global patterns of the data. UMAP is utilized for clustering, data visualization, and data exploration.

12.3.2 Other Dimensionality Reduction Techniques

12.3.2.1 Principle Component Analysis (PCA)

  • linear technique; projects data along directions of maximum variance.
  • Preserves global structure fairly well but may lose local neighborhood. details.
  • good for visualization of 2-3 PCs but cannot capture nonlinear structures.

12.3.2.2 t-distributed Stochastic Neighbor Embedding (t-SNE)

  • finds distances between pairs of points, maps onto t dist.
  • nonlinear technique; preserves local structure very well.
  • can distort global relationships.
  • often slower than UMAP for large datasets.

12.3.3 Algorithm Approach and Intuition

UMAP is a graph-based dimensionality reduction algorithm rooted in algebraic topology. It represents data as a fuzzy graph and finds a low-dimensional embedding that preserves both local neighborhoods and global structure.

12.3.3.1 Mathematical Foundations

  1. Riemannian Manifold - UMAP assumes the data lies on a Riemannian manifold, a smoothly curved surface embedded in high-dimensional space.

  2. Simplicial Sets - From algebraic topology, a simplex is the simplest geometric building block (a point, line, triangle, etc.), and a simplicial complex describes how these pieces connect to form a space. Geometrically a k-simplex is a very simple way to build a k-dimensional object.

12.3.3.2 Learning the manifold structure

UMAP builds a graph as a discrete approximation of the manifold’s structure. To aid in visualizing the concepts, I will show how they apply to a noisy sin wave data set.

Test data set of a noisy sin wave

How do we represent this data as a manifold? UMAP uses the concept of simplicial sets and topology and adapts them to fuzzy simplicial sets and metric space.

12.3.3.2.1 Finding nearest neighbors

For each point we find the nearest neighbors. The nneighbors hyperparameter controls how local or global we want our structure to be. A large amount of neighbors represent the manifold more accurately as a whole and a smaller number would capture fine details of the structure.

Fuzzy circles to 4th nearest neighbor
12.3.3.2.2 Constructing a neighborhood graph

UMAP represents high-dimensional data as a fuzzy simplicial set, capturing the local connectivity of points. For each point, a fuzzy membership weight is assigned to its neighbors: \[ w_{ij} = \exp\Bigg(-\frac{d_{ij} - \rho_i}{\sigma_i}\Bigg), \]

where:

  • \(d_{ij}\) is the distance between points \(i\) and \(j\),
  • \(\rho_i\) is the distance to the closest neighbor of \(i\),
  • \(\sigma_i\) is a local scaling factor.

In other words, each point adjusts its influence based on the local density of its neighborhood. Dense regions, where points are very close together, have smaller \(\sigma_i\), which reduces the weight of individual connections. Sparse regions, with points farther apart, have larger \(\sigma_i\). , increasing the weight of each connection.

Since \(w_{i \to j}\) and \(w_{j \to i}\) can differ, UMAP combines them using a probabilistic union:

\[ w_{ij}^{(final)} = w_{i \to j} + w_{j \to i} - w_{i \to j} \cdot w_{j \to i}. \]

This produces a symmetric weight between 0 and 1 that preserves local topology, balances contributions across dense and sparse regions, and yields a uniform manifold representation.

Fuzzy weighted graph with UMAP-style combined weights

12.3.3.3 Finding a low-dimensional representation

After building the fuzzy simplicial set (the weighted neighborhood graph), UMAP tries to embed the points in a low-dimensional space (e.g., 2D or 3D) while preserving the relationships encoded in the graph. This is done by optimizing a cost function.

12.3.3.3.1 Minimum Distance

The min_dist parameter controls how close points can be in the low-dimensional embedding:

  • Small min_dist - points can be closer together, resulting in dense clusters that preserve fine local structure.
  • Large min_dist - points are forced farther apart, producing spread-out clusters that highlight global relationships.

In other words, min_dist adjusts the tightness of clusters in the embedding without changing the underlying topology of the data.

12.3.3.3.2 Minimizing the cost function (Cross-Entropy)

After constructing the fuzzy simplicial set and initializing the low-dimensional embedding, UMAP positions the points by minimizing a cross-entropy-like cost function:

To compute the similarity between points in the low-dimensional embedding, UMAP defines a corresponding low-dimensional weight for each pair of points:

\[ w_{ij}^{(\text{low})} = \frac{1}{1 + a \|y_i - y_j\|^{2b}}, \]

where:

  • \(\|y_i - y_j\|\) is the Euclidean distance between points \(y_i\) and \(y_j\) in the low-dimensional space (2D or 3D),
  • \(a\) and \(b\) are hyperparameters that control the shape of the attraction/repulsion curve.

The cross-entropy cost function is then defined as: \[ C = \sum_{i \neq j} \Big[ w_{ij} \log \frac{1}{w_{ij}^{(\text{low})}} + (1 - w_{ij}) \log \frac{1}{1 - w_{ij}^{(\text{low})}} \Big], \] where: - \(w_{ij}\) = high-dimensional fuzzy weight between points \(i\) and \(j\), - \(w_{ij}^{(\text{low})}\) = corresponding low-dimensional embedding weight.

Intuition:

  1. If two points are strongly connected in high dimensions (large \(w_{ij}\)), the embedding tries to place them close together.
  2. If two points are weakly connected (small \(w_{ij}\)), the embedding tries to place them far apart.

UMAP uses stochastic gradient descent to minimize this cost function, adjusting the coordinates of points in low dimensions so that the embedding preserves both local neighborhoods and global manifold structure.

12.3.4 Practical Example: UMAP on 28x28 digit data

First, install the necessary dependency:

pip install umap-learn

Before applying UMAP, let’s look at the raw data. The MNIST dataset from scikit-learn contains 70,000 images of handwritten digits, each represented as a 28×28 grayscale image (784 features per sample).

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml

# Load MNIST
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data[:2000]
y = mnist.target[:2000].astype(int)

# Reshape to 28×28 images
images = X.reshape(-1, 28, 28)

# Show first 12 digits in a 3×4 grid
fig, axes = plt.subplots(3, 4, figsize=(6, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(images[i], cmap='gray')
    ax.set_title(y[i])
    ax.axis('off')

plt.tight_layout()
plt.show()

Handwritten digit images (28×28) from MNIST dataset

We now project the 784-dimensional data into a 2D embedding using UMAP. The parameters below control the embedding:

  • n_neighbors: how many nearby points are used to estimate local structure.

  • min_dist: how tightly UMAP packs points in the low-dimensional space.

from umap.umap_ import UMAP
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml

# Load MNIST
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data[:2000]
y = mnist.target[:2000].astype(int)

# 2D UMAP projection
reducer = UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
X_umap = reducer.fit_transform(X)

# 2D scatter plot
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    X_umap[:, 0], X_umap[:, 1],
    c=y, cmap='tab10', s=5
)

plt.title("2D UMAP projection of MNIST (28×28)")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.colorbar(scatter, label='Digit')
plt.show()

2D UMAP of MNIST dataset

In the 2D UMAP embedding of the digits dataset, each point represents a digit image, colored by its label. The plot shows that UMAP successfully captures the structure of high-dimensional data: points of the same digit cluster tightly, while visually similar digits appear near each other, preserving meaningful relationships.

  • Same digits - form tight clusters, showing that UMAP preserves local neighborhood structure and keeps similar points together.
  • Similar digits (e.g., 3 & 8) - positioned near each other, reflecting their visual similarity in shape and style.
  • Different clusters - mostly separate, indicating clear class separation and that UMAP maintains distinctions between digit types.
  • Global layout - reflects the relative positions of all clusters , preserving overall structure of the dataset.

Takeaway: UMAP reveals both local and global patterns, making high-dimensional data interpretable in 2D.

12.3.5 Conclusion

UMAP is a nonlinear dimensionality reduction technique that effectively captures both local and global structure in high-dimensional data. By constructing a fuzzy neighborhood graph and optimizing a low-dimensional embedding, it preserves cluster relationships, highlights visually similar points, and maintains overall manifold structure. This makes UMAP a powerful, interpretable tool for visualization and exploratory data analysis.

12.3.6 Further Readings

12.4 How to call R from Python

This section was prepared by Wilson Tang, a undergraduate junior as of fall 2025 pursuing a single degree in statistical data science.

12.4.1 Motivations

R and Python are both languages we have been exposed to the most at UConn. I’ve had moments using one where I think about how the other might be useful, and so one might want to incorporate both languages. I want to show how we can properly incorporate R and Python together and when each language may be better.

Data science workflow:

Understanding the data science workflow will help us decide if we want to use R and Python together, or just individually.

  • Identify problem
  • Gather data
  • Clean & preprocess data
  • Data exploration
  • Feature engineering / external data
  • Statistical testing
  • Modeling & machine learning
  • Evaluation
  • Visualization

12.4.2 Corresponding packages

For most tasks, we can find corresponding packages in both Python and R that can generally accomplish the same tasks. Of course, do mind exceptions can occur for very specific tasks and you should make decisions based on the scope of your work.

  • Cleaning / Preprocessing Data
    • R: dplyr
    • Python: pandas
  • Exploratory Data Analysis
    • R: built-in functions, dplyr, ggplot2
    • Python: pandas, matplotlib, seaborn, plotnine
  • Feature Engineering
    • R: recipes (tidymodels), forcats, lubridate
    • Python: sklearn, pandas
  • Evaluation
    • R: yardstick, caret
    • Python: sklearn, yellowbrick
  • Visualization
    • R: ggplot2
    • Python: plotnine, matplotlib, seaborn

12.4.3 R / Python advantages

R:

  • Statistical testing: t-test, ANOVA, chi-squared, Shapiro-wilk are common tests built into R. Durbin Test and a fisher’s test that can work for any r x c contingency table.
  • Econometrics, time-series and biostatistics.

Python:

  • Being able to handle bigger data
  • Machine learning and AI
    • sklearn for machine learning
    • Tensor flow / Keras / Pytorch for deep learning
    • XGBoost / LightGBM / CatBoost for gradient boosting
    • Pipelines
  • Versatile data handling
    • APIs - Better for large scale / complex integration (‘requests’, ‘httpx’, ‘aiohttp’)
    • Web Scraping - Better for large scale / dynamic web scraping (‘Beautiful’, ‘Scrapy’, ‘Selenium’)
    • Databases - SQL, NoSQL and big platforms
    • Cloud data - R has limited options for this and sometimes rely on Python libraries
    • Real-time data - R can’t really do this

12.4.4 Setup both R and Python

Make sure you have your own python:

Windows comes with python, however we want to install our own because it will prevent errors in the future steps. (Microsoft sucks)

Install python

Download R and Install R extension:

Install R

Find the Extensions tab on the left toolbar, search R language and install the R extension for VS code. And you can use Ctrl+Shift+P and type R: Create R terminal.

Calling R in Python:

We will use the package rpy2 to call R from Python, there are also other packages that can call R, check Additional resources.

Calling R from Python is a Python first method, and is generally recommended if the user is going to be using more python. However, one can also call Python from R as well. Check Additional resources for packages in R that allows us to call Python from R.

Reproducibility:

Since we are trying to use 2 languages at once, there will be packages from both R and Python. Of course, we would never want to compromise on reproducibility so what should we do? The best method would be to use Conda.

A single Conda environment can specify both versions of Python as well as R, while also allowing us to keep the packages of both in one place.

12.4.5 Example

Prerequisites:

In order to render these classnotes you will need to have a Python virtual environment with packages from requirements.txt and the package gamlss needs to be installed into your R global environment.

GAMLSS or Generalized Additive Models for Location, Scale, and Shape. It allows all parameters of the response distribution to depend on explanatory variables and thus each parameter can be modeled as a function of predictors. This model is useful when variability changes with predictors, skewness / kurtosis change with predictors.

Load Python libraries:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load rpy2 to call R interface:

from rpy2 import robjects
from rpy2.robjects import pandas2ri, Formula
import rpy2.robjects.packages as rpackages
from rpy2.robjects.conversion import localconverter
Error importing in API mode: ImportError("dlopen(/Users/junyan/work/teaching/ids-f25/ids-f25/.ids-f25-py312/lib/python3.12/site-packages/_rinterface_cffi_api.abi3.so, 0x0002): symbol not found in flat namespace '_R_BaseEnv'")
Trying to import in ABI mode.
Loading custom .RprofileLoading custom .Rprofile

Call gamlss from rpy2, rpy2 will look for the R global environment is the default location:

gamlss = rpackages.importr('gamlss')

Load in penguins from seaborn and prepare Data:

penguins = sns.load_dataset("penguins").dropna()

penguins = penguins.rename(columns={
    "body_mass_g": "mass",
    "bill_length_mm": "bill",
    "flipper_length_mm": "flipper"
})

penguins = penguins[["mass", "bill", "flipper", "species", "sex"]]

Send data frame from pandas to R and have it be converted to a R data frame using rpy2:

with localconverter(robjects.default_converter + pandas2ri.converter):
    r_df = robjects.conversion.py2rpy(penguins)

Here we are saying that mass depends on flipper, bill, species, sex and the variability of mass depends on flipper. Run gamlss from the R data frame to extract whatever we need from gamlss, in this case fitted \(\sigma\):

formula_mu = Formula("mass ~ flipper + bill + species + sex")
formula_sigma = Formula("~ flipper")

model = gamlss.gamlss(formula_mu, sigma_formula=formula_sigma,
                      data=r_df, family="NO")

fitted_sigma = robjects.r("fitted.values")(model, "sigma")
GAMLSS-RS iteration 1: Global Deviance = 4718.449 
GAMLSS-RS iteration 2: Global Deviance = 4718.415 
GAMLSS-RS iteration 3: Global Deviance = 4718.415 

Send fitted values or anything extracted from the previous part:

with localconverter(robjects.default_converter + pandas2ri.converter):
    penguins["fitted_sigma"] = robjects.conversion.rpy2py(fitted_sigma)

Use extracted values for anything of interest in Python, in this case let’s say for some reason we wanted to use matplotlib and seaborn to plot our fitted \(\sigma\):

plt.figure(figsize=(8,6))
sns.scatterplot(x="flipper", y="fitted_sigma", hue="species", data=penguins, 
                alpha=0.6)
sns.lineplot(x="flipper", y="fitted_sigma", hue="species", data=penguins, lw=2)
plt.title("GAMLSS: Predicted Variability vs Flipper Length")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Predicted Body Mass Variability (grams)")
plt.legend(title="Species")
plt.show()

We see here that as flipper length increases, predicted body mass variability tends to decrease.

12.4.6 Additional resources