8  Statistical Tests and Models

8.1 Tests for Exploratory Data Analysis

A collection of test functions is available from scipy.stats; a quick example follows the list below.

  • Comparing the locations of two samples
    • ttest_ind: t-test for two independent samples
    • ttest_rel: t-test for paired samples
    • ranksums: Wilcoxon rank-sum test for two independent samples
    • wilcoxon: Wilcoxon signed-rank test for paired samples
  • Comparing the locations of multiple samples
    • f_oneway: one-way ANOVA
    • kruskal: Kruskal-Wallis H-test
  • Tests for associations in contingency tables
    • chi2_contingency: Chi-square test of independence of variables
    • fisher_exact: Fisher exact test on a 2x2 contingency table
  • Goodness of fit
    • goodness_of_fit: general goodness-of-fit test; the hypothesized distribution may contain unspecified parameters that are fitted
    • anderson: Anderson-Darling test
    • kstest: Kolmogorov-Smirnov test
    • chisquare: one-way chi-square test
    • normaltest: test for normality
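
As a quick illustration, here is a minimal sketch of two of the tests listed above, using simulated data (an addition, not part of the original notes):

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
s1 = rng.normal(loc=0.0, scale=1.0, size=50)  ## sample 1
s2 = rng.normal(loc=0.5, scale=1.0, size=50)  ## sample 2, shifted location

print(scipy.stats.ttest_ind(s1, s2))  ## parametric two-sample t-test
print(scipy.stats.ranksums(s1, s2))   ## nonparametric rank-sum alternative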

Since R has a richer collection of statistical functions, we can call R functions from Python with rpy2. See, for example, a blog on this subject.

For example, fisher_exact can only handle 2x2 contingency tables. For contingency tables larger than 2x2, we can call fisher.test() from R through rpy2. See this StackOverflow post. Note that the . in R function names and arguments is replaced with _ on the Python side.

import pandas as pd
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()

stats = importr('stats')

jan23 = pd.read_csv("data/nyc_crashes_202301.csv")
jan23["injury"] = np.where(jan23["NUMBER OF PERSONS INJURED"] > 0, 1, 0)
m = pd.crosstab(jan23["injury"], jan23["BOROUGH"])
print(m)

res = stats.fisher_test(m.to_numpy(), simulate_p_value = True)
res[0][0]
Loading custom .Rprofile
BOROUGH  BRONX  BROOKLYN  MANHATTAN  QUEENS  STATEN ISLAND
injury                                                    
0          526       999        528     801            126
1          295       654        283     515             69
0.09995002498750624

8.2 Statistical Functions from scipy.stats

Colin Sullivan

The scipy.stats module is a SciPy sub-package for probability distributions and general statistical operations. SciPy, the parent package, is one of the largest and best-known packages available for Python, covering many common science and engineering problems.

First, you will need to install scipy on your computer. To do so, use one of the following commands:

python -m pip install scipy

or

conda install scipy

depending on which tool you prefer. Then import scipy.stats at the beginning of each Python file, like so:

from scipy import stats
import numpy as np

There are three distribution base classes in scipy.stats:

  • rv_continuous, for continuous random variables

  • rv_discrete, for discrete random variables

  • rv_histogram, for building a distribution from a histogram of data

8.2.1 Discrete Variables

First, let's talk about discrete random variables. The default initialization parameters are as follows:

stats.rv_discrete(a=0, b=None, name=None, badvalue=None, moment_tol=1e-08, values=None, inc=1, longname=None, shapes=None, extradoc=None, seed=None)
<scipy.stats._distn_infrastructure.rv_discrete at 0x117d9ea90>
  • a, b are floats representing the lower and upper bound of the distribution, respectively.

  • name represents the name of the distribution, which is optional.

  • badvalue is used to indicate a value for which some argument restriction is violated.

  • moment_tol is the tolerance for calculations of moments.

  • values is a tuple of arrays of equal length, one giving the integer support points and one giving their probabilities (see the short example after this list).

  • inc is the increment for the support of the distribution.

  • longname is used to provide a line for the docstring if you are creating a class without one.

  • shapes is a string naming the shape parameters of the distribution (e.g. "m, n"). If not provided, the shape parameters are inferred from the signatures of the private methods such as _pmf and _cdf.

  • extradoc is the last line for the docstring given in longname.

  • seed represents a random seed. If none is provided, scipy.stats’ default random seed will be used.
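
Before turning to subclasses, note that a finite discrete distribution can often be specified directly through the values argument. A minimal sketch (an addition, not from the original notes):

## a fair four-sided die specified through `values`
die4 = stats.rv_discrete(name="die4", values=([1, 2, 3, 4], [0.25, 0.25, 0.25, 0.25]))
print(die4.pmf(2))       ## 0.25
print(die4.mean())       ## 2.5
print(die4.rvs(size=5))  ## five simulated rolls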

You can build your own discrete variables by defining their pdf and cdf using classes. An example is shown below for a 4 sided die:

class four_die(stats.rv_discrete):
    def _pmf(self, x):
        ## probability 1/4 on the support {1, 2, 3, 4}, 0 elsewhere
        return np.where(np.isin(x, [1, 2, 3, 4]), 1/4, 0.0)

    def _cdf(self, x):
        ## step function: 0 below 1, k/4 between k and k + 1, 1 at or above 4
        return np.clip(np.floor(x) / 4, 0, 1)

    def _stats(self):
        ## return mean, variance, skewness, kurtosis of the four outcomes
        vals = np.array([1, 2, 3, 4])
        return vals.mean(), vals.var(ddof=0), 0, 0

die = four_die()

You can then test your new variable using scipy.stats' built-in methods.

print(die.rvs(size=10))
print(die.mean())
print(die.var())
print(die.std())
[1 1 1 1 3 3 4 2 3 3]
2.5
1.25
1.118033988749895

The above shows how to define your own discrete random variable. scipy.stats has many distributions built in already, including common ones such as:

  • Bernoulli – scipy.stats.bernoulli
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(1, 1)
a = 0.5 ## shape parameter, probability of success
x = np.arange(stats.bernoulli.ppf(0.01, a), stats.bernoulli.ppf(0.99, a) + 1) ## include the upper support point (x = 1)
rv = stats.bernoulli(a)
ax.plot(x, rv.pmf(x), ms=8)
ax.vlines(x, 0, rv.pmf(x), alpha=0.5)
plt.show()

  • Binomial – scipy.stats.binom
fig, ax = plt.subplots(1, 1)
n = 5 ## number of trials
a = 0.5 ## shape parameter, prob of a single success
x = np.arange(stats.binom.ppf(0.01, n, a), stats.binom.ppf(0.99, n, a))
rv = stats.binom(n, a)
ax.plot(x, rv.pmf(x), ms=8)
ax.vlines(x, 0, rv.pmf(x), alpha=0.5)
plt.show()

  • Geometric – scipy.stats.geom
fig, ax = plt.subplots(1, 1)
a = 0.5 ## shape parameter, prob of success on each trial
x = np.arange(stats.geom.ppf(0.01, a), stats.geom.ppf(0.99, a))
rv = stats.geom(a)
ax.plot(x, rv.pmf(x), ms=8)
ax.vlines(x, 0, rv.pmf(x), alpha=0.5)
plt.show()

  • Hyper Geometric – scipy.stats.hypergeom
fig, ax = plt.subplots(1, 1)
m = 50 ## total number of objects in the population
n = 17 ## number of type x objects in the population
n2 = 35 ## number of objects drawn without replacement
x = np.arange(0, n + 1)
rv = stats.hypergeom(m, n, n2)
ax.plot(x, rv.pmf(x), ms=8)
ax.vlines(x, 0, rv.pmf(x), alpha=0.5)
plt.show()

  • Poisson – scipy.stats.poisson
fig, ax = plt.subplots(1, 1)
a = 0.5 ## mean, shape parameter
x = np.arange(stats.poisson.ppf(0.01, a), stats.poisson.ppf(0.99, a))
rv = stats.poisson(a)
ax.plot(x, rv.pmf(x), ms=8)
ax.vlines(x, 0, rv.pmf(x), alpha=0.5)
plt.show()

  • Uniform – scipy.stats.randint
fig, ax = plt.subplots(1, 1)
a = 8 ## low
b = 40 ## high
x = np.arange(stats.randint.ppf(0.01, a, b), stats.randint.ppf(0.99, a, b))
rv = stats.randint(a, b)
ax.plot(x, rv.pmf(x), ms=8)
ax.vlines(x, 0, rv.pmf(x), alpha=0.5)
plt.show()

All of these different random variables, including ones that you choose to define yourself, can be used with all of scipy.stats’ operations which will be discussed later.

8.2.2 Continuous Variables

stats.rv_continuous(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)
<scipy.stats._distn_infrastructure.rv_continuous at 0x121192710>

Similarly:

  • a, b are floats representing the lower and upper bound of the distribution, respectively.

  • name represents the name of the distribution, which is optional.

  • badvalue is used to indicate a value for which some argument restriction is violated.

  • xtol is the tolerance for fixed point calculations of generic ppf.

  • note that, unlike rv_discrete, rv_continuous has no values argument; the distribution is defined through methods such as _pdf and _cdf.

  • momtype is the type of generic moment calculation to use.

  • longname is used to provide a line for the docstring if you are creating a class without one.

  • shapes is a string naming the shape parameters of the distribution. If not provided, the shape parameters are inferred from the signatures of the private methods such as _pdf and _cdf.

  • extradoc is the last line for the docstring given in longname.

  • seed represents a random seed. If none is provided, scipy.stats’ default random seed will be used.

We can define our own continuous random variable for student grades by subclassing rv_continuous.

class test_grades(stats.rv_continuous):
    ## a normal distribution with explicit shape parameters mu and sigma,
    ## implemented by delegating to scipy.stats.norm

    def _pdf(self, x, mu, sigma):
        return stats.norm.pdf(x, mu, sigma)

    def _cdf(self, x, mu, sigma):
        return stats.norm.cdf(x, mu, sigma)

    def _ppf(self, q, mu, sigma):
        return stats.norm.ppf(q, mu, sigma)

    def _rvs(self, mu, sigma, size=None, random_state=None):
        return stats.norm.rvs(mu, sigma, size=size, random_state=random_state)

grades = test_grades()

mu = 75
sigma = 10
size = 1000

# samples = grades.rvs(mu, sigma, size=size)

print("Mean grade: ", grades.mean(mu, sigma))
print("Standard deviation of grades: ", grades.std(mu, sigma))
print("95th percentile: ", grades.ppf(0.95, mu, sigma))
Mean grade:  75.0
Standard deviation of grades:  10.0000000001417
95th percentile:  91.44853626951472

Similarly, there are many built-in continuous distributions that you can use, such as:

  • Alpha – scipy.stats.alpha
fig, ax = plt.subplots(1, 1)
a = 3.57
x = np.linspace(stats.alpha.ppf(0.01, a), stats.alpha.ppf(0.99, a), 100) ## x grid between the 1st and 99th percentiles
rv = stats.alpha(a) ## initialize alpha rv
ax.plot(x, rv.pdf(x), 'k-', lw=2) ## plot pdf
plt.show()

  • Beta – scipy.stats.beta
fig, ax = plt.subplots(1, 1)
a, b = 2.31, 0.627
x = np.linspace(stats.beta.ppf(0.01, a, b), stats.beta.ppf(0.99, a, b), 100)
rv = stats.beta(a, b)
ax.plot(x, rv.pdf(x), 'k-', lw=2)
plt.show()

  • Chi Squared – scipy.stats.chi2
fig, ax = plt.subplots(1, 1)
a = 100 ## degrees of freedom
x = np.linspace(stats.chi2.ppf(0.01, a), stats.chi2.ppf(0.99, a), 100)
rv = stats.chi2(a)
ax.plot(x, rv.pdf(x), 'k-', lw=2)
plt.show()

  • Exponential – scipy.stats.expon
fig, ax = plt.subplots(1, 1)
x = np.linspace(stats.expon.ppf(0.01), stats.expon.ppf(0.99), 100)
rv = stats.expon() ## takes no arguments
ax.plot(x, rv.pdf(x), 'k-', lw=2)
plt.show()

  • Gamma – scipy.stats.gamma
fig, ax = plt.subplots(1, 1)
a = 3 ## shape parameter
x = np.linspace(stats.gamma.ppf(0.01, a), stats.gamma.ppf(0.99, a), 100)
rv = stats.gamma(a)
ax.plot(x, rv.pdf(x), 'k-', lw=2)
plt.show()

  • Uniform – scipy.stats.uniform
fig, ax = plt.subplots(1, 1)
x = np.linspace(stats.uniform.ppf(0.01), stats.uniform.ppf(0.99), 100)
rv = stats.uniform()
ax.plot(x, rv.pdf(x), 'k-', lw=2)
plt.show()

8.2.3 Summary Statistics

When examining data, you often want to compute summary statistics before running statistical tests.

There are several built-in methods to do so. First, let's take a look at describe:

import numpy as np
%matplotlib inline


array1 = np.random.randint(1,101,10)
print(array1)
print(stats.describe(array1))
[40 44 29 37  4 88 96  8 58 26]
DescribeResult(nobs=10, minmax=(4, 96), mean=43.0, variance=926.2222222222223, skewness=0.5537853453764205, kurtosis=-0.7004798003986137)

The describe method returns a number of summary statistics: the length of the array, a tuple containing the min and the max, the mean, the variance, the skewness, and the kurtosis (a measure of the tailedness of the data). Several optional parameters change how the results are computed; a short example follows the list.

  • ddof – delta degrees of freedom, integer.

  • bias – boolean variable with default value True. If False, skewness and kurtosis are corrected for statistical biases.

  • nan_policy – one of {'propagate', 'raise', 'omit'}, telling the method how to deal with NaN values: 'propagate' returns NaN, 'raise' throws an error, and 'omit' skips NaN values in the array.
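
A quick sketch (using a small hypothetical array with a missing value, not part of the original notes) of how these options change the result:

arr_nan = np.array([1.0, 2.0, np.nan, 4.0])
print(stats.describe(arr_nan))                     ## default 'propagate': NaN statistics
print(stats.describe(arr_nan, nan_policy='omit'))  ## drop the NaN before computing
print(stats.describe(arr_nan, ddof=0, bias=False, nan_policy='omit'))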

We can also compute statistics of trimmed data, if we do not want to alter the dataset. For example, let’s say we only want to see the values between 20 and 80.

print("Trimmed mean: " + str(stats.tmean(array1, (20, 80))))
print("Trimmed standard deviation: " + str(stats.tstd(array1, (20, 80))))
print(array1)
Trimmed mean: 39.0
Trimmed standard deviation: 11.489125293076057
[40 44 29 37  4 88 96  8 58 26]

As you can see, the array remains unchanged, but we obtain a new mean and standard deviation computed from the trimmed values. Similar functions can be used (a short example follows the list):

  • tvar – takes array and limit tuple, returns trimmed variance

  • tsem – takes array and limit tuple, returns trimmed standard error

  • tmin – takes array and minimum limit, returns new minimum value above trimmed limit

  • tmax – takes array and maximum limit, returns new maximum value below trimmed limit
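
A short sketch (an addition) applying these helpers to the same array; the exact values depend on the random draw above:

print("Trimmed variance: " + str(stats.tvar(array1, (20, 80))))
print("Trimmed standard error: " + str(stats.tsem(array1, (20, 80))))
print("Trimmed minimum: " + str(stats.tmin(array1, 20)))
print("Trimmed maximum: " + str(stats.tmax(array1, 80)))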

Further functions include iqr to compute the interquartile range of the data and sem which computes the standard error of the mean.

print("IQR: " + str(stats.iqr(array1)))
print("Std Error of the Mean: " + str(stats.sem(array1)))
IQR: 27.75
Std Error of the Mean: 9.624043964063247

8.2.4 Frequency Statistics

There are several methods in scipy.stats related to frequency statistics. relfreq and cumfreq are used to create a relative frequency and a cumulative frequency histogram, respectively. These can provide more insight into the data than a regular histogram would. An example is demonstrated below.

import matplotlib.pyplot as plt
## Create an array of grades with frequency of 0.2 for 90, 0.3 for 80, and so on
poss_grades = [90, 80, 70, 60, 50]
grades = np.random.choice(poss_grades, 100, p=[0.2, 0.3, 0.3, 0.1, 0.1])
print(grades)
## Use cumfreq to find our variables for the Cumulative Histogram
res = stats.cumfreq(grades, numbins=100)
x = res.lowerlimit + np.linspace(0, res.binsize*res.cumcount.size,
                                 res.cumcount.size)
## Use relfreq to find our variables for the Relative Histogram
res2 = stats.relfreq(grades, numbins=10)
x2 = res2.lowerlimit + np.linspace(0, res2.binsize*res2.frequency.size,
                                 res2.frequency.size)
## Plot normal histogram vs our 2 special ones
fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)
ax1.hist(grades, bins=25)
ax1.set_title('Histogram')
ax2.bar(x, res.cumcount, width=res.binsize)
ax2.set_title('Cumulative histogram')
ax2.set_xlim([x.min(), x.max()])
ax3.bar(x2, res2.frequency, width=res2.binsize)
ax3.set_title('Relative frequency histogram')
ax3.set_xlim([x2.min(), x2.max()])
plt.show()
[90 80 60 70 70 80 70 70 60 80 50 90 90 80 50 50 90 80 80 70 80 60 60 80
 90 80 70 50 80 80 50 70 60 60 80 70 70 60 50 50 90 70 70 70 80 90 70 60
 50 80 80 70 80 50 90 80 80 50 50 80 80 80 80 80 80 60 90 50 90 70 80 70
 80 90 60 90 70 90 70 90 90 90 50 70 80 90 70 90 80 70 50 70 60 50 80 80
 80 80 80 80]

There are also two methods to use when checking percentile ranks:

  • percentileofscore – takes an array and a score as input, returns the percentile rank of the score

  • scoreatpercentile – takes an array and a percentile as input, returns the score at that percentile

array2 = np.random.randint(1,101,100)
print(array2)
print("The percentile of the score 90 is: " + str(stats.percentileofscore(array2, 90)))
print("Score at the 90th percentile is: " + str(stats.scoreatpercentile(array2, 90)))
[71 65 98 88  6 21  9 90 56 53 11 52 29 14 37 87 43 79 19 52 49 14 69 88
 42 37 51 63 44  3 18 53 51 10  1 16 21 85  3 31 49 88 27 44 91 94 98 39
 84  7 52 22 97 11 30 83 94 61 19 47 28 63 32 28 38 61 22 31 95 55 81 66
 29 63  7 61 66 74 99 50 90 52 31 76 29 13 99 88 99 94 86  6 75 16 56 95
 54 68 25 32]
The percentile of the score 90 is: 87.5
Score at the 90th percentile is: 94.0

As you can see, the two values differ because they answer different questions: percentileofscore maps a score to its percentile rank, while scoreatpercentile maps a percentile to a score. Together they tell us more about the distribution than the IQR alone.

8.2.5 Statistical Tests

Finally, statistical tests are arguably the most useful part of the scipy.stats package. One of the most commonly used is the one-way ANOVA function. As a reminder, one-way ANOVA (analysis of variance) compares the means of two or more independent groups to determine whether they differ significantly. It is a parametric test that can be tedious and complicated to do by hand, so the scipy.stats implementation saves a lot of time.

The initial set up is as follows:

scipy.stats.f_oneway(*samples, axis=0)

It takes *samples, the two or more sample arrays, and axis, which defaults to 0 and specifies the axis of the input arrays along which the test is applied. The f_oneway function returns the F statistic and the p-value of the test.

precip_1895 = np.array([4.36, 1.24, 3.28, 5.08, 3.13, 3.09, 4.15, 2.06, 1.06, 3.56, 3.07, 2.78])
precip_1935 = np.array([4.19, 2.88, 2.57, 2.54, 1.96, 4.59, 4.42, 2.89, 6.69, 4.63, 5.15, 1.89])
precip_1975 = np.array([5.23, 3.29, 4.19, 3.35, 4.55, 6.29, 8.71, 3.70, 8.05, 3.80, 3.68, 2.82])
precip_2015 = np.array([4.43, 2.18, 4.55, 2.67, 1.21, 8.12, 3.68, 2.26, 3.60, 4.02, 2.30, 4.84])

F, p = stats.f_oneway(precip_1895, precip_1935, precip_1975, precip_2015, axis=0)
print("F statistic is " + str(F))
print("p value is " + str(p))
F statistic is 2.3920989190674504
p value is 0.0813172700146769

If we run the test at an alpha level of 0.05, the p-value (about 0.08) exceeds 0.05, so we fail to reject the null hypothesis and cannot conclude that the mean monthly precipitation differs among these four years spanning the last 128 years.

Now, if we wanted to check the difference between the first, second, and third months of each quarter, we could formulate the arrays differently:

precip_1895_split = np.array([[4.36, 1.24, 3.28], [5.08, 3.13, 3.09], [4.15, 2.06, 1.06], [3.56, 3.07, 2.78]])
precip_1935_split = np.array([[4.19, 2.88, 2.57], [2.54, 1.96, 4.59], [4.42, 2.89, 6.69], [4.63, 5.15, 1.89]])
precip_1975_split = np.array([[5.23, 3.29, 4.19], [3.35, 4.55, 6.29], [8.71, 3.70, 8.05], [3.80, 3.68, 2.82]])
precip_2015_split = np.array([[4.43, 2.18, 4.55], [2.67, 1.21, 8.12], [3.68, 2.26, 3.60], [4.02, 2.30, 4.84]])

F, p = stats.f_oneway(precip_1895_split, precip_1935_split, precip_1975_split, precip_2015_split, axis=0)
print("F statistic is " + str(F))
print("p value is " + str(p))
F statistic is [0.98374586 3.35080866 1.86738832]
p value is [0.43298199 0.05551278 0.18891543]

As you can see, the second months of each quarter yield the largest F statistic and smallest p-value, although none of the three tests is significant at the 0.05 level. This particular split does not reveal much about the dataset, but testing along an axis is a useful tool in general.

Finally, we can perform many sample tests related to the distributions discussed earlier. Among the most commonly used in scipy.stats are the t-tests.

  • One sample t-tests: scipy.stats.ttest_1samp

This calculates the mean of one group and tests it against a hypothesized population mean. It is called with an array and the popmean to test against, and there is an optional parameter called alternative with options 'two-sided', 'less', and 'greater', which specifies the direction of the hypothesis test. If the test merely checks for a difference between the mean of the data and popmean, use the default 'two-sided'. To check whether your data's mean is less than or greater than the given popmean, use 'less' or 'greater', respectively.

Let's say we want to check whether our 2015 monthly precipitation data is, on average, less than 5 inches. We set up the hypotheses as \(H_0: \mu \geq 5\) versus \(H_a: \mu < 5\) and test at the 0.05 significance level.

print(stats.ttest_1samp(precip_2015, 5.0, alternative='less'))
TtestResult(statistic=-2.5832078155485436, pvalue=0.012722209865812231, df=11)

As you can see, the p-value is less than 0.05, so we reject the null hypothesis and conclude, at the 0.05 level, that the average monthly precipitation in 2015 is less than 5.0 inches.

  • Two sample, independent t-tests: scipy.stats.ttest_ind

This test compares the means of two independent groups. It is called with two arrays, and the optional parameter alternative ('two-sided', 'less', 'greater') specifies the direction of the hypothesis test: use the default 'two-sided' to test for any difference between the two means, and 'less' or 'greater' to test whether the first mean is less than or greater than the second, respectively.

Let us now compare the 2015 monthly precipitation data against the 1895 data, using a two-sided test to check for a difference. We set up the hypotheses as \(H_0: \mu_1 = \mu_2\) versus \(H_a: \mu_1 \neq \mu_2\) and test at the 0.05 significance level.

print(stats.ttest_ind(precip_2015, precip_1895, alternative='two-sided'))
Ttest_indResult(statistic=0.9344724605084526, pvalue=0.3602045183624706)

As you can see, our p-value is greater than 0.05, so we cannot say with 95% confidence that the average monthly precipitation in 2015 is different than that in 1895.

  • Kolmogorov-Smirnov test for goodness of fit: scipy.stats.kstest

This one-sample test compares the distribution of a sample against a given distribution. It takes as input an array and either a distribution name or a callable CDF. A minimal sketch follows, and a fuller example with the NYC crash data appears in the next subsection.
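
Here is that sketch, using simulated standard-normal data (an addition, not part of the original notes):

sample = np.random.default_rng(1).normal(size=200)
print(stats.kstest(sample, stats.norm.cdf))  ## pass a callable cdf ...
print(stats.kstest(sample, "norm"))          ## ... or a distribution name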

8.2.6 NYC crash data examples

Let's use the Kolmogorov-Smirnov goodness-of-fit test to check our data on crashes by hour.

import pandas as pd
df = pd.read_csv('data/nyc_crashes_202301.csv')
df["HOUR"] = pd.to_numeric(df["CRASH TIME"].str.split(":").str[0])
plt.hist(df['HOUR'], bins=24)
plt.show()

As you can see, the histogram vaguely resembles a chi-squared density. We can use the Kolmogorov-Smirnov goodness-of-fit test to check how well such a distribution actually fits.

import math
dist = getattr(stats, 'chi2')
parameters = dist.fit(df['HOUR'])  ## chi2.fit returns (df, loc, scale)
print('KS statistic critical value (5% level): ' + str(1.36 / math.sqrt(7137)))
## note: only the fitted df and loc are passed below; scale stays at its default of 1
print(stats.kstest(df['HOUR'], "chi2", (parameters[0], parameters[1])))
KS statistic critical value (5% level): 0.016098338682511194
KstestResult(statistic=0.9999999999949739, pvalue=0.0, statistic_location=23, statistic_sign=1)

The KS statistic (essentially 1) far exceeds the critical value, and the p-value is essentially zero, so we have strong evidence that the hourly crash counts do not follow the fitted chi2 distribution.

parameters = stats.chi2.fit(df['HOUR']) 
plt.hist(df['HOUR'], bins=range(1, 25), density=True)
plt.xlabel('Value')
plt.ylabel('Relative frequency')
plt.title('Relative frequency histogram')

# xmin, xmax = plt.xlim()
# x = np.linspace(xmin, xmax, 100)
# p = stats.chi2.pdf(x, parameters[0], parameters[1])

degrees_f = 15  # degrees of freedom chosen by eye for the overlay
x = np.linspace(stats.chi2.ppf(0.01, degrees_f), stats.chi2.ppf(0.99, degrees_f), 100)
y = stats.chi2.pdf(x, degrees_f)
  
plt.plot(x, y, 'r-', lw=2, label='chi2 pdf')
title = "Fit Values: {:.2f} and {:.2f}".format(parameters[0], parameters[1])
plt.title(title)
  
plt.show()

Clearly the fit is extremely loose, as evidenced by the dramatically low p-value found in the previous calculation, so we can once again dismiss the idea that our data follows a chi2 distribution.

We can also use statistical tests to draw some conclusions from the data. Let's split the large dataset into smaller ones by borough to make comparisons easier.

## Create new dataframes for each borough
df = pd.read_csv('data/nyc_crashes_202301.csv', index_col='BOROUGH')
bronx = df.loc['BRONX']
staten_island = df.loc['STATEN ISLAND']
brooklyn = df.loc['BROOKLYN']
manhattan = df.loc['MANHATTAN']
queens = df.loc['QUEENS']
## Test to make sure it works
print(staten_island.head())
               CRASH DATE CRASH TIME  ZIP CODE   LATITUDE  LONGITUDE  \
BOROUGH                                                                
STATEN ISLAND  01/01/2023       6:40   10305.0  40.585240 -74.093414   
STATEN ISLAND  01/01/2023       1:18   10301.0  40.631510 -74.103874   
STATEN ISLAND  01/01/2023       2:00   10306.0  40.583366 -74.108350   
STATEN ISLAND  01/01/2023      13:00   10305.0  40.613980 -74.066050   
STATEN ISLAND  01/01/2023       7:10   10301.0  40.648228 -74.084500   

                             LOCATION    ON STREET NAME   CROSS STREET NAME  \
BOROUGH                                                                       
STATEN ISLAND  (40.58524, -74.093414)   HYLAN BOULEVARD         BUEL AVENUE   
STATEN ISLAND  (40.63151, -74.103874)               NaN                 NaN   
STATEN ISLAND  (40.583366, -74.10835)               NaN                 NaN   
STATEN ISLAND   (40.61398, -74.06605)               NaN                 NaN   
STATEN ISLAND   (40.648228, -74.0845)  RICHMOND TERRACE  SAINT PETERS PLACE   

                       OFF STREET NAME  NUMBER OF PERSONS INJURED  ...  \
BOROUGH                                                            ...   
STATEN ISLAND                      NaN                          1  ...   
STATEN ISLAND  504       KISSEL AVENUE                          0  ...   
STATEN ISLAND  1955      RICHMOND ROAD                          1  ...   
STATEN ISLAND     1201      BAY STREET                          0  ...   
STATEN ISLAND                      NaN                          0  ...   

               CONTRIBUTING FACTOR VEHICLE 2  CONTRIBUTING FACTOR VEHICLE 3  \
BOROUGH                                                                       
STATEN ISLAND                    Unspecified                            NaN   
STATEN ISLAND                    Unspecified                            NaN   
STATEN ISLAND                            NaN                            NaN   
STATEN ISLAND                    Unspecified                            NaN   
STATEN ISLAND                    Unspecified                            NaN   

               CONTRIBUTING FACTOR VEHICLE 4  CONTRIBUTING FACTOR VEHICLE 5  \
BOROUGH                                                                       
STATEN ISLAND                            NaN                            NaN   
STATEN ISLAND                            NaN                            NaN   
STATEN ISLAND                            NaN                            NaN   
STATEN ISLAND                            NaN                            NaN   
STATEN ISLAND                            NaN                            NaN   

               COLLISION_ID                  VEHICLE TYPE CODE 1  \
BOROUGH                                                            
STATEN ISLAND       4594485  Station Wagon/Sport Utility Vehicle   
STATEN ISLAND       4594804                                Sedan   
STATEN ISLAND       4594854                                  NaN   
STATEN ISLAND       4595229                                Sedan   
STATEN ISLAND       4603513                                Sedan   

               VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4  \
BOROUGH                                                                      
STATEN ISLAND                Sedan                 NaN                 NaN   
STATEN ISLAND                Sedan                 NaN                 NaN   
STATEN ISLAND                  NaN                 NaN                 NaN   
STATEN ISLAND                  NaN                 NaN                 NaN   
STATEN ISLAND                Sedan                 NaN                 NaN   

              VEHICLE TYPE CODE 5  
BOROUGH                            
STATEN ISLAND                 NaN  
STATEN ISLAND                 NaN  
STATEN ISLAND                 NaN  
STATEN ISLAND                 NaN  
STATEN ISLAND                 NaN  

[5 rows x 28 columns]

Now, let's perform a one-way ANOVA test to determine whether the average number of persons injured differs across boroughs.

## Get arrays for each borough
bronx_injuries = bronx['NUMBER OF PERSONS INJURED'].to_numpy()
si_injuries = staten_island['NUMBER OF PERSONS INJURED'].to_numpy()
brooklyn_injuries = brooklyn['NUMBER OF PERSONS INJURED'].to_numpy()
manhattan_injuries = manhattan['NUMBER OF PERSONS INJURED'].to_numpy()
queens_injuries = queens['NUMBER OF PERSONS INJURED'].to_numpy()
## Perform one way ANOVA
F, p = stats.f_oneway(bronx_injuries, si_injuries, brooklyn_injuries, manhattan_injuries, queens_injuries, axis=0)
print("F statistic is " + str(F))
print("p value is " + str(p))

Testing at the 0.05 significance level, the small p-value indicates that the mean number of persons injured differs across boroughs.

Finally, we can use a test to compare boroughs against each other to draw more meaningful conclusions.

print(stats.ttest_ind(bronx_injuries, brooklyn_injuries, alternative='two-sided'))

print(np.mean(bronx_injuries))
print(np.mean(brooklyn_injuries))

As you can see, the p-value is too large to reject the null hypothesis. Comparing the two sample means, they are quite similar, so failing to reject makes sense.

8.2.7 REFERENCES

SciPy Official Documentation

GeeksForGeeks Documentation

TutorialsPoint Documentation

Javatpoint Documentation

8.3 Linear Model

Let’s simulate some data for illustrations.

import numpy as np

nobs = 200
ncov = 5
np.random.seed(123)
x = np.random.random((nobs, ncov)) # Uniform over [0, 1)
beta = np.repeat(1, ncov)
y = 2 + np.dot(x, beta) + np.random.normal(size = nobs)

Check the shape of y:

y.shape
(200,)

Check the shape of x:

x.shape
(200, 5)

That is, the true linear regression model is \[ y = 2 + x_1 + x_2 + x_3 + x_4 + x_5 + \epsilon. \]

A regression model for the observed data can be fitted as

import statsmodels.api as sma
xmat = sma.add_constant(x)
mymod = sma.OLS(y, xmat)
myfit = mymod.fit()
myfit.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.309
Model: OLS Adj. R-squared: 0.292
Method: Least Squares F-statistic: 17.38
Date: Tue, 21 Mar 2023 Prob (F-statistic): 3.31e-14
Time: 16:18:04 Log-Likelihood: -272.91
No. Observations: 200 AIC: 557.8
Df Residuals: 194 BIC: 577.6
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 1.8754 0.282 6.656 0.000 1.320 2.431
x1 1.1703 0.248 4.723 0.000 0.682 1.659
x2 0.8988 0.235 3.825 0.000 0.435 1.362
x3 0.9784 0.238 4.114 0.000 0.509 1.448
x4 1.3418 0.250 5.367 0.000 0.849 1.835
x5 0.6027 0.239 2.519 0.013 0.131 1.075
Omnibus: 0.810 Durbin-Watson: 1.978
Prob(Omnibus): 0.667 Jarque-Bera (JB): 0.903
Skew: -0.144 Prob(JB): 0.637
Kurtosis: 2.839 Cond. No. 8.31


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Questions to review:

  • How are the regression coefficients interpreted? Intercept?
  • Why does it make sense to center the covariates? (See the short sketch below.)
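
As a hint for the second question, here is a small sketch (an addition, not in the original notes): with centered covariates the slope estimates are unchanged, while the intercept becomes the sample mean of y, i.e. the fitted mean response at the average covariate values.

xc = x - x.mean(axis=0)                        ## center each covariate
cfit = sma.OLS(y, sma.add_constant(xc)).fit()
print(cfit.params[1:])                         ## slopes: identical to the uncentered fit
print(cfit.params[0], y.mean())                ## intercept now equals the mean of y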

Now we form a data frame with the variables

import pandas as pd
df = np.concatenate((y.reshape((nobs, 1)), x), axis = 1)
df = pd.DataFrame(data = df,
                  columns = ["y"] + ["x" + str(i) for i in range(1,
                  ncov + 1)])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   y       200 non-null    float64
 1   x1      200 non-null    float64
 2   x2      200 non-null    float64
 3   x3      200 non-null    float64
 4   x4      200 non-null    float64
 5   x5      200 non-null    float64
dtypes: float64(6)
memory usage: 9.5 KB

Let's use a formula to specify the regression model as in R, and fit a robust linear model (rlm) instead of OLS. Note that the model specification and the function interface are similar to R.

import statsmodels.formula.api as smf
mymod = smf.rlm(formula = "y ~ x1 + x2 + x3 + x4 + x5", data = df)
myfit = mymod.fit()
myfit.summary()
Robust linear Model Regression Results
Dep. Variable: y No. Observations: 200
Model: RLM Df Residuals: 194
Method: IRLS Df Model: 5
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Tue, 21 Mar 2023
Time: 16:18:04
No. Iterations: 16
coef std err z P>|z| [0.025 0.975]
Intercept 1.8353 0.294 6.246 0.000 1.259 2.411
x1 1.1254 0.258 4.355 0.000 0.619 1.632
x2 0.9664 0.245 3.944 0.000 0.486 1.447
x3 0.9995 0.248 4.029 0.000 0.513 1.486
x4 1.3275 0.261 5.091 0.000 0.816 1.839
x5 0.6768 0.250 2.712 0.007 0.188 1.166


If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore.

For model diagnostics, one can check residual plots.

import matplotlib.pyplot as plt

myOlsFit = smf.ols(formula = "y ~ x1 + x2 + x3 + x4 + x5", data = df).fit()
fig = plt.figure(figsize = (6, 6))
## residual versus x1; can do the same for other covariates
fig = sma.graphics.plot_regress_exog(myOlsFit, 'x1', fig=fig)
eval_env: 1

See more on residual diagnostics and specification tests; one such test is sketched below.
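
For instance, a heteroscedasticity check can be run on the OLS residuals. A small sketch using statsmodels' diagnostic module (an addition, not part of the original text):

from statsmodels.stats.diagnostic import het_breuschpagan

## Breusch-Pagan test; the null hypothesis is homoscedastic errors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(myOlsFit.resid, myOlsFit.model.exog)
print(lm_stat, lm_pvalue)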

8.4 Generalized Linear Regression

A linear regression model cannot be applied to presence/absence or count data.

Binary or count data need to be modeled under a generalized framework. Consider a binary or count variable \(Y\) with possible covariates \(X\). A generalized linear model describes a transformation \(g\) of the conditional mean \(E[Y | X]\) by a linear predictor \(X^{\top}\beta\). That is \[ g( E[Y | X] ) = X^{\top} \beta. \] The transformation \(g\) is known as the link function.

For logistic regression with binary outcomes, the link function is the logit function \[ g(u) = \log \frac{u}{1 - u}, \quad u \in (0, 1). \]
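
As a small numerical illustration (an addition; these functions live in scipy.special), the logit link and its inverse:

from scipy.special import logit, expit

u = np.array([0.1, 0.5, 0.9])
print(logit(u))          ## log(u / (1 - u))
print(expit(logit(u)))   ## the inverse link recovers u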

What is the interpretation of the regression coefficients in a logistic regression? Intercept?

A logistic regression can be fit with statsmodels' GLM, here via the formula interface statsmodels.formula.api.glm.

Let’s generate some binary data first by dichotomizing existing variables.

import statsmodels.genmod as smg
eta = x.dot([2, 2, 2, 2, 2]) - 5
p = smg.families.links.Logit().inverse(eta)
df["yb"] = np.random.binomial(1, p, p.size)

Fit a logistic regression for yb.

mylogistic = smf.glm(formula = 'yb ~ x1 + x2 + x3 + x4 + x5', data = df,
                     family = smg.families.Binomial())
mylfit = mylogistic.fit()
mylfit.summary()
Generalized Linear Model Regression Results
Dep. Variable: yb No. Observations: 200
Model: GLM Df Residuals: 194
Model Family: Binomial Df Model: 5
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -110.80
Date: Tue, 21 Mar 2023 Deviance: 221.61
Time: 16:18:06 Pearson chi2: 196.
No. Iterations: 4 Pseudo R-squ. (CS): 0.2410
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -4.6982 0.820 -5.728 0.000 -6.306 -3.091
x1 2.1654 0.618 3.505 0.000 0.955 3.376
x2 1.1788 0.579 2.035 0.042 0.044 2.314
x3 1.5921 0.587 2.713 0.007 0.442 2.742
x4 2.2310 0.630 3.539 0.000 0.995 3.466
x5 2.5496 0.598 4.263 0.000 1.377 3.722
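
To revisit the interpretation question raised above: exponentiating the logistic-regression coefficients puts them on the odds scale. A small sketch (an addition, not part of the original notes):

## exp(Intercept) is the baseline odds of yb = 1; exp(slope) is the odds ratio
## associated with a one-unit increase in that covariate
print(np.exp(mylfit.params))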

If we treat yb as count data, a Poisson regression can be fitted.

myPois = smf.glm(formula = 'yb ~ x1 + x2 + x3 + x4 + x5', data = df,
                 family = smg.families.Poisson())
mypfit = myPois.fit()
mypfit.summary()
Generalized Linear Model Regression Results
Dep. Variable: yb No. Observations: 200
Model: GLM Df Residuals: 194
Model Family: Poisson Df Model: 5
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -161.33
Date: Tue, 21 Mar 2023 Deviance: 112.66
Time: 16:18:06 Pearson chi2: 89.6
No. Iterations: 5 Pseudo R-squ. (CS): 0.1071
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept -2.4529 0.435 -5.644 0.000 -3.305 -1.601
x1 0.7638 0.361 2.117 0.034 0.057 1.471
x2 0.3698 0.330 1.121 0.262 -0.277 1.017
x3 0.5396 0.345 1.562 0.118 -0.137 1.217
x4 0.7895 0.368 2.144 0.032 0.068 1.511
x5 0.9477 0.352 2.691 0.007 0.257 1.638