9.6. Stepwise Regression

In a stepwise regression, variables are added and removed from the model based on significance. You can have a forward selection stepwise which adds variables if they are statistically significant until all the variables outside the model are not significant, a backwards elimination stepwise regression which puts in all the variables and then removes those that are not statistically significant until only statistically significant ones remain, and a bidirectional elimination which both adds and removes until all the variables inside are significant AND all those outside are not significant.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
df.head()
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... VEHICLE TYPE CODE 5 lat long hour date BROOK BRONX MANHAT STATEN QUEENS
0 01/01/2022 7:05 NaN NaN NaN NaN NaN EAST 128 STREET 3 AVENUE BRIDGE NaN ... NaN 0 0 7 1 0 0 0 0 0
1 01/01/2022 14:43 NaN NaN 40.769993 -73.915825 (40.769993, -73.915825) GRAND CENTRAL PKWY NaN NaN ... NaN 1 1 14 1 0 0 0 0 0
2 01/01/2022 21:20 QUEENS 11414.0 40.657230 -73.841380 (40.65723, -73.84138) 91 STREET 160 AVENUE NaN ... NaN 1 1 21 1 0 0 0 0 1
3 01/01/2022 4:30 NaN NaN NaN NaN NaN Southern parkway Jfk expressway NaN ... NaN 0 0 4 1 0 0 0 0 0
4 01/01/2022 7:57 NaN NaN NaN NaN NaN WESTCHESTER AVENUE SHERIDAN EXPRESSWAY NaN ... NaN 0 0 7 1 0 0 0 0 0

5 rows × 38 columns

9.6.1. Data scrubbing

df = pd.read_csv("../data/nyc_mv_collisions_202201.csv")

zeros = []
for i in range(len(df)):
    zeros.append(0)
    
df["lat"] = zeros
for i in range(len(df)):
    if df["LATITUDE"][i] > 0 or df["LATITUDE"][i] < 0:
        df["lat"][i] = 1
    else:
        df["lat"][i] = 0

df["long"] = zeros
for i in range(len(df)):
    if df["LATITUDE"][i] > 0 or df["LONGITUDE"][i] < 0:
        df["long"][i] = 1
    else:
        df["long"][i] = 0
    
df["hour"] = zeros
for i in range(len(df)):
    if (df["CRASH TIME"][i])[1:2] == ':':
        df["hour"][i] = str(df["CRASH TIME"][i])[0:1]
    else:
        df["hour"][i] = str(df["CRASH TIME"][i])[0:2]
    cap = int(df["hour"][i])
    df["hour"][i] = cap

df["date"] = zeros
for i in range(len(df)):
    df["date"][i] = str(df["CRASH DATE"][i])[3:5]
    cap = int(df["date"][i])
    df["date"][i] = cap
    
df["BROOK"] = zeros
df["BRONX"] = zeros
df["MANHAT"] = zeros
df["STATEN"] = zeros
df["QUEENS"] = zeros

for i in range(len(df)):
    if df["BOROUGH"][i] == "NaN":
        cat = 1
    elif df["BOROUGH"][i] == 'BROOKLYN':
        df["BROOK"][i] = 1
    elif df["BOROUGH"][i] == 'BRONX':
        df["BRONX"][i] = 1
    elif df["BOROUGH"][i] == 'MANHATTAN':
        df["MANHAT"][i] = 1
    elif df["BOROUGH"][i] == 'STATEN ISLAND':
        df["STATEN"][i] = 1
    elif df["BOROUGH"][i] == 'QUEENS':
        df["QUEENS"][i] = 1
x_columns = ["date", "hour", "BROOK", "BRONX", "STATEN", "QUEENS", "MANHAT", "NUMBER OF PERSONS KILLED", "lat"]
y = df["NUMBER OF PERSONS INJURED"]
def get_stats():
    x = df[x_columns]
    results = sm.OLS(y, x).fit()
    print(results.summary())

def rem_high():
    x = df[x_columns]
    results = sm.OLS(y, x).fit()
    spot = 0
    for i in range(len(results.pvalues)):
        if results.pvalues[spot] < results.pvalues[i]:
            spot = i
    if results.pvalues[spot] > .05:
        x_columns.pop(spot)
    print(results.summary())
get_stats()
rem_high()
      date  hour  BROOK  BRONX  STATEN  QUEENS  MANHAT  \
0        1     7      0      0       0       0       0   
1        1    14      0      0       0       0       0   
2        1    21      0      0       0       1       0   
3        1     4      0      0       0       0       0   
4        1     7      0      0       0       0       0   
...    ...   ...    ...    ...     ...     ...     ...   
7654    31    12      0      0       0       1       0   
7655    31    21      1      0       0       0       0   
7656    31     9      1      0       0       0       0   
7657    31     6      0      1       0       0       0   
7658    31    14      0      0       0       1       0   

      NUMBER OF PERSONS KILLED  lat  
0                            0    0  
1                            0    1  
2                            0    1  
3                            0    0  
4                            0    0  
...                        ...  ...  
7654                         0    1  
7655                         0    1  
7656                         0    1  
7657                         0    1  
7658                         0    1  

[7659 rows x 9 columns]
rem_high()
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.226
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              279.0
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:10   Log-Likelihood:                         -8476.1
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7651   BIC:                                  1.702e+04
Df Model:                                  8                                                  
Covariance Type:                   nonrobust                                                  
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
date                         0.0018      0.001      2.109      0.035       0.000       0.003
hour                         0.0139      0.001     11.350      0.000       0.011       0.016
BROOK                       -0.0278      0.022     -1.279      0.201      -0.070       0.015
STATEN                      -0.0293      0.053     -0.554      0.580      -0.133       0.074
QUEENS                      -0.0682      0.023     -2.918      0.004      -0.114      -0.022
MANHAT                      -0.0622      0.030     -2.084      0.037      -0.121      -0.004
NUMBER OF PERSONS KILLED     0.2820      0.173      1.633      0.103      -0.057       0.621
lat                          0.2189      0.022      9.965      0.000       0.176       0.262
==============================================================================
Omnibus:                     4117.098   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35735.067
Skew:                           2.450   Prob(JB):                         0.00
Kurtosis:                      12.379   Cond. No.                         452.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

9.6.2. The Bronx is removed

rem_high()
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.226
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              318.8
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:12   Log-Likelihood:                         -8476.3
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7652   BIC:                                  1.702e+04
Df Model:                                  7                                                  
Covariance Type:                   nonrobust                                                  
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
date                         0.0018      0.001      2.108      0.035       0.000       0.003
hour                         0.0139      0.001     11.340      0.000       0.011       0.016
BROOK                       -0.0261      0.022     -1.214      0.225      -0.068       0.016
QUEENS                      -0.0666      0.023     -2.871      0.004      -0.112      -0.021
MANHAT                      -0.0605      0.030     -2.039      0.042      -0.119      -0.002
NUMBER OF PERSONS KILLED     0.2826      0.173      1.636      0.102      -0.056       0.621
lat                          0.2175      0.022      9.970      0.000       0.175       0.260
==============================================================================
Omnibus:                     4118.495   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35771.998
Skew:                           2.451   Prob(JB):                         0.00
Kurtosis:                      12.384   Cond. No.                         452.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

9.6.3. Staten island is removed

rem_high()
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.226
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              371.7
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:15   Log-Likelihood:                         -8477.0
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7653   BIC:                                  1.701e+04
Df Model:                                  6                                                  
Covariance Type:                   nonrobust                                                  
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
date                         0.0018      0.001      2.072      0.038    9.57e-05       0.003
hour                         0.0138      0.001     11.301      0.000       0.011       0.016
QUEENS                      -0.0579      0.022     -2.624      0.009      -0.101      -0.015
MANHAT                      -0.0518      0.029     -1.798      0.072      -0.108       0.005
NUMBER OF PERSONS KILLED     0.2818      0.173      1.632      0.103      -0.057       0.620
lat                          0.2098      0.021     10.049      0.000       0.169       0.251
==============================================================================
Omnibus:                     4121.849   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35880.243
Skew:                           2.453   Prob(JB):                         0.00
Kurtosis:                      12.400   Cond. No.                         452.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

9.6.4. Brooklyn is removed

rem_high()
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.225
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              445.4
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:18   Log-Likelihood:                         -8478.3
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7654   BIC:                                  1.700e+04
Df Model:                                  5                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
date           0.0018      0.001      2.084      0.037       0.000       0.003
hour           0.0138      0.001     11.296      0.000       0.011       0.016
QUEENS        -0.0578      0.022     -2.621      0.009      -0.101      -0.015
MANHAT        -0.0516      0.029     -1.791      0.073      -0.108       0.005
lat            0.2102      0.021     10.071      0.000       0.169       0.251
==============================================================================
Omnibus:                     4130.912   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            36165.015
Skew:                           2.458   Prob(JB):                         0.00
Kurtosis:                      12.443   Cond. No.                         77.7
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

9.6.5. Persons Killed is removed

rem_high()
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.225
Model:                                   OLS   Adj. R-squared (uncentered):              0.225
Method:                        Least Squares   F-statistic:                              555.8
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               03:31:21   Log-Likelihood:                         -8479.9
No. Observations:                       7659   AIC:                                  1.697e+04
Df Residuals:                           7655   BIC:                                  1.700e+04
Df Model:                                  4                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
date           0.0017      0.001      2.036      0.042    6.47e-05       0.003
hour           0.0137      0.001     11.253      0.000       0.011       0.016
QUEENS        -0.0515      0.022     -2.365      0.018      -0.094      -0.009
lat            0.2052      0.021      9.918      0.000       0.165       0.246
==============================================================================
Omnibus:                     4134.561   Durbin-Watson:                   1.965
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            36246.070
Skew:                           2.460   Prob(JB):                         0.00
Kurtosis:                      12.454   Cond. No.                         60.8
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

9.6.6. Manhattan is removed

From here we can see all variables are under the .05 threshold established previously. Our formula would come out to be
y = date * .0017 + hour * .0137 - QUEENS * .0515 + lat * .2052

x_col = []
used = 0
def crunch_num():
    p_val = 1
    location = 0
    for i in range(len(x_columns)):
        if used == 0:
            x = df[x_columns[i]]
        results = sm.OLS(y, x).fit()
        print(p_val)
        print(results.pvalues[len(x_col)])
        if (p_val > results.pvalues[len(x_col)]):
            p_val = results.pvalues[len(x_col)]
            location = i
        print(results.summary())
    if (p_val <.05):
        print(location)
        x_col.append(x_columns[location])
        x_columns.remove(x_columns[location])
crunch_num()
1
1.4323814844141276e-306
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.167
Model:                                   OLS   Adj. R-squared (uncentered):              0.167
Method:                        Least Squares   F-statistic:                              1537.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                   1.43e-306
Time:                               05:18:18   Log-Likelihood:                         -8755.8
No. Observations:                       7659   AIC:                                  1.751e+04
Df Residuals:                           7658   BIC:                                  1.752e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
date           0.0185      0.000     39.207      0.000       0.018       0.019
==============================================================================
Omnibus:                     3858.755   Durbin-Watson:                   1.810
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            29521.077
Skew:                           2.303   Prob(JB):                         0.00
Kurtosis:                      11.444   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
1.4323814844141276e-306
0.0
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.208
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              2014.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:18   Log-Likelihood:                         -8562.1
No. Observations:                       7659   AIC:                                  1.713e+04
Df Residuals:                           7658   BIC:                                  1.713e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0271      0.001     44.880      0.000       0.026       0.028
==============================================================================
Omnibus:                     3979.018   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33128.310
Skew:                           2.360   Prob(JB):                         0.00
Kurtosis:                      12.030   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
0.0
4.269856558800235e-90
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.052
Model:                                   OLS   Adj. R-squared (uncentered):              0.051
Method:                        Least Squares   F-statistic:                              416.0
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                    4.27e-90
Time:                               05:18:18   Log-Likelihood:                         -9253.7
No. Observations:                       7659   AIC:                                  1.851e+04
Df Residuals:                           7658   BIC:                                  1.852e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
BROOK          0.3956      0.019     20.397      0.000       0.358       0.434
==============================================================================
Omnibus:                     3929.307   Durbin-Watson:                   1.678
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            32281.384
Skew:                           2.327   Prob(JB):                         0.00
Kurtosis:                      11.916   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
0.0
2.393436485478676e-52
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.030
Model:                                   OLS   Adj. R-squared (uncentered):              0.030
Method:                        Least Squares   F-statistic:                              235.4
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                    2.39e-52
Time:                               05:18:18   Log-Likelihood:                         -9340.3
No. Observations:                       7659   AIC:                                  1.868e+04
Df Residuals:                           7658   BIC:                                  1.869e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
BRONX          0.4082      0.027     15.342      0.000       0.356       0.460
==============================================================================
Omnibus:                     3987.014   Durbin-Watson:                   1.601
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            32899.862
Skew:                           2.370   Prob(JB):                         0.00
Kurtosis:                      11.979   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
0.0
4.369430499033895e-11
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.006
Model:                                   OLS   Adj. R-squared (uncentered):              0.006
Method:                        Least Squares   F-statistic:                              43.57
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                    4.37e-11
Time:                               05:18:18   Log-Likelihood:                         -9434.6
No. Observations:                       7659   AIC:                                  1.887e+04
Df Residuals:                           7658   BIC:                                  1.888e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
STATEN         0.3842      0.058      6.601      0.000       0.270       0.498
==============================================================================
Omnibus:                     4179.444   Durbin-Watson:                   1.527
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            37164.558
Skew:                           2.489   Prob(JB):                         0.00
Kurtosis:                      12.574   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
0.0
6.945005855138339e-56
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.032
Model:                                   OLS   Adj. R-squared (uncentered):              0.032
Method:                        Least Squares   F-statistic:                              252.1
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                    6.95e-56
Time:                               05:18:18   Log-Likelihood:                         -9332.2
No. Observations:                       7659   AIC:                                  1.867e+04
Df Residuals:                           7658   BIC:                                  1.867e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
QUEENS         0.3482      0.022     15.878      0.000       0.305       0.391
==============================================================================
Omnibus:                     3969.257   Durbin-Watson:                   1.627
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            32316.549
Skew:                           2.362   Prob(JB):                         0.00
Kurtosis:                      11.886   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
0.0
4.461967379856326e-32
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.018
Model:                                   OLS   Adj. R-squared (uncentered):              0.018
Method:                        Least Squares   F-statistic:                              140.3
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                    4.46e-32
Time:                               05:18:18   Log-Likelihood:                         -9386.8
No. Observations:                       7659   AIC:                                  1.878e+04
Df Residuals:                           7658   BIC:                                  1.878e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
MANHAT         0.3596      0.030     11.843      0.000       0.300       0.419
==============================================================================
Omnibus:                     4087.592   Durbin-Watson:                   1.584
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35155.863
Skew:                           2.431   Prob(JB):                         0.00
Kurtosis:                      12.302   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
0.0
0.0018205759792527649
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.001
Model:                                   OLS   Adj. R-squared (uncentered):              0.001
Method:                        Least Squares   F-statistic:                              9.729
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                     0.00182
Time:                               05:18:18   Log-Likelihood:                         -9451.4
No. Observations:                       7659   AIC:                                  1.890e+04
Df Residuals:                           7658   BIC:                                  1.891e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
NUMBER OF PERSONS KILLED     0.6111      0.196      3.119      0.002       0.227       0.995
==============================================================================
Omnibus:                     4194.263   Durbin-Watson:                   1.510
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            37329.833
Skew:                           2.501   Prob(JB):                         0.00
Kurtosis:                      12.590   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
0.0
0.0
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.210
Model:                                   OLS   Adj. R-squared (uncentered):              0.210
Method:                        Least Squares   F-statistic:                              2032.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:18   Log-Likelihood:                         -8555.0
No. Observations:                       7659   AIC:                                  1.711e+04
Df Residuals:                           7658   BIC:                                  1.712e+04
Df Model:                                  1                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
lat            0.3962      0.009     45.077      0.000       0.379       0.413
==============================================================================
Omnibus:                     4119.911   Durbin-Watson:                   1.965
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35447.666
Skew:                           2.456   Prob(JB):                         0.00
Kurtosis:                      12.324   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
1

9.6.7. Start by adding Hour

def crunch_nums():
    p_val = 1
    location = 0
    for i in range(len(x_columns)-len(x_col)):
        x_col.append(x_columns[i])
        x = df[x_col]
        results = sm.OLS(y, x).fit()
        print(results.summary())
        x_col.pop(len(x_col)-1)
        print(p_val)
        print(results.pvalues[len(x_col)])
        if (p_val > results.pvalues[len(x_col)]):
            p_val = results.pvalues[len(x_col)]
            location = i
    if (p_val <.05):
        print(location)
        x_col.append(x_columns[location])
        x_columns.remove(x_columns[int(location)])
crunch_nums()
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.215
Model:                                   OLS   Adj. R-squared (uncentered):              0.215
Method:                        Least Squares   F-statistic:                              1049.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:31   Log-Likelihood:                         -8529.0
No. Observations:                       7659   AIC:                                  1.706e+04
Df Residuals:                           7657   BIC:                                  1.708e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0210      0.001     21.614      0.000       0.019       0.023
date           0.0060      0.001      8.154      0.000       0.005       0.007
==============================================================================
Omnibus:                     4031.194   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33946.805
Skew:                           2.396   Prob(JB):                         0.00
Kurtosis:                      12.133   Cond. No.                         3.01
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
1
4.0666379515353367e-16
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.209
Model:                                   OLS   Adj. R-squared (uncentered):              0.209
Method:                        Least Squares   F-statistic:                              1013.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:31   Log-Likelihood:                         -8557.5
No. Observations:                       7659   AIC:                                  1.712e+04
Df Residuals:                           7657   BIC:                                  1.713e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0263      0.001     39.070      0.000       0.025       0.028
BROOK          0.0593      0.020      3.011      0.003       0.021       0.098
==============================================================================
Omnibus:                     3992.980   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33490.382
Skew:                           2.368   Prob(JB):                         0.00
Kurtosis:                      12.084   Cond. No.                         32.6
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.0026154978758844666
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.209
Model:                                   OLS   Adj. R-squared (uncentered):              0.209
Method:                        Least Squares   F-statistic:                              1012.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:31   Log-Likelihood:                         -8558.0
No. Observations:                       7659   AIC:                                  1.712e+04
Df Residuals:                           7657   BIC:                                  1.713e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0266      0.001     41.659      0.000       0.025       0.028
BRONX          0.0722      0.025      2.847      0.004       0.022       0.122
==============================================================================
Omnibus:                     3981.961   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33171.456
Skew:                           2.362   Prob(JB):                         0.00
Kurtosis:                      12.035   Cond. No.                         41.9
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.004426633639098556
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.208
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              1007.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:31   Log-Likelihood:                         -8561.8
No. Observations:                       7659   AIC:                                  1.713e+04
Df Residuals:                           7657   BIC:                                  1.714e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0271      0.001     44.270      0.000       0.026       0.028
STATEN         0.0367      0.053      0.699      0.484      -0.066       0.140
==============================================================================
Omnibus:                     3980.742   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33169.872
Skew:                           2.361   Prob(JB):                         0.00
Kurtosis:                      12.036   Cond. No.                         86.9
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.4843669202067119
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.208
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              1007.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:31   Log-Likelihood:                         -8561.9
No. Observations:                       7659   AIC:                                  1.713e+04
Df Residuals:                           7657   BIC:                                  1.714e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0270      0.001     41.304      0.000       0.026       0.028
QUEENS         0.0114      0.021      0.532      0.594      -0.031       0.053
==============================================================================
Omnibus:                     3980.405   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33143.808
Skew:                           2.361   Prob(JB):                         0.00
Kurtosis:                      12.031   Cond. No.                         35.5
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.5944459717669641
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.208
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              1007.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:31   Log-Likelihood:                         -8562.0
No. Observations:                       7659   AIC:                                  1.713e+04
Df Residuals:                           7657   BIC:                                  1.714e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0271      0.001     42.898      0.000       0.026       0.028
MANHAT         0.0115      0.028      0.405      0.686      -0.044       0.067
==============================================================================
Omnibus:                     3980.428   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33157.904
Skew:                           2.361   Prob(JB):                         0.00
Kurtosis:                      12.034   Cond. No.                         47.0
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.6858402170710972
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.209
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              1009.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:18:31   Log-Likelihood:                         -8560.3
No. Observations:                       7659   AIC:                                  1.712e+04
Df Residuals:                           7657   BIC:                                  1.714e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
hour                         0.0271      0.001     44.790      0.000       0.026       0.028
NUMBER OF PERSONS KILLED     0.3296      0.175      1.888      0.059      -0.013       0.672
==============================================================================
Omnibus:                     3969.378   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            32855.187
Skew:                           2.355   Prob(JB):                         0.00
Kurtosis:                      11.987   Cond. No.                         289.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.059032381034271476
0

9.6.8. Add Date

crunch_nums()
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.215
Model:                                   OLS   Adj. R-squared (uncentered):              0.215
Method:                        Least Squares   F-statistic:                              1049.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:12:14   Log-Likelihood:                         -8529.0
No. Observations:                       7659   AIC:                                  1.706e+04
Df Residuals:                           7657   BIC:                                  1.708e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0210      0.001     21.614      0.000       0.019       0.023
date           0.0060      0.001      8.154      0.000       0.005       0.007
==============================================================================
Omnibus:                     4031.194   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33946.805
Skew:                           2.396   Prob(JB):                         0.00
Kurtosis:                      12.133   Cond. No.                         3.01
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
1
4.0666379515353367e-16
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.209
Model:                                   OLS   Adj. R-squared (uncentered):              0.209
Method:                        Least Squares   F-statistic:                              1013.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:12:14   Log-Likelihood:                         -8557.5
No. Observations:                       7659   AIC:                                  1.712e+04
Df Residuals:                           7657   BIC:                                  1.713e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0263      0.001     39.070      0.000       0.025       0.028
BROOK          0.0593      0.020      3.011      0.003       0.021       0.098
==============================================================================
Omnibus:                     3992.980   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33490.382
Skew:                           2.368   Prob(JB):                         0.00
Kurtosis:                      12.084   Cond. No.                         32.6
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.0026154978758844666
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.209
Model:                                   OLS   Adj. R-squared (uncentered):              0.209
Method:                        Least Squares   F-statistic:                              1012.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:12:14   Log-Likelihood:                         -8558.0
No. Observations:                       7659   AIC:                                  1.712e+04
Df Residuals:                           7657   BIC:                                  1.713e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0266      0.001     41.659      0.000       0.025       0.028
BRONX          0.0722      0.025      2.847      0.004       0.022       0.122
==============================================================================
Omnibus:                     3981.961   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33171.456
Skew:                           2.362   Prob(JB):                         0.00
Kurtosis:                      12.035   Cond. No.                         41.9
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.004426633639098556
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.208
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              1007.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:12:14   Log-Likelihood:                         -8561.8
No. Observations:                       7659   AIC:                                  1.713e+04
Df Residuals:                           7657   BIC:                                  1.714e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0271      0.001     44.270      0.000       0.026       0.028
STATEN         0.0367      0.053      0.699      0.484      -0.066       0.140
==============================================================================
Omnibus:                     3980.742   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33169.872
Skew:                           2.361   Prob(JB):                         0.00
Kurtosis:                      12.036   Cond. No.                         86.9
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.4843669202067119
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.208
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              1007.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:12:14   Log-Likelihood:                         -8561.9
No. Observations:                       7659   AIC:                                  1.713e+04
Df Residuals:                           7657   BIC:                                  1.714e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0270      0.001     41.304      0.000       0.026       0.028
QUEENS         0.0114      0.021      0.532      0.594      -0.031       0.053
==============================================================================
Omnibus:                     3980.405   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33143.808
Skew:                           2.361   Prob(JB):                         0.00
Kurtosis:                      12.031   Cond. No.                         35.5
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.5944459717669641
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.208
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              1007.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:12:14   Log-Likelihood:                         -8562.0
No. Observations:                       7659   AIC:                                  1.713e+04
Df Residuals:                           7657   BIC:                                  1.714e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0271      0.001     42.898      0.000       0.026       0.028
MANHAT         0.0115      0.028      0.405      0.686      -0.044       0.067
==============================================================================
Omnibus:                     3980.428   Durbin-Watson:                   1.958
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            33157.904
Skew:                           2.361   Prob(JB):                         0.00
Kurtosis:                      12.034   Cond. No.                         47.0
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.6858402170710972
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.209
Model:                                   OLS   Adj. R-squared (uncentered):              0.208
Method:                        Least Squares   F-statistic:                              1009.
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:12:14   Log-Likelihood:                         -8560.3
No. Observations:                       7659   AIC:                                  1.712e+04
Df Residuals:                           7657   BIC:                                  1.714e+04
Df Model:                                  2                                                  
Covariance Type:                   nonrobust                                                  
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
hour                         0.0271      0.001     44.790      0.000       0.026       0.028
NUMBER OF PERSONS KILLED     0.3296      0.175      1.888      0.059      -0.013       0.672
==============================================================================
Omnibus:                     3969.378   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            32855.187
Skew:                           2.355   Prob(JB):                         0.00
Kurtosis:                      11.987   Cond. No.                         289.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4.0666379515353367e-16
0.059032381034271476

9.6.9. Add Bronx

crunch_nums()
                                    OLS Regression Results                                    
==============================================================================================
Dep. Variable:     NUMBER OF PERSONS INJURED   R-squared (uncentered):                   0.216
Model:                                   OLS   Adj. R-squared (uncentered):              0.216
Method:                        Least Squares   F-statistic:                              421.9
Date:                       Tue, 19 Apr 2022   Prob (F-statistic):                        0.00
Time:                               05:19:25   Log-Likelihood:                         -8524.1
No. Observations:                       7659   AIC:                                  1.706e+04
Df Residuals:                           7654   BIC:                                  1.709e+04
Df Model:                                  5                                                  
Covariance Type:                   nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
hour           0.0202      0.001     20.088      0.000       0.018       0.022
date           0.0056      0.001      7.367      0.000       0.004       0.007
BRONX          0.0624      0.026      2.412      0.016       0.012       0.113
BROOK          0.0472      0.020      2.333      0.020       0.008       0.087
STATEN         0.0417      0.053      0.793      0.428      -0.061       0.145
==============================================================================
Omnibus:                     4043.897   Durbin-Watson:                   1.945
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            34284.761
Skew:                           2.403   Prob(JB):                         0.00
Kurtosis:                      12.184   Cond. No.                         137.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
1
0.42787080095123353

9.6.10. Add Brooklyn

In the new model we have Hour, Date, Bronx, and Brooklyn. A very different outcome than we saw with the other style of selection