Describing Models with Patsy
Contents
9.1. Describing Models with Patsy¶
From patsy
documentation:
patsy
is a Python package for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. It is closely inspired by and compatible with the formula mini-language used in R and S.
Patsy’s goal is to become the standard high-level interface to describing statistical models in Python, regardless of what particular model or library is being used underneath.
Pros of patsy
:
Allows data transformations to be specified using arbitrary Python code: instead of x, we can write log(x) or (x > 0)
Gives a range of convenient options for coding categorical variables
Provides basic built-in transformations
Provides a language for easy-to-read specification of linear constraints
Features a simple API for integration into statistical packages. Specifically used in
statsmodels
, which we will be discussing further in the course.
Cons of patsy
:
Will not do statistics. Simply allows you to describe models in general terms
9.1.1. Installing Patsy¶
To install patsy
, use a simple pip command.
pip install patsy
Then, you should be able to import the package. The dmatrices
and dmatrix
objects are the most important.
from patsy import dmatrices, dmatrix
9.1.2. Basic example¶
First let’s read in our NYC job data to have something to work with. We’ll first take the Initial Cost
column, which has string values in the form of “$XXXX” and turn them into integer values. Then, we’ll calculate the difference in the proposed zoning square feet and the existing zoning square feet.
import pandas as pd
import numpy as np
nyc_jobs = pd.read_csv("../data/nyc_DobJobApp_2021.csv")
nyc_jobs["cost"] = [x.split("$")[1] for x in nyc_jobs["Initial Cost"]]
nyc_jobs["cost"] = [int(float(x)) for x in nyc_jobs["cost"]]
nyc_jobs["zoning_diff"] = nyc_jobs["Proposed Zoning Sqft"] - nyc_jobs["Existing Zoning Sqft"]
nyc_jobs_zoning = nyc_jobs.drop(nyc_jobs[nyc_jobs.zoning_diff <= 0].index)
nyc_jobs_zoning = nyc_jobs_zoning.drop(nyc_jobs_zoning[nyc_jobs_zoning.cost == 0].index)
/tmp/ipykernel_4195/772679357.py:4: DtypeWarning: Columns (23,24,25,26,29,31,32,59,60) have mixed types. Specify dtype option on import or set low_memory=False.
nyc_jobs = pd.read_csv("../data/nyc_DobJobApp_2021.csv")
Now we can create a new matrix with patsy
’s dmatrices
.
dmatrices("cost ~ zoning_diff", nyc_jobs_zoning)
(DesignMatrix with shape (1016, 1)
cost
456000
205000
998000
643000
1183365
527000
44000
405000
705795
333000
380220
242850
100
426156
321000
232775
670000
110000
217500
119000
800000
320000
721632
391300
89500
337962
199980
125000
45000
1000000
[986 rows omitted]
Terms:
'cost' (column 0)
(to view full data, use np.asarray(this_obj)),
DesignMatrix with shape (1016, 2)
Intercept zoning_diff
1 3800
1 571
1 1505
1 2209
1 1182
1 1894
1 436
1 600
1 987
1 227
1 1302
1 879
1 2318
1 22
1 302
1 659
1 126
1 724
1 1678
1 786
1 5355
1 56
1 2536
1 2506
1 1624
1 1025
1 876
1 273
1 916
1 1422
[986 rows omitted]
Terms:
'Intercept' (column 0)
'zoning_diff' (column 1)
(to view full data, use np.asarray(this_obj)))
The return value is a Python tuple containing two DesignMatrix
objects. The first dmatrix is the left-hand side of the formula, and the second is the right-hand side. They’re just oridinary numpy arrays with extra metadata. We can pass these to a regression function to perform some analysis.
outcome, predictors = dmatrices("cost ~ zoning_diff", nyc_jobs_zoning)
betas = np.linalg.lstsq(predictors, outcome)[0]
betas
/tmp/ipykernel_4195/1476851890.py:2: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
betas = np.linalg.lstsq(predictors, outcome)[0]
array([[218111.1726294 ],
[ 409.22199844]])
The metadata is stored in an extra attribute, .design_info
, which has valuable information.
predictors.design_info.column_names
['Intercept', 'zoning_diff']
9.1.3. Transforming Code¶
You can transform code with patsy
. For instance, you can use simple Python code.
dmatrix("np.log(zoning_diff)", nyc_jobs_zoning)
DesignMatrix with shape (1016, 2)
Intercept np.log(zoning_diff)
1 8.24276
1 6.34739
1 7.31655
1 7.70030
1 7.07496
1 7.54645
1 6.07764
1 6.39693
1 6.89467
1 5.42495
1 7.17166
1 6.77878
1 7.74846
1 3.09104
1 5.71043
1 6.49072
1 4.83628
1 6.58479
1 7.42536
1 6.66696
1 8.58579
1 4.02535
1 7.83834
1 7.82644
1 7.39265
1 6.93245
1 6.77537
1 5.60947
1 6.82002
1 7.25982
[986 rows omitted]
Terms:
'Intercept' (column 0)
'np.log(zoning_diff)' (column 1)
(to view full data, use np.asarray(this_obj))
There are also various built-in transformation functions from patsy
, such as center
and standardize
. The documentation for these and other functions are in the patsy.builtins
API reference.
dmatrix("center(zoning_diff) + standardize(zoning_diff)", nyc_jobs_zoning)
DesignMatrix with shape (1016, 3)
Intercept center(zoning_diff) standardize(zoning_diff)
1 996.12303 0.07860
1 -2232.87697 -0.17619
1 -1298.87697 -0.10249
1 -594.87697 -0.04694
1 -1621.87697 -0.12798
1 -909.87697 -0.07180
1 -2367.87697 -0.18684
1 -2203.87697 -0.17390
1 -1816.87697 -0.14337
1 -2576.87697 -0.20334
1 -1501.87697 -0.11851
1 -1924.87697 -0.15189
1 -485.87697 -0.03834
1 -2781.87697 -0.21951
1 -2501.87697 -0.19742
1 -2144.87697 -0.16925
1 -2677.87697 -0.21131
1 -2079.87697 -0.16412
1 -1125.87697 -0.08884
1 -2017.87697 -0.15923
1 2551.12303 0.20130
1 -2747.87697 -0.21683
1 -267.87697 -0.02114
1 -297.87697 -0.02350
1 -1179.87697 -0.09310
1 -1778.87697 -0.14037
1 -1927.87697 -0.15213
1 -2530.87697 -0.19971
1 -1887.87697 -0.14897
1 -1381.87697 -0.10904
[986 rows omitted]
Terms:
'Intercept' (column 0)
'center(zoning_diff)' (column 1)
'standardize(zoning_diff)' (column 2)
(to view full data, use np.asarray(this_obj))
You can define your own transformation functions if you’d like as well.
def square(x):
return(x*2)
dmatrix("zoning_diff + square(zoning_diff)", nyc_jobs_zoning)
DesignMatrix with shape (1016, 3)
Intercept zoning_diff square(zoning_diff)
1 3800 7600
1 571 1142
1 1505 3010
1 2209 4418
1 1182 2364
1 1894 3788
1 436 872
1 600 1200
1 987 1974
1 227 454
1 1302 2604
1 879 1758
1 2318 4636
1 22 44
1 302 604
1 659 1318
1 126 252
1 724 1448
1 1678 3356
1 786 1572
1 5355 10710
1 56 112
1 2536 5072
1 2506 5012
1 1624 3248
1 1025 2050
1 876 1752
1 273 546
1 916 1832
1 1422 2844
[986 rows omitted]
Terms:
'Intercept' (column 0)
'zoning_diff' (column 1)
'square(zoning_diff)' (column 2)
(to view full data, use np.asarray(this_obj))
9.1.4. Special Transformations¶
There are other special built-in transformations that patsy
provides which are particularly useful when dealing with difficult characters. Even though patsy
uses a “+” in the syntax, the code still needs to be Python code. This means that something like “Existing Zoning Sqft” will produce an error. The Q()
transformation however can fix that.
# dmatrix("Initial Cost", nyc_jobs)
# Produces error
dmatrix("Q('Existing Zoning Sqft')", nyc_jobs_zoning)
DesignMatrix with shape (1016, 2)
Intercept Q('Existing Zoning Sqft')
1 4310
1 1276
1 1525
1 1713
1 1954
1 1014
1 2711
1 3346
1 5016
1 2136
1 1972
1 2359
1 3739
1 2094
1 1501
1 2117
1 2373
1 1224
1 1015
1 2592
1 1827
1 1743
1 1033
1 5009
1 3043
1 1246
1 2886
1 903
1 2900
1 4098
[986 rows omitted]
Terms:
'Intercept' (column 0)
"Q('Existing Zoning Sqft')" (column 1)
(to view full data, use np.asarray(this_obj))
Though the Q()
function makes this possible, it is still not recommended. The best option here would be to rename “Initial Cost” to something with better style, such as “initial_cost.”
Another useful transformation is I()
. In the case where you actually want to use addition, instead of writing your ownn function, you can use I()
to tell patsy
you want to use addition (or subtraction in this case).
dmatrix("I(Q('Proposed Zoning Sqft') - Q('Existing Zoning Sqft'))", nyc_jobs_zoning)
DesignMatrix with shape (1016, 2)
Intercept I(Q('Proposed Zoning Sqft') - Q('Existing Zoning Sqft'))
1 3800
1 571
1 1505
1 2209
1 1182
1 1894
1 436
1 600
1 987
1 227
1 1302
1 879
1 2318
1 22
1 302
1 659
1 126
1 724
1 1678
1 786
1 5355
1 56
1 2536
1 2506
1 1624
1 1025
1 876
1 273
1 916
1 1422
[986 rows omitted]
Terms:
'Intercept' (column 0)
"I(Q('Proposed Zoning Sqft') - Q('Existing Zoning Sqft'))" (column 1)
(to view full data, use np.asarray(this_obj))
9.1.5. Categorical Data¶
patsy
handles categorical data well. If you include string or boolean type data, it will automatically turn them categorical. In this instance, the T.Borough shows that the data has been transformed into treatment code. You can also add the return_type = dataframe
to have the return value be a pandas.dataframe
instead.
dmatrices("cost ~ Borough", nyc_jobs_zoning, return_type = "dataframe")
( cost
48 456000.0
55 205000.0
58 998000.0
62 643000.0
72 1183365.0
... ...
18693 120000.0
18813 100000.0
18998 1000.0
19007 1000.0
19749 800000.0
[1016 rows x 1 columns],
Intercept Borough[T.BROOKLYN] Borough[T.MANHATTAN] \
48 1.0 0.0 0.0
55 1.0 1.0 0.0
58 1.0 0.0 0.0
62 1.0 0.0 0.0
72 1.0 1.0 0.0
... ... ... ...
18693 1.0 0.0 0.0
18813 1.0 0.0 1.0
18998 1.0 0.0 0.0
19007 1.0 1.0 0.0
19749 1.0 0.0 0.0
Borough[T.QUEENS] Borough[T.STATEN ISLAND]
48 1.0 0.0
55 0.0 0.0
58 1.0 0.0
62 1.0 0.0
72 0.0 0.0
... ... ...
18693 1.0 0.0
18813 0.0 0.0
18998 1.0 0.0
19007 0.0 0.0
19749 1.0 0.0
[1016 rows x 5 columns])
9.1.6. Interaction Terms¶
Interaction terms are a strong reason to use patsy
. You can easily apply interaction with :
.
dmatrix("0 + Borough:Q('Building Type')", nyc_jobs_zoning, return_type = "dataframe")
Borough[BRONX]:Q('Building Type')[1-2-3 FAMILY] | Borough[BROOKLYN]:Q('Building Type')[1-2-3 FAMILY] | Borough[MANHATTAN]:Q('Building Type')[1-2-3 FAMILY] | Borough[QUEENS]:Q('Building Type')[1-2-3 FAMILY] | Borough[STATEN ISLAND]:Q('Building Type')[1-2-3 FAMILY] | Borough[BRONX]:Q('Building Type')[OTHERS] | Borough[BROOKLYN]:Q('Building Type')[OTHERS] | Borough[MANHATTAN]:Q('Building Type')[OTHERS] | Borough[QUEENS]:Q('Building Type')[OTHERS] | Borough[STATEN ISLAND]:Q('Building Type')[OTHERS] | |
---|---|---|---|---|---|---|---|---|---|---|
48 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
55 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
58 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
62 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
72 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18693 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
18813 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
18998 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
19007 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
19749 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1016 rows × 10 columns
With both main effects and interaction effects in the model, patsy
goes from lower-order effects to higher-order effects. This means that each set of columns is measures the additional contribution of this effect. This removes redundancy. The short-hand is *
. Here, *
means the same as x1 + x2 + x1:x2
.
dmatrix("Borough*Q('Building Type')", nyc_jobs_zoning, return_type = "dataframe")
Intercept | Borough[T.BROOKLYN] | Borough[T.MANHATTAN] | Borough[T.QUEENS] | Borough[T.STATEN ISLAND] | Q('Building Type')[T.OTHERS] | Borough[T.BROOKLYN]:Q('Building Type')[T.OTHERS] | Borough[T.MANHATTAN]:Q('Building Type')[T.OTHERS] | Borough[T.QUEENS]:Q('Building Type')[T.OTHERS] | Borough[T.STATEN ISLAND]:Q('Building Type')[T.OTHERS] | |
---|---|---|---|---|---|---|---|---|---|---|
48 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
55 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
58 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
62 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
72 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18693 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
18813 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
18998 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
19007 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
19749 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1016 rows × 10 columns
You can also write interaction terms between categorical and numerical variables.
dmatrix("Borough:zoning_diff", nyc_jobs_zoning, return_type = "dataframe")
Intercept | Borough[BRONX]:zoning_diff | Borough[BROOKLYN]:zoning_diff | Borough[MANHATTAN]:zoning_diff | Borough[QUEENS]:zoning_diff | Borough[STATEN ISLAND]:zoning_diff | |
---|---|---|---|---|---|---|
48 | 1.0 | 0.0 | 0.0 | 0.0 | 3800.0 | 0.0 |
55 | 1.0 | 0.0 | 571.0 | 0.0 | 0.0 | 0.0 |
58 | 1.0 | 0.0 | 0.0 | 0.0 | 1505.0 | 0.0 |
62 | 1.0 | 0.0 | 0.0 | 0.0 | 2209.0 | 0.0 |
72 | 1.0 | 0.0 | 1182.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
18693 | 1.0 | 0.0 | 0.0 | 0.0 | 456.0 | 0.0 |
18813 | 1.0 | 0.0 | 0.0 | 3434.0 | 0.0 | 0.0 |
18998 | 1.0 | 0.0 | 0.0 | 0.0 | 1349.0 | 0.0 |
19007 | 1.0 | 0.0 | 498.0 | 0.0 | 0.0 | 0.0 |
19749 | 1.0 | 0.0 | 0.0 | 0.0 | 230.0 | 0.0 |
1016 rows × 6 columns
The code below removes some redundancy, similar to earlier. This will have one slope for the Bronx, and then each other slope is difference between the specific borough and the Bronx.
dmatrix("zoning_diff + Borough:zoning_diff", nyc_jobs_zoning, return_type = "dataframe")
Intercept | zoning_diff | Borough[T.BROOKLYN]:zoning_diff | Borough[T.MANHATTAN]:zoning_diff | Borough[T.QUEENS]:zoning_diff | Borough[T.STATEN ISLAND]:zoning_diff | |
---|---|---|---|---|---|---|
48 | 1.0 | 3800.0 | 0.0 | 0.0 | 3800.0 | 0.0 |
55 | 1.0 | 571.0 | 571.0 | 0.0 | 0.0 | 0.0 |
58 | 1.0 | 1505.0 | 0.0 | 0.0 | 1505.0 | 0.0 |
62 | 1.0 | 2209.0 | 0.0 | 0.0 | 2209.0 | 0.0 |
72 | 1.0 | 1182.0 | 1182.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
18693 | 1.0 | 456.0 | 0.0 | 0.0 | 456.0 | 0.0 |
18813 | 1.0 | 3434.0 | 0.0 | 3434.0 | 0.0 | 0.0 |
18998 | 1.0 | 1349.0 | 0.0 | 0.0 | 1349.0 | 0.0 |
19007 | 1.0 | 498.0 | 498.0 | 0.0 | 0.0 | 0.0 |
19749 | 1.0 | 230.0 | 0.0 | 0.0 | 230.0 | 0.0 |
1016 rows × 6 columns