9.1. Describing Models with Patsy

From patsy documentation:

patsy is a Python package for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. It is closely inspired by and compatible with the formula mini-language used in R and S.

Patsy’s goal is to become the standard high-level interface to describing statistical models in Python, regardless of what particular model or library is being used underneath.

Pros of patsy:

  • Allows data transformations to be specified using arbitrary Python code: instead of x, we can write log(x) or (x > 0)

  • Gives a range of convenient options for coding categorical variables

  • Provides basic built-in transformations

  • Provides a language for easy-to-read specification of linear constraints

  • Features a simple API for integration into statistical packages. Specifically used in statsmodels, which we will be discussing further in the course.

Cons of patsy:

  • Will not do statistics. Simply allows you to describe models in general terms

9.1.1. Installing Patsy

To install patsy, use a simple pip command.

pip install patsy

Then, you should be able to import the package. The dmatrices and dmatrix objects are the most important.

from patsy import dmatrices, dmatrix

9.1.2. Basic example

First let’s read in our NYC job data to have something to work with. We’ll first take the Initial Cost column, which has string values in the form of “$XXXX” and turn them into integer values. Then, we’ll calculate the difference in the proposed zoning square feet and the existing zoning square feet.

import pandas as pd
import numpy as np

nyc_jobs = pd.read_csv("../data/nyc_DobJobApp_2021.csv")

nyc_jobs["cost"] = [x.split("$")[1] for x in nyc_jobs["Initial Cost"]]
nyc_jobs["cost"] = [int(float(x)) for x in nyc_jobs["cost"]]

nyc_jobs["zoning_diff"] = nyc_jobs["Proposed Zoning Sqft"] - nyc_jobs["Existing Zoning Sqft"]

nyc_jobs_zoning = nyc_jobs.drop(nyc_jobs[nyc_jobs.zoning_diff <= 0].index)
nyc_jobs_zoning = nyc_jobs_zoning.drop(nyc_jobs_zoning[nyc_jobs_zoning.cost == 0].index)
/tmp/ipykernel_4195/772679357.py:4: DtypeWarning: Columns (23,24,25,26,29,31,32,59,60) have mixed types. Specify dtype option on import or set low_memory=False.
  nyc_jobs = pd.read_csv("../data/nyc_DobJobApp_2021.csv")

Now we can create a new matrix with patsy’s dmatrices.

dmatrices("cost ~ zoning_diff", nyc_jobs_zoning)
(DesignMatrix with shape (1016, 1)
      cost
    456000
    205000
    998000
    643000
   1183365
    527000
     44000
    405000
    705795
    333000
    380220
    242850
       100
    426156
    321000
    232775
    670000
    110000
    217500
    119000
    800000
    320000
    721632
    391300
     89500
    337962
    199980
    125000
     45000
   1000000
   [986 rows omitted]
   Terms:
     'cost' (column 0)
   (to view full data, use np.asarray(this_obj)),
 DesignMatrix with shape (1016, 2)
   Intercept  zoning_diff
           1         3800
           1          571
           1         1505
           1         2209
           1         1182
           1         1894
           1          436
           1          600
           1          987
           1          227
           1         1302
           1          879
           1         2318
           1           22
           1          302
           1          659
           1          126
           1          724
           1         1678
           1          786
           1         5355
           1           56
           1         2536
           1         2506
           1         1624
           1         1025
           1          876
           1          273
           1          916
           1         1422
   [986 rows omitted]
   Terms:
     'Intercept' (column 0)
     'zoning_diff' (column 1)
   (to view full data, use np.asarray(this_obj)))

The return value is a Python tuple containing two DesignMatrix objects. The first dmatrix is the left-hand side of the formula, and the second is the right-hand side. They’re just oridinary numpy arrays with extra metadata. We can pass these to a regression function to perform some analysis.

outcome, predictors = dmatrices("cost ~ zoning_diff", nyc_jobs_zoning)
betas = np.linalg.lstsq(predictors, outcome)[0]
betas
/tmp/ipykernel_4195/1476851890.py:2: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
  betas = np.linalg.lstsq(predictors, outcome)[0]
array([[218111.1726294 ],
       [   409.22199844]])

The metadata is stored in an extra attribute, .design_info, which has valuable information.

predictors.design_info.column_names
['Intercept', 'zoning_diff']

9.1.3. Transforming Code

You can transform code with patsy. For instance, you can use simple Python code.

dmatrix("np.log(zoning_diff)", nyc_jobs_zoning)
DesignMatrix with shape (1016, 2)
  Intercept  np.log(zoning_diff)
          1              8.24276
          1              6.34739
          1              7.31655
          1              7.70030
          1              7.07496
          1              7.54645
          1              6.07764
          1              6.39693
          1              6.89467
          1              5.42495
          1              7.17166
          1              6.77878
          1              7.74846
          1              3.09104
          1              5.71043
          1              6.49072
          1              4.83628
          1              6.58479
          1              7.42536
          1              6.66696
          1              8.58579
          1              4.02535
          1              7.83834
          1              7.82644
          1              7.39265
          1              6.93245
          1              6.77537
          1              5.60947
          1              6.82002
          1              7.25982
  [986 rows omitted]
  Terms:
    'Intercept' (column 0)
    'np.log(zoning_diff)' (column 1)
  (to view full data, use np.asarray(this_obj))

There are also various built-in transformation functions from patsy, such as center and standardize. The documentation for these and other functions are in the patsy.builtins API reference.

dmatrix("center(zoning_diff) + standardize(zoning_diff)", nyc_jobs_zoning)
DesignMatrix with shape (1016, 3)
  Intercept  center(zoning_diff)  standardize(zoning_diff)
          1            996.12303                   0.07860
          1          -2232.87697                  -0.17619
          1          -1298.87697                  -0.10249
          1           -594.87697                  -0.04694
          1          -1621.87697                  -0.12798
          1           -909.87697                  -0.07180
          1          -2367.87697                  -0.18684
          1          -2203.87697                  -0.17390
          1          -1816.87697                  -0.14337
          1          -2576.87697                  -0.20334
          1          -1501.87697                  -0.11851
          1          -1924.87697                  -0.15189
          1           -485.87697                  -0.03834
          1          -2781.87697                  -0.21951
          1          -2501.87697                  -0.19742
          1          -2144.87697                  -0.16925
          1          -2677.87697                  -0.21131
          1          -2079.87697                  -0.16412
          1          -1125.87697                  -0.08884
          1          -2017.87697                  -0.15923
          1           2551.12303                   0.20130
          1          -2747.87697                  -0.21683
          1           -267.87697                  -0.02114
          1           -297.87697                  -0.02350
          1          -1179.87697                  -0.09310
          1          -1778.87697                  -0.14037
          1          -1927.87697                  -0.15213
          1          -2530.87697                  -0.19971
          1          -1887.87697                  -0.14897
          1          -1381.87697                  -0.10904
  [986 rows omitted]
  Terms:
    'Intercept' (column 0)
    'center(zoning_diff)' (column 1)
    'standardize(zoning_diff)' (column 2)
  (to view full data, use np.asarray(this_obj))

You can define your own transformation functions if you’d like as well.

def square(x):
    return(x*2)

dmatrix("zoning_diff + square(zoning_diff)", nyc_jobs_zoning)
DesignMatrix with shape (1016, 3)
  Intercept  zoning_diff  square(zoning_diff)
          1         3800                 7600
          1          571                 1142
          1         1505                 3010
          1         2209                 4418
          1         1182                 2364
          1         1894                 3788
          1          436                  872
          1          600                 1200
          1          987                 1974
          1          227                  454
          1         1302                 2604
          1          879                 1758
          1         2318                 4636
          1           22                   44
          1          302                  604
          1          659                 1318
          1          126                  252
          1          724                 1448
          1         1678                 3356
          1          786                 1572
          1         5355                10710
          1           56                  112
          1         2536                 5072
          1         2506                 5012
          1         1624                 3248
          1         1025                 2050
          1          876                 1752
          1          273                  546
          1          916                 1832
          1         1422                 2844
  [986 rows omitted]
  Terms:
    'Intercept' (column 0)
    'zoning_diff' (column 1)
    'square(zoning_diff)' (column 2)
  (to view full data, use np.asarray(this_obj))

9.1.4. Special Transformations

There are other special built-in transformations that patsy provides which are particularly useful when dealing with difficult characters. Even though patsy uses a “+” in the syntax, the code still needs to be Python code. This means that something like “Existing Zoning Sqft” will produce an error. The Q() transformation however can fix that.

# dmatrix("Initial Cost", nyc_jobs)
# Produces error
dmatrix("Q('Existing Zoning Sqft')", nyc_jobs_zoning)
DesignMatrix with shape (1016, 2)
  Intercept  Q('Existing Zoning Sqft')
          1                       4310
          1                       1276
          1                       1525
          1                       1713
          1                       1954
          1                       1014
          1                       2711
          1                       3346
          1                       5016
          1                       2136
          1                       1972
          1                       2359
          1                       3739
          1                       2094
          1                       1501
          1                       2117
          1                       2373
          1                       1224
          1                       1015
          1                       2592
          1                       1827
          1                       1743
          1                       1033
          1                       5009
          1                       3043
          1                       1246
          1                       2886
          1                        903
          1                       2900
          1                       4098
  [986 rows omitted]
  Terms:
    'Intercept' (column 0)
    "Q('Existing Zoning Sqft')" (column 1)
  (to view full data, use np.asarray(this_obj))

Though the Q() function makes this possible, it is still not recommended. The best option here would be to rename “Initial Cost” to something with better style, such as “initial_cost.”

Another useful transformation is I(). In the case where you actually want to use addition, instead of writing your ownn function, you can use I() to tell patsy you want to use addition (or subtraction in this case).

dmatrix("I(Q('Proposed Zoning Sqft') - Q('Existing Zoning Sqft'))", nyc_jobs_zoning)
DesignMatrix with shape (1016, 2)
  Intercept  I(Q('Proposed Zoning Sqft') - Q('Existing Zoning Sqft'))
          1                                                      3800
          1                                                       571
          1                                                      1505
          1                                                      2209
          1                                                      1182
          1                                                      1894
          1                                                       436
          1                                                       600
          1                                                       987
          1                                                       227
          1                                                      1302
          1                                                       879
          1                                                      2318
          1                                                        22
          1                                                       302
          1                                                       659
          1                                                       126
          1                                                       724
          1                                                      1678
          1                                                       786
          1                                                      5355
          1                                                        56
          1                                                      2536
          1                                                      2506
          1                                                      1624
          1                                                      1025
          1                                                       876
          1                                                       273
          1                                                       916
          1                                                      1422
  [986 rows omitted]
  Terms:
    'Intercept' (column 0)
    "I(Q('Proposed Zoning Sqft') - Q('Existing Zoning Sqft'))" (column 1)
  (to view full data, use np.asarray(this_obj))

9.1.5. Categorical Data

patsy handles categorical data well. If you include string or boolean type data, it will automatically turn them categorical. In this instance, the T.Borough shows that the data has been transformed into treatment code. You can also add the return_type = dataframe to have the return value be a pandas.dataframe instead.

dmatrices("cost ~ Borough", nyc_jobs_zoning, return_type = "dataframe")
(            cost
 48      456000.0
 55      205000.0
 58      998000.0
 62      643000.0
 72     1183365.0
 ...          ...
 18693   120000.0
 18813   100000.0
 18998     1000.0
 19007     1000.0
 19749   800000.0
 
 [1016 rows x 1 columns],
        Intercept  Borough[T.BROOKLYN]  Borough[T.MANHATTAN]  \
 48           1.0                  0.0                   0.0   
 55           1.0                  1.0                   0.0   
 58           1.0                  0.0                   0.0   
 62           1.0                  0.0                   0.0   
 72           1.0                  1.0                   0.0   
 ...          ...                  ...                   ...   
 18693        1.0                  0.0                   0.0   
 18813        1.0                  0.0                   1.0   
 18998        1.0                  0.0                   0.0   
 19007        1.0                  1.0                   0.0   
 19749        1.0                  0.0                   0.0   
 
        Borough[T.QUEENS]  Borough[T.STATEN ISLAND]  
 48                   1.0                       0.0  
 55                   0.0                       0.0  
 58                   1.0                       0.0  
 62                   1.0                       0.0  
 72                   0.0                       0.0  
 ...                  ...                       ...  
 18693                1.0                       0.0  
 18813                0.0                       0.0  
 18998                1.0                       0.0  
 19007                0.0                       0.0  
 19749                1.0                       0.0  
 
 [1016 rows x 5 columns])

9.1.6. Interaction Terms

Interaction terms are a strong reason to use patsy. You can easily apply interaction with :.

dmatrix("0 + Borough:Q('Building Type')", nyc_jobs_zoning, return_type = "dataframe")
Borough[BRONX]:Q('Building Type')[1-2-3 FAMILY] Borough[BROOKLYN]:Q('Building Type')[1-2-3 FAMILY] Borough[MANHATTAN]:Q('Building Type')[1-2-3 FAMILY] Borough[QUEENS]:Q('Building Type')[1-2-3 FAMILY] Borough[STATEN ISLAND]:Q('Building Type')[1-2-3 FAMILY] Borough[BRONX]:Q('Building Type')[OTHERS] Borough[BROOKLYN]:Q('Building Type')[OTHERS] Borough[MANHATTAN]:Q('Building Type')[OTHERS] Borough[QUEENS]:Q('Building Type')[OTHERS] Borough[STATEN ISLAND]:Q('Building Type')[OTHERS]
48 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
55 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
58 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
62 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
72 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
18693 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
18813 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
18998 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
19007 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
19749 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

1016 rows × 10 columns

With both main effects and interaction effects in the model, patsy goes from lower-order effects to higher-order effects. This means that each set of columns is measures the additional contribution of this effect. This removes redundancy. The short-hand is *. Here, * means the same as x1 + x2 + x1:x2.

dmatrix("Borough*Q('Building Type')", nyc_jobs_zoning, return_type = "dataframe")
Intercept Borough[T.BROOKLYN] Borough[T.MANHATTAN] Borough[T.QUEENS] Borough[T.STATEN ISLAND] Q('Building Type')[T.OTHERS] Borough[T.BROOKLYN]:Q('Building Type')[T.OTHERS] Borough[T.MANHATTAN]:Q('Building Type')[T.OTHERS] Borough[T.QUEENS]:Q('Building Type')[T.OTHERS] Borough[T.STATEN ISLAND]:Q('Building Type')[T.OTHERS]
48 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
55 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
58 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
62 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
72 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
18693 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
18813 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
18998 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
19007 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
19749 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0

1016 rows × 10 columns

You can also write interaction terms between categorical and numerical variables.

dmatrix("Borough:zoning_diff", nyc_jobs_zoning, return_type = "dataframe")
Intercept Borough[BRONX]:zoning_diff Borough[BROOKLYN]:zoning_diff Borough[MANHATTAN]:zoning_diff Borough[QUEENS]:zoning_diff Borough[STATEN ISLAND]:zoning_diff
48 1.0 0.0 0.0 0.0 3800.0 0.0
55 1.0 0.0 571.0 0.0 0.0 0.0
58 1.0 0.0 0.0 0.0 1505.0 0.0
62 1.0 0.0 0.0 0.0 2209.0 0.0
72 1.0 0.0 1182.0 0.0 0.0 0.0
... ... ... ... ... ... ...
18693 1.0 0.0 0.0 0.0 456.0 0.0
18813 1.0 0.0 0.0 3434.0 0.0 0.0
18998 1.0 0.0 0.0 0.0 1349.0 0.0
19007 1.0 0.0 498.0 0.0 0.0 0.0
19749 1.0 0.0 0.0 0.0 230.0 0.0

1016 rows × 6 columns

The code below removes some redundancy, similar to earlier. This will have one slope for the Bronx, and then each other slope is difference between the specific borough and the Bronx.

dmatrix("zoning_diff + Borough:zoning_diff", nyc_jobs_zoning, return_type = "dataframe")
Intercept zoning_diff Borough[T.BROOKLYN]:zoning_diff Borough[T.MANHATTAN]:zoning_diff Borough[T.QUEENS]:zoning_diff Borough[T.STATEN ISLAND]:zoning_diff
48 1.0 3800.0 0.0 0.0 3800.0 0.0
55 1.0 571.0 571.0 0.0 0.0 0.0
58 1.0 1505.0 0.0 0.0 1505.0 0.0
62 1.0 2209.0 0.0 0.0 2209.0 0.0
72 1.0 1182.0 1182.0 0.0 0.0 0.0
... ... ... ... ... ... ...
18693 1.0 456.0 0.0 0.0 456.0 0.0
18813 1.0 3434.0 0.0 3434.0 0.0 0.0
18998 1.0 1349.0 0.0 0.0 1349.0 0.0
19007 1.0 498.0 498.0 0.0 0.0 0.0
19749 1.0 230.0 0.0 0.0 230.0 0.0

1016 rows × 6 columns