2. Categorical Predictors

The syntax for handling categorical predictors differs between standard regression/two-stage models (i.e. Lm and Lm2) and multi-level models (Lmer) in pymer4. This is because formula parsing is handed off to R for Lmer models, but handled in Python for the other models.

Lm and Lm2 Models

Lm and Lm2 models use patsy to parse model formulae. Patsy is very powerful and has built-in support for categorical coding schemes: simply wrap a predictor in C() within the model formula. Patsy can also perform some preprocessing, such as scaling and standardization, using special functions like center(). Here are some examples.

# import basic libraries and sample data
import os
import pandas as pd
from pymer4.utils import get_resource_path
from pymer4.models import Lm

# IV3 is a categorical predictor with 3 levels in the sample data
df = pd.read_csv(os.path.join(get_resource_path(), "sample_data.csv"))

Dummy-coded/Treatment contrasts

# Estimate a model using Treatment contrasts (dummy-coding)
# with '1.0' as the reference level
# This is the default of the C() function
model = Lm("DV ~ C(IV3, levels=[1.0, 0.5, 1.5])", data=df)
print(model.fit())
Formula: DV~C(IV3,levels=[1.0,0.5,1.5])

Family: gaussian         Estimator: OLS

Std-errors: non-robust  CIs: standard 95%       Inference: parametric

Number of observations: 564      R^2: 0.004      R^2_adj: 0.001

Log-likelihood: -2728.620        AIC: 5463.241   BIC: 5476.246

Fixed effects:

                                       Estimate  2.5_ci  ...  P-val  Sig
Intercept                                42.721  38.334  ...  0.000  ***
C(IV3, levels=[1.0, 0.5, 1.5])[T.0.5]     1.463  -4.741  ...  0.643
C(IV3, levels=[1.0, 0.5, 1.5])[T.1.5]    -3.419  -9.622  ...  0.280

[3 rows x 8 columns]
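
Patsy also exposes a Treatment contrast object, which lets you select the reference level directly instead of reordering the levels list. As a small sketch (this fits the same model as above, term order aside, using patsy's Treatment(reference=...) argument):

# Same dummy-coding, but picking the reference level explicitly
# rather than by putting it first in a levels list
model = Lm("DV ~ C(IV3, Treatment(reference=1.0))", data=df)
print(model.fit())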

Orthogonal Polynomial Contrasts

# Patsy can do this using the Poly argument to the
# C() function
model = Lm("DV ~ C(IV3, Poly)", data=df)
print(model.fit())
Formula: DV~C(IV3,Poly)

Family: gaussian         Estimator: OLS

Std-errors: non-robust  CIs: standard 95%       Inference: parametric

Number of observations: 564      R^2: 0.004      R^2_adj: 0.001

Log-likelihood: -2728.620        AIC: 5463.241   BIC: 5476.246

Fixed effects:

                        Estimate  2.5_ci  97.5_ci  ...  T-stat  P-val  Sig
Intercept                 42.069  39.537   44.602  ...  32.627  0.000  ***
C(IV3, Poly).Linear       -3.452  -7.838    0.935  ...  -1.546  0.123
C(IV3, Poly).Quadratic    -0.798  -5.185    3.588  ...  -0.357  0.721

[3 rows x 8 columns]

Sum-to-zero contrasts

# Similar to before but with the Sum argument
model = Lm("DV ~ C(IV3, Sum)", data=df)
print(model.fit())
Formula: DV~C(IV3,Sum)

Family: gaussian         Estimator: OLS

Std-errors: non-robust  CIs: standard 95%       Inference: parametric

Number of observations: 564      R^2: 0.004      R^2_adj: 0.001

Log-likelihood: -2728.620        AIC: 5463.241   BIC: 5476.246

Fixed effects:

                    Estimate  2.5_ci  97.5_ci     SE   DF  T-stat  P-val  Sig
Intercept             42.069  39.537   44.602  1.289  561  32.627  0.000  ***
C(IV3, Sum)[S.0.5]     2.115  -1.467    5.697  1.823  561   1.160  0.247
C(IV3, Sum)[S.1.0]     0.652  -2.930    4.234  1.823  561   0.357  0.721

Scaling/Centering

# Moderation with IV2, but centering IV2 first
model = Lm("DV ~ center(IV2) * C(IV3, Sum)", data=df)
print(model.fit())
Formula: DV~center(IV2)*C(IV3,Sum)

Family: gaussian         Estimator: OLS

Std-errors: non-robust  CIs: standard 95%       Inference: parametric

Number of observations: 564      R^2: 0.511      R^2_adj: 0.507

Log-likelihood: -2528.051        AIC: 5068.102   BIC: 5094.113

Fixed effects:

                                Estimate  2.5_ci  97.5_ci  ...  T-stat  P-val  Sig
Intercept                         42.051  40.268   43.833  ...  46.329  0.000  ***
C(IV3, Sum)[S.0.5]                 0.580  -1.942    3.102  ...   0.452  0.652
C(IV3, Sum)[S.1.0]                 0.383  -2.136    2.903  ...   0.299  0.765
center(IV2)                        0.746   0.685    0.807  ...  24.012  0.000  ***
center(IV2):C(IV3, Sum)[S.0.5]     0.050  -0.037    0.137  ...   1.132  0.258
center(IV2):C(IV3, Sum)[S.1.0]    -0.057  -0.144    0.029  ...  -1.306  0.192

[6 rows x 8 columns]

Please refer to the patsy documentation for more details when working with categorical predictors in Lm or Lm2 models.

Lmer Models

Lmer() models currently have support for handling categorical predictors in one of three ways based on how R’s factor() works (see the note at the end of this tutorial):

  • Dummy-coded factor levels (treatment contrasts) in which each model term is the difference between a factor level and a selected reference level

  • Orthogonal polynomial contrasts in which each model term is a polynomial contrast across factor levels (e.g. linear, quadratic, cubic, etc)

  • Custom contrasts for each level of a factor, which should be provided in the manner expected by R.

To make re-parameterizing models easier, factor codings are passed as a dictionary to the factors argument of a model’s .fit(). This obviates the need for adjusting data-frame properties as in R. Note that this is different from the Lm and Lm2 models above, which expect factor codings in their formulae (because that is how patsy handles them).

Each of these ways also enables you to easily compute post-hoc comparisons between factor levels, as well as interactions between continuous predictors and each factor level. See tutorial 3 for more on post-hoc tests.

from pymer4.models import Lmer

# We're going to fit a multi-level logistic regression using the
# dichotomous DV_l variable and the same categorical predictor (IV3)
# as before
model = Lmer("DV_l ~ IV3 + (IV3|Group)", data=df, family="binomial")

Dummy-coding factors

First we’ll use dummy-coding/treatment contrasts with 1.0 as the reference level. This will compute two coefficients: the difference between 0.5 and 1.0, and the difference between 1.5 and 1.0.

print(model.fit(factors={"IV3": ["1.0", "0.5", "1.5"]}))
boundary (singular) fit: see help('isSingular')

Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: DV_l~IV3+(IV3|Group)

Family: binomial         Inference: parametric

Number of observations: 564      Groups: {'Group': 47.0}

Log-likelihood: -389.003         AIC: 796.006

Random effects:

              Name    Var    Std
Group  (Intercept)  0.022  0.148
Group       IV30.5  0.060  0.246
Group       IV31.5  0.038  0.196

               IV1     IV2  Corr
Group  (Intercept)  IV30.5  -1.0
Group  (Intercept)  IV31.5  -1.0
Group       IV30.5  IV31.5   1.0

Fixed effects:

             Estimate  2.5_ci  97.5_ci     SE  ...  Prob_97.5_ci  Z-stat  P-val  Sig
(Intercept)    -0.129  -0.419    0.162  0.148  ...         0.540  -0.867  0.386
IV30.5          0.129  -0.283    0.540  0.210  ...         0.632   0.612  0.541
IV31.5         -0.128  -0.539    0.283  0.210  ...         0.570  -0.612  0.541

[3 rows x 13 columns]

Polynomial contrast coding

Second we’ll use orthogonal polynomial contrasts. This is accomplished by passing ordered=True and specifying the factor levels in the increasing order of the desired linear contrast. R will automatically compute higher-order polynomial contrasts that are orthogonal to this linear contrast. In this example, since there are 3 factor levels, this results in two polynomial terms: a linear contrast we specify below, corresponding to 0.5 < 1.0 < 1.5, and an orthogonal quadratic contrast automatically determined by R, corresponding to 0.5 > 1.0 < 1.5.

print(model.fit(factors={"IV3": ["0.5", "1.0", "1.5"]}, ordered=True))
boundary (singular) fit: see help('isSingular')

boundary (singular) fit: see help('isSingular')

Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: DV_l~IV3+(IV3|Group)

Family: binomial         Inference: parametric

Number of observations: 564      Groups: {'Group': 47.0}

Log-likelihood: -389.003         AIC: 796.006

Random effects:

              Name    Var    Std
Group  (Intercept)  0.010  0.098
Group       IV31.0  0.060  0.246
Group       IV31.5  0.003  0.050

               IV1     IV2  Corr
Group  (Intercept)  IV31.0  -1.0
Group  (Intercept)  IV31.5  -1.0
Group       IV31.0  IV31.5   1.0

Fixed effects:

             Estimate  2.5_ci  97.5_ci     SE  ...  Prob_97.5_ci  Z-stat  P-val  Sig
(Intercept)    -0.128  -0.294    0.037  0.085  ...         0.509  -1.518  0.129
IV3.L          -0.182  -0.469    0.106  0.147  ...         0.526  -1.238  0.216
IV3.Q           0.000  -0.292    0.292  0.149  ...         0.572   0.001  1.000

[3 rows x 13 columns]

Custom contrasts

Lmer models can also take custom factor contrasts, provided in the manner R expects them (see the note at the end of this tutorial for how contrasts work in R). Remember that there can be at most k-1 model terms representing any k-level factor without over-parameterizing the model. If you specify fewer custom contrasts than that, R will generate a set of orthogonal contrasts for the rest of your model terms.

# Compare level '1.0' to the mean of levels '0.5' and '1.5'
# and let R determine the second contrast orthogonal to it

print(model.fit(factors={"IV3": {"1.0": 1, "0.5": -0.5, "1.5": -0.5}}))
boundary (singular) fit: see help('isSingular')

boundary (singular) fit: see help('isSingular')

boundary (singular) fit: see help('isSingular')

Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: DV_l~IV3+(IV3|Group)

Family: binomial         Inference: parametric

Number of observations: 564      Groups: {'Group': 47.0}

Log-likelihood: -389.003         AIC: 796.006

Random effects:

              Name    Var    Std
Group  (Intercept)  0.022  0.148
Group       IV30.5  0.060  0.246
Group       IV31.5  0.038  0.196

               IV1     IV2  Corr
Group  (Intercept)  IV30.5  -1.0
Group  (Intercept)  IV31.5  -1.0
Group       IV30.5  IV31.5   1.0

Fixed effects:

             Estimate  2.5_ci  97.5_ci     SE  ...  Prob_97.5_ci  Z-stat  P-val  Sig
(Intercept)    -0.128  -0.294    0.037  0.085  ...         0.509  -1.518  0.129
IV31           -0.000  -0.358    0.357  0.182  ...         0.588  -0.001  1.000
IV32           -0.182  -0.469    0.106  0.147  ...         0.526  -1.238  0.216

[3 rows x 13 columns]

User-created contrasts (without R)

Another option available to you is fitting a model with only your desired contrast(s), rather than a full set of k-1 contrasts. Contrary to how statistics is usually taught, you don’t ever have to include a full set of k-1 contrasts for a k-level factor! The upside of this approach is that you won’t need to rely on R to compute anything for you (aside from the model fit), and your model will have exactly as many terms as contrasts you desire, giving you complete control. The downside is that post-hoc tests will no longer be available (see tutorial 3 for more information on post-hoc tests), but it’s unlikely you’re doing post-hoc tests if you’re computing a subset of specific contrasts anyway. This is also a useful approach if you don’t want to use patsy’s formula syntax with Lm and Lm2, as noted above.

This can be accomplished by creating new columns in your dataframe that encode the specific hypotheses you want to test, which is trivial to do with pandas’ map and assign methods. For example, here we manually compute a linear contrast by creating a new column in our dataframe and treating it as a continuous variable.

# Create a new column in the dataframe with a custom (linear) contrast
df = df.assign(IV3_custom_lin=df["IV3"].map({0.5: -1, 1.0: 0, 1.5: 1}))
print(df.head())
   Group   IV1  DV_l         DV       IV2  IV3  IV3_custom_lin
0      1  20.0     0   7.936508  4.563492  0.5              -1
1      1  20.0     0  15.277778  0.000000  1.0               0
2      1  20.0     1   0.000000  0.000000  1.5               1
3      1  20.0     1   9.523810  0.000000  0.5              -1
4      1  12.5     0   0.000000  0.000000  1.0               0

Now we can use this variable as a continuous predictor without the need for the factors argument. Notice how the z-stat and p-value of the estimate are the same as the linear polynomial contrast estimated above. The coefficients differ in scale only because R uses [~-0.707, ~0, ~0.707] for its polynomial contrasts rather than [-1, 0, 1] like we did.

# Estimate model
model = Lmer(
    "DV_l ~ IV3_custom_lin + (IV3_custom_lin|Group)", data=df, family="binomial"
)
print(model.fit())
boundary (singular) fit: see help('isSingular')

Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: DV_l~IV3_custom_lin+(IV3_custom_lin|Group)

Family: binomial         Inference: parametric

Number of observations: 564      Groups: {'Group': 47.0}

Log-likelihood: -389.016         AIC: 788.031

Random effects:

                 Name  Var  Std
Group     (Intercept)  0.0  0.0
Group  IV3_custom_lin  0.0  0.0

               IV1             IV2 Corr
Group  (Intercept)  IV3_custom_lin

Fixed effects:

                Estimate  2.5_ci  97.5_ci  ...  Z-stat  P-val  Sig
(Intercept)       -0.128  -0.294    0.037  ...  -1.517  0.129
IV3_custom_lin    -0.128  -0.331    0.075  ...  -1.239  0.215

[2 rows x 13 columns]
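
As a quick numeric check of that scale difference (a minimal sketch using numpy), rescaling our [-1, 0, 1] weights to unit length recovers the ~±0.707 weights R uses for its linear polynomial contrast:

import numpy as np

# Normalizing our custom linear contrast to unit length reproduces
# R's built-in linear polynomial weights for a 3-level factor
lin = np.array([-1.0, 0.0, 1.0])
print(lin / np.linalg.norm(lin))  # [-0.70710678  0.          0.70710678]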

A note on how contrasts in R work

Note

This is just for folks curious about how contrasts in R work

Specifying multiple custom contrasts in R has always been a point of confusion amongst users. This is because the contrasts() command in R doesn’t actually expect contrast weights (i.e. a design matrix) as one would intuit. Rather, it is made for generating contrast coding schemes, which are the inverse of the contrast weight matrix. For a longer explanation with examples see this reference and this reference. For these situations pymer4 offers a few utility functions in pymer4.utils to convert between these matrix types: R2con() and con2R().
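
For example, here is a minimal sketch of converting desired contrast weights into the coding matrix R expects, assuming con2R() accepts a 2D array with one row of weights per contrast, ordered by factor level:

import numpy as np
from pymer4.utils import con2R

# Two desired contrasts over a 3-level factor (levels ordered 0.5, 1.0, 1.5):
# row 1: a linear trend; row 2: the middle level vs the mean of the outer levels
# (assumes each row is one contrast, in factor-level order)
desired = np.array([[-1, 0, 1], [-0.5, 1, -0.5]])

# con2R() inverts these weights into the coding matrix you would
# hand to R's contrasts() function
coding = con2R(desired)
print(coding)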
