1. Basic Usage Guide
pymer4 comes with sample data for testing purposes, which we’ll use for most of the tutorials.
This sample data has:
Two kinds of dependent variables: DV (continuous), DV_l (dichotomous)
Three kinds of independent variables: IV1 (continuous), IV2 (continuous), IV3 (categorical)
One grouping variable for multi-level modeling: Group.
Let’s check it out below:
# import some basic libraries
import os
import pandas as pd
# import utility function for sample data path
from pymer4.utils import get_resource_path
# Load and checkout sample data
df = pd.read_csv(os.path.join(get_resource_path(), "sample_data.csv"))
print(df.head())
Group IV1 DV_l DV IV2 IV3
0 1 20.0 0 7.936508 4.563492 0.5
1 1 20.0 0 15.277778 0.000000 1.0
2 1 20.0 1 0.000000 0.000000 1.5
3 1 20.0 1 9.523810 0.000000 0.5
4 1 12.5 0 0.000000 0.000000 1.0
Standard regression models
Fitting a standard regression model is accomplished using the Lm model class in pymer4. All we need to do is initialize a model with a formula and some data, then call its .fit() method.
By default, the output of .fit() is formatted as a blend of summary() in R and .summary() from statsmodels. It includes metadata about the model, data, and overall fit, as well as estimates and inference results for model terms.
# Import the linear regression model class
from pymer4.models import Lm
# Initialize model using 2 predictors and sample data
model = Lm("DV ~ IV1 + IV2", data=df)
# Fit it
print(model.fit())
Formula: DV~IV1+IV2
Family: gaussian Estimator: OLS
Std-errors: non-robust CIs: standard 95% Inference: parametric
Number of observations: 564 R^2: 0.512 R^2_adj: 0.510
Log-likelihood: -2527.681 AIC: 5061.363 BIC: 5074.368
Fixed effects:
Estimate 2.5_ci 97.5_ci SE DF T-stat P-val Sig
Intercept 1.657 -4.107 7.422 2.935 561 0.565 0.573
IV1 0.334 -0.023 0.690 0.181 561 1.839 0.066 .
IV2 0.747 0.686 0.807 0.031 561 24.158 0.000 ***
All information about the model, as well as its data, residuals, estimated coefficients, etc., is saved as attributes and can be accessed like this:
# Print model AIC
print(model.AIC)
5061.3629635837815
# Look at residuals (just the first 10)
print(model.residuals[:10])
[-3.79994762 6.94860187 -8.32917613 1.19463387 -5.8271851 -6.88457421
0.40673658 9.77173122 -7.33135842 -7.37107236]
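Estimated coefficients are stored the same way; model.coefs holds the same DataFrame of estimates and inference results that .fit() prints (output omitted here):
# Look at estimated coefficients, CIs, and p-values as a DataFrame
print(model.coefs)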
A copy of the dataframe used to estimate the model, with added columns for residuals and fits, is available at model.data. Residuals and fits can also be accessed directly via model.residuals and model.fits, respectively:
# Look at model data
print(model.data.head())
Group IV1 DV_l DV IV2 IV3 fits residuals
0 1 20.0 0 7.936508 4.563492 0.5 11.736456 -3.799948
1 1 20.0 0 15.277778 0.000000 1.0 8.329176 6.948602
2 1 20.0 1 0.000000 0.000000 1.5 8.329176 -8.329176
3 1 20.0 1 9.523810 0.000000 0.5 8.329176 1.194634
4 1 12.5 0 0.000000 0.000000 1.0 5.827185 -5.827185
This makes it easy to assess overall model fit visually, for example using seaborn:
# import dataviz
import seaborn as sns
# plot model predicted values against true values
sns.regplot(x="fits", y="DV", data=model.data, fit_reg=True)
<Axes: xlabel='fits', ylabel='DV'>
Robust and WLS estimation
Lm models can also perform inference using robust standard errors, or perform weighted-least-squares estimation (an experimental feature) for models with categorical predictors (equivalent to Welch’s t-test).
# Refit previous model using robust standard errors
print(model.fit(robust="hc1"))
Formula: DV~IV1+IV2
Family: gaussian Estimator: OLS
Std-errors: robust (hc1) CIs: standard 95% Inference: parametric
Number of observations: 564 R^2: 0.512 R^2_adj: 0.510
Log-likelihood: -2527.681 AIC: 5061.363 BIC: 5074.368
Fixed effects:
Estimate 2.5_ci 97.5_ci SE DF T-stat P-val Sig
Intercept 1.657 -3.429 6.744 2.590 561 0.640 0.522
IV1 0.334 -0.026 0.693 0.183 561 1.823 0.069 .
IV2 0.747 0.678 0.815 0.035 561 21.444 0.000 ***
# Since WLS is only supported with 2 groups for now, filter the data first
df_two_groups = df.query("IV3 in [0.5, 1.0]").reset_index(drop=True)
# Fit a new model using a categorical predictor with unequal variances (WLS)
model = Lm("DV ~ IV3", data=df_two_groups)
print(model.fit(weights="IV3"))
Formula: DV~IV3
Family: gaussian Estimator: WLS
Std-errors: non-robust CIs: standard 95% Inference: parametric
Number of observations: 376 R^2: 0.999 R^2_adj: 0.999
Log-likelihood: -532.518 AIC: 1069.036 BIC: 1076.896
Fixed effects:
Estimate 2.5_ci 97.5_ci SE DF T-stat P-val Sig
Intercept 45.647 35.787 55.507 5.015 373.483 9.103 0.000 ***
IV3 -2.926 -15.261 9.410 6.273 373.483 -0.466 0.641
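To see the Welch’s t-test equivalence for yourself, a quick cross-check with scipy (not part of the original example; assumes scipy is installed) should closely reproduce the t-statistic and degrees of freedom of the IV3 term above, up to sign:
# Cross-check the WLS result against Welch's t-test
from scipy import stats
g1 = df_two_groups.query("IV3 == 0.5")["DV"]
g2 = df_two_groups.query("IV3 == 1.0")["DV"]
# equal_var=False requests Welch's (unequal-variances) t-test
print(stats.ttest_ind(g1, g2, equal_var=False))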
Multi-level models
Fitting a multi-level model works similarly and actually just calls lmer or glmer in R behind the scenes. The corresponding output is also formatted to be very similar to the output of summary() in R.
# Import the lmm model class
from pymer4.models import Lmer
# Initialize model instance using 1 predictor with random intercepts and slopes
model = Lmer("DV ~ IV2 + (IV2|Group)", data=df)
# Fit it
print(model.fit())
Model failed to converge with max|grad| = 0.00358015 (tol = 0.002, component 1)
Linear mixed model fit by REML ['lmerMod']
Formula: DV~IV2+(IV2|Group)
Family: gaussian Inference: parametric
Number of observations: 564 Groups: {'Group': 47.0}
Log-likelihood: -2249.281 AIC: 4510.562
Random effects:
Name Var Std
Group (Intercept) 203.390 14.261
Group IV2 0.136 0.369
Residual 121.537 11.024
IV1 IV2 Corr
Group (Intercept) IV2 -0.585
Fixed effects:
Estimate 2.5_ci 97.5_ci SE DF T-stat P-val Sig
(Intercept) 10.300 4.806 15.795 2.804 20.183 3.674 0.001 **
IV2 0.682 0.556 0.808 0.064 42.402 10.599 0.000 ***
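Since Lmer wraps glmer as well, generalized models just need a family argument at initialization. For example, a brief sketch of a multi-level logistic regression on the dichotomous DV_l (output omitted; family names follow lme4’s conventions):
# Fit a multi-level logistic model; family='binomial' is handled by glmer
logit_model = Lmer("DV_l ~ IV2 + (IV2|Group)", data=df, family="binomial")
logit_model.fit()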
Similar to Lm models, Lmer models save details in model attributes and have additional methods that can be called using the same syntax described above.
# Get population level coefficients
print(model.coefs)
Estimate 2.5_ci 97.5_ci ... T-stat P-val Sig
(Intercept) 10.300430 4.805524 15.795335 ... 3.674034 1.487605e-03 **
IV2 0.682128 0.555987 0.808268 ... 10.598847 1.706855e-13 ***
[2 rows x 8 columns]
# Get group level coefficients (just the first 5)
# Each row here is a unique intercept and slope
# which vary because we parameterized our rfx that way above
print(model.fixef.head(5))
(Intercept) IV2
1 4.482095 0.885138
2 17.991023 0.622143
3 8.706144 0.838055
4 10.143487 0.865341
5 10.071328 0.182101
# Get group level deviates from population level coefficients (i.e. rfx)
print(model.ranef.head(5))
X.Intercept. IV2
1 -5.818335 0.203011
2 7.690593 -0.059985
3 -1.594286 0.155927
4 -0.156943 0.183213
5 -0.229102 -0.500026
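As a quick sanity check (not part of the original tutorial), the group-level coefficients are simply the population estimates plus these group deviates, assuming the column ordering lines up (it does here):
# Verify fixef = population coefficients + group deviates
# e.g. for group 1: 10.300 + (-5.818) = 4.482 and 0.682 + 0.203 = 0.885
import numpy as np
print(np.allclose(model.coefs["Estimate"].values + model.ranef.values,
                  model.fixef.values))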
Lmer models also have some basic plotting abilities that Lm models do not:
# Visualize coefficients with group/cluster fits overlaid ("forest plot")
model.plot_summary()
<Axes: xlabel='Estimate'>
# Plot coefficients for each group/cluster as separate regressions
model.plot("IV2", plot_ci=True, ylabel="predicted DV")
<Axes: xlabel='IV2', ylabel='predicted DV'>
Because Lmer models rely on R, they also have some extra arguments to the .fit() method for controlling things like optimizer behavior, as well as additional methods for post-hoc tests and ANOVAs. See tutorial 2 for full details on this functionality.
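As a brief preview, here is a hedged sketch of what this looks like; the factors, control, and marginal_vars arguments below follow pymer4’s documented API, but treat this as illustrative rather than definitive:
# Treat IV3 as categorical; the control string is passed to lmerControl in R
anova_model = Lmer("DV ~ IV3 + (1|Group)", data=df)
anova_model.fit(factors={"IV3": ["0.5", "1.0", "1.5"]},
                control="optimizer='Nelder_Mead'")
# F-tests for each term
print(anova_model.anova())
# Marginal estimates for each IV3 level and pairwise comparisons between them
means, contrasts = anova_model.post_hoc(marginal_vars="IV3")
print(contrasts)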
Two-stage summary statistics models
Fitting Lm2 models is also very similar. These models estimate a separate regression for each group and then perform inference on the distribution of coefficients across groups:
# Import the lm2 model class
from pymer4.models import Lm2
# This time we use the 'group' argument when initializing the model
model = Lm2("DV ~ IV2", group="Group", data=df)
# Fit it
print(model.fit())
/home/runner/work/pymer4/pymer4/pymer4/stats.py:657: RuntimeWarning: invalid value encountered in scalar divide
return 1 - (rss / tss)
Formula: DV~IV2
Family: gaussian
Std-errors: non-robust CIs: standard 95% Inference: parametric
Number of observations: 564 Groups: {'Group': 47}
Fixed effects:
Estimate 2.5_ci 97.5_ci SE DF T-stat P-val Sig
Intercept 14.240 4.891 23.589 4.644 46 3.066 0.004 **
IV2 0.614 0.445 0.782 0.084 46 7.340 0.000 ***
Like Lmer models, Lm2 models also store group/cluster level estimates and have some basic plotting functionality:
# Get group level coefficients, just the first 5
print(model.fixef.head(5))
Intercept IV2
Group
1 3.039903 1.781832
2 23.388350 0.524852
3 4.904321 0.919913
4 23.304669 0.719425
5 18.378387 -0.256136
# Visualize coefficients with group/cluster fits overlaid ("forest plot")
model.plot_summary()
<Axes: xlabel='Estimate'>
Model Persistence
All pymer4 models can be saved to and loaded from disk. Doing so will persist all model attributes and data, i.e. anything accessible with the ‘.’ syntax. Models are saved and loaded using joblib, so all filenames must end with .joblib. For Lmer models, an additional file ending in .rds will be saved in the same directory as the .joblib file. This is the R model object, readable in R using readRDS.
Prior to version 0.8.1, models were saved to HDF5 files using deepdish, but this library is no longer maintained. If you have old models saved as .h5 or .hdf5 files, you should use the same version of pymer4 that you used to estimate those models.
To persist models you can use the dedicated save_model and load_model functions from the pymer4.io module:
# Import functions
from pymer4.io import save_model, load_model
# Save the Lm2 model above
save_model(model, "mymodel.joblib")
# Load it back up
model = load_model("mymodel.joblib")
# Check that it looks the same
print(model)
pymer4.models.Lm2(fitted=True, formula=DV~IV2, family=gaussian, group=Group)
Wrap Up
This was a quick overview of the 3 major model classes in pymer4. However, it’s highly recommended to check out the API to see all the features and options that each model class has, including things like permutation-based inference (Lm and Lm2 models) and fine-grained control of optimizer and tolerance settings (Lmer models).
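For instance, permutation-based inference is just an extra argument to .fit(). A minimal sketch, assuming the permute argument takes the number of permutations as described in pymer4’s documentation:
# Refit the first regression using 500 permutations for inference
perm_model = Lm("DV ~ IV1 + IV2", data=df)
print(perm_model.fit(permute=500))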