Linear Regression#

The lm() class makes it easy to work with regression models estimated via Ordinary Least Squares (OLS).

Core Methods and Attributes

Method        Description
.fit()        Fit a model using lm() in R
.summary()    Get a nicely formatted table containing .result_fit
.empredict()  Compute marginal predictions at arbitrary values of predictors
.simulate()   Simulate new observations from the fitted model
.predict()    Make predictions given a new dataset

Models always store the results of their last estimation as polars DataFrames in the following attributes:

Attribute          Stores
.params            parameter estimates
.result_fit        parameter estimates, errors, & inferential statistics
.result_fit_stats  model fit statistics
.result_anova      ANOVA table (next tutorial)
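Putting these together, a typical workflow looks like the following sketch (df, x, and y are placeholder names, not part of the dataset we use below):

from pymer4.models import lm

# Sketch of the general pattern: initialize with formula + data, fit,
# then read results off the model's attributes
model = lm("y ~ x", data=df)
model.fit()
model.result_fit        # estimates, errors, & inferential statistics
model.result_fit_stats  # R-squared, AIC/BIC, etc.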

from pymer4.models import lm
from pymer4 import load_dataset
import seaborn as sns

data = load_dataset("advertising")
data.head()
shape: (5, 4)
tv     radio  newspaper  sales
f64    f64    f64        f64
230.1  37.8   69.2       22.1
44.5   39.3   45.1       10.4
17.2   45.9   69.3       9.3
151.5  41.3   58.5       18.5
180.8  10.8   58.4       12.9

Univariate regression#

Let’s fit a model to estimate the following relationship:

\[ Sales \sim \beta_0 + \beta_1 * \text{TV Ad Spending} \]
grid = sns.lmplot(x="tv", y="sales", data=data, height=4, aspect=1.25)
grid.set_axis_labels("TV Ad Spending ($100k)", "Sales ($100k)")
sns.despine();

We initialize a model with a formula and a dataframe and call its .fit method.

Using summary=True will return a nicely-formatted table

model = lm("sales ~ tv", data=data)
model.fit(summary=True)
Formula: lm(sales~tv)
Number of observations: 200
Confidence intervals: parametric
---------------------
R-squared: 0.6119
R-squared-adj: 0.6099
F(1, 198) = 312.145, p = <.001
Log-likelihood: -519
AIC: 1044 | BIC: 1053
Residual error: 3.259
Estimate SE CI-low CI-high T-stat df p
(Intercept) 7.033 0.458 6.130 7.935 15.360 198 <.001 ***
tv 0.048 0.003 0.042 0.053 17.668 198 <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
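Since OLS has a closed-form solution, we can sanity-check these estimates outside of pymer4 with the normal equations (a minimal sketch using numpy, not pymer4 API):

import numpy as np

# beta_hat = (X'X)^-1 X'y for a design matrix with an intercept column
X = np.column_stack([np.ones(len(data)), data["tv"].to_numpy()])
y = data["sales"].to_numpy()
print(np.linalg.solve(X.T @ X, X.T @ y))  # ~[7.033, 0.0475], matching above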

All estimation results are always stored within the model and accessible as a dataframe

model.result_fit
shape: (2, 8)
term           estimate  std_error  conf_low  conf_high  t_stat     df   p_value
str            f64       f64        f64       f64        f64        i64  f64
"(Intercept)"  7.032594  0.457843   6.129719  7.935468   15.360275  198  1.4063e-35
"tv"           0.047537  0.002691   0.042231  0.052843   17.667626  198  1.4674e-42

Result fit statistics (presented in the summary) are also stored within the model

model.result_fit_stats
shape: (1, 12)
r_squared  adj_r_squared  sigma     statistic   p_value     df   logLik       AIC          BIC         deviance     df_residual  nobs
f64        f64            f64       f64         f64         f64  f64          f64          f64         f64          i64          i64
0.611875   0.609915       3.258656  312.144994  1.4674e-42  1.0  -519.045664  1044.091328  1053.98628  2102.530583  198          200

Models conveniently keep track of the data used for estimation and augment this dataframe with additional columns of observation-level residuals, predictions, and fit statistics

model.data.head()
shape: (5, 10)
tv     radio  newspaper  sales  fitted     resid      hat       sigma     cooksd    std_resid
f64    f64    f64        f64    f64        f64        f64       f64       f64       f64
230.1  37.8   69.2       22.1   17.970775  4.129225   0.009703  3.253513  0.007943  1.273349
44.5   39.3   45.1       10.4   9.147974   1.252026   0.012169  3.265684  0.00092   0.386575
17.2   45.9   69.3       9.3    7.850224   1.449776   0.016494  3.265256  0.001688  0.448615
151.5  41.3   58.5       18.5   14.234395  4.265605   0.005014  3.252678  0.004339  1.312301
180.8  10.8   58.4       12.9   15.627218  -2.727218  0.005777  3.261099  0.002047  -0.839343
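These columns also make it easy to recompute summary statistics by hand, e.g. recovering the R-squared reported above (a quick sketch):

# R-squared = 1 - SS_residual / SS_total, using the stored columns
ss_res = (model.data["resid"] ** 2).sum()
ss_tot = ((model.data["sales"] - model.data["sales"].mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # ~0.6119, matching the summary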

This makes it easy to visually inspect fits

grid = sns.lmplot(x="fitted", y="sales", fit_reg=False, data=model.data, height=4, aspect=1.5)
grid.set_axis_labels("Sales Predicted ($100k)", "Sales ($100k)")
sns.despine();

And model residuals

grid = sns.displot(x="resid", data=model.data, height=4, aspect=1.5)
grid.set_axis_labels("Residuals", "Frequency")
grid.ax.axvline(0, color="k", linestyle="--", linewidth=2)
sns.despine();

Multiple Regression#

Let’s extend our model to estimate a multiple regression using another continuous predictor

\[ Sales \sim \beta_0 + \beta_1 * \text{TV Ad Spending} + \beta_2 * \text{Radio Ad Spending}\]
model = lm("sales ~ tv + radio", data=data)
model.fit(summary=True)
Formula: lm(sales~tv+radio)
Number of observations: 200
Confidence intervals: parametric
---------------------
R-squared: 0.8972
R-squared-adj: 0.8962
F(2, 197) = 859.618, p = <.001
Log-likelihood: -386
AIC: 780 | BIC: 793
Residual error: 1.681
Estimate SE CI-low CI-high T-stat df p
(Intercept) 2.921 0.294 2.340 3.502 9.919 197 <.001 ***
tv 0.046 0.001 0.043 0.048 32.909 197 <.001 ***
radio 0.188 0.008 0.172 0.204 23.382 197 <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

Marginal Predictions#

When working with multiple predictors it can be helpful to visualize marginal estimates to interpret parameters. These are model predictions across values of one predictor while holding another fixed at one or more values.

We can use .empredict() to generate predictions for specific value combinations of our tv and radio predictors. Let’s use it to explore the relationship between \(\text{Sales}\) and \(\text{TV Ad Spending}\) across different values of \(\text{Radio Ad Spending}\).

Using the "data" argument for tv tells the model to use all the observed values of that variable in the dataset, so we’re generating 3 predictions per data point: radio = 0, radio = 20, and radio = 40

# Predict across the range of observed TV values at Radio values of 0, 20, and 40
marginal_predictions = model.empredict({"tv": "data", "radio": [0, 20, 40]})

We can then visualize these marginal slopes

ax = sns.scatterplot(
    data=model.data,
    x="tv",
    y="sales",
    hue="radio",
    size="radio",
)

ax = sns.lineplot(
    x="tv", y="prediction", hue="radio", errorbar=None, data=marginal_predictions, ax=ax, legend=False
)
ax.set(xlabel='TV Ad Spending ($100k)', ylabel='Sales ($100k)');
sns.despine();

Hmm, it looks like there might be an interaction between the variables that we’re missing…

Let’s extend our model to estimate a multiple regression with an interaction

\[ Sales \sim \beta_0 + \beta_1 * \text{TV Ad Spending} + \beta_2 * \text{Radio Ad Spending} + \beta_3 * \text{TV Ad Spending} * \text{Radio Ad Spending} \]
with_interaction = lm("sales ~ tv * radio", data=data)
with_interaction.fit()

Model comparison#

Before interpreting the interaction estimate we can use model comparison to check whether this additional parameter results in a meaningful reduction in model error

Note

compare() is powered by R’s anova() which uses F-tests (by default) to compare lm/glm models and likelihood-ratio tests to compare lmer/glmer models

from pymer4.models import compare

compare(model, with_interaction)
Analysis of Deviance Table
Model 1: lm(sales~tv+radio)
Model 2: lm(sales~tv*radio)
      AIC    BIC    logLik  Res_Df  RSS     Df    Sum of Sq  F       Pr(>F)
1.00  780.4  793.6  −386.2  197.00  556.91
2.00  550.3  566.8  −270.1  196.00  174.48  1.00  382.43     429.59  <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
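The F-statistic in the table is easy to verify by hand from the two residual sums of squares (a quick sketch using the numbers above):

# F = (reduction in RSS / change in df) / (RSS_full / df_residual_full)
rss_reduced, rss_full = 556.91, 174.48
print((rss_reduced - rss_full) / 1 / (rss_full / 196))  # ~429.6, as reported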

In models with multiple predictors and interactions, it’s good practice to center or zscore predictors to improve interpretability and reduce multicollinearity.

If we inspect our model’s design matrix we can see the strong correlations between predictors, which can reduce the stability of our estimates

ax = sns.heatmap(
    with_interaction.design_matrix.drop("(Intercept)").corr(),
    cmap="vlag",
    center=0,
    vmin=-1,
    vmax=1,
    annot=True,
    xticklabels=with_interaction.design_matrix.columns[1:],
    yticklabels=with_interaction.design_matrix.columns[1:],
)
ax.set_title("Correlation matrix of model predictors");

And we can inspect the variance inflation factor (VIF) of each term to quantify the increased uncertainty of our estimates. Here, multi-collinearity would increase the width of our confidence intervals by a factor of roughly 2-2.5x

with_interaction.vif()
shape: (3, 3)
term        vif       ci_increase_factor
str         f64       f64
"tv"        3.727848  1.930764
"radio"     3.907651  1.976778
"tv:radio"  6.93786   2.633982
# Note this visualization requires the altair package
# which is not installed by default

from altair import X, Y, Color, Scale

(
    with_interaction.vif()
    .plot.bar(
        y=Y("term").axis(title=("")),
        x=X("ci_increase_factor:Q", scale=Scale(domain=[.9, 2.7])).axis(
            title=("CI Increase by a Factor of"),
            format='~s',
            labelExpr="datum.label + 'x'"
        ),
        color=Color("vif:Q", scale=Scale(scheme="blues", domain=[1, 7])).title("VIF"),
    )
    .properties(height=100, width=400)
)

Transforming Predictors#

We could manually center our variables by creating new columns in our dataframe, but the .set_transforms() method can handle this for us!
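For reference, the manual route would look something like this sketch (tv_c and radio_c are hypothetical column names):

import polars as pl

# Manual alternative: center each predictor yourself, then refit with a
# formula referencing the new columns, e.g. lm("sales ~ tv_c * radio_c", ...)
centered = data.with_columns(
    (pl.col("tv") - pl.col("tv").mean()).alias("tv_c"),
    (pl.col("radio") - pl.col("radio").mean()).alias("radio_c"),
)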

Now our estimates for \(\beta_0\), \(\beta_1\), and \(\beta_2\) are interpretable with the other terms held at their means

Note

You can use .unset_transforms() to undo any transformations and reset variables to their original values, and .show_transforms() to see which variables (if any) have been transformed

with_interaction.set_transforms({"tv": "center", "radio": "center"})
with_interaction.fit(summary=True)
Formula: lm(sales~tv*radio)
Number of observations: 200
Confidence intervals: parametric
---------------------
R-squared: 0.9678
R-squared-adj: 0.9673
F(3, 196) = 1963.057, p = <.001
Log-likelihood: -270
AIC: 550 | BIC: 566
Residual error: 0.944
Estimate SE CI-low CI-high T-stat df p
(Intercept) 13.947 0.067 13.815 14.079 208.737 196 <.001 ***
tv 0.044 0.001 0.043 0.046 56.673 196 <.001 ***
radio 0.189 0.005 0.180 0.198 41.806 196 <.001 ***
tv:radio 0.001 0.000 0.001 0.001 20.727 196 <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

If we inspect the model’s data we can see that it made a backup of the original data columns and replaced values with mean-centered values. This makes it easy to keep the same model formula and transform your variables as needed.

with_interaction.data.glimpse()
Rows: 200
Columns: 12
$ tv         <f64> 83.0575, -102.54249999999999, -129.8425, 4.45750000000001, 33.75750000000002, -138.3425, -89.54249999999999, -26.842499999999987, -138.4425, 52.75750000000002
$ radio      <f64> 14.536000000000001, 16.036, 22.636000000000003, 18.036, -12.463999999999995, 25.636000000000003, 9.536000000000001, -3.6639999999999944, -21.163999999999994, -20.663999999999994
$ newspaper  <f64> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2
$ sales      <f64> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6
$ fitted     <f64> 21.686389994098406, 10.634545599203337, 9.261214108544735, 17.63410792693333, 12.636919029730022, 8.789897605877398, 10.844280097486708, 12.171526528493542, 6.994718246057654, 11.206063903969167
$ resid      <f64> 0.41361000590159236, -0.23454559920333828, 0.03878589145526057, 0.8658920730666715, 0.2630809702699821, -1.5898976058774013, 0.9557199025132928, 1.0284734715064587, -2.1947182460576498, -0.6060639039691598
$ hat        <f64> 0.017369747127019233, 0.026436325056488314, 0.05425641179967553, 0.012417691773576945, 0.010414816585386587, 0.07087976521621937, 0.01492916315399197, 0.0057684281110012515, 0.0553012086193285, 0.021874324715662953
$ sigma      <f64> 0.9454595555094961, 0.9457784128013348, 0.9459272807185634, 0.9438714223516425, 0.9457419895847894, 0.938527958101928, 0.9434147988547761, 0.9430433221311386, 0.9320081079907699, 0.9449131143163492
$ cooksd     <f64> 0.0008642458743608146, 0.000430892510699134, 2.5626767031503242e-05, 0.002680796259339063, 0.00020671221250243943, 0.058285261875678805, 0.003946424590643695, 0.0017334470503899675, 0.08381996205406109, 0.002358436473146814
$ std_resid  <f64> 0.4422287472812875, -0.25193940977658613, 0.04227056711874558, 0.9234813152998302, 0.28029402654502694, -1.7481721440801545, 1.0205819867787846, 1.0932017507705427, -2.3932225922640367, -0.6494894322243936
$ tv_orig    <f64> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8
$ radio_orig <f64> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6

And we’ve reduced the multi-collinearity between our predictors

ax = sns.heatmap(
    with_interaction.design_matrix.drop("(Intercept)").corr(),
    cmap="vlag",
    center=0,
    vmin=-1,
    vmax=1,
    annot=True,
    xticklabels=with_interaction.design_matrix.columns[1:],
    yticklabels=with_interaction.design_matrix.columns[1:],
)
ax.set_title("Correlation matrix of centered model predictors");

And there is virtually no variance inflation for our predictors

with_interaction.vif()
shape: (3, 3)
term        vif       ci_increase_factor
str         f64       f64
"tv"        1.010291  1.005132
"radio"     1.003058  1.001528
"tv:radio"  1.007261  1.003624
# Note this visualization requires the altair package
# which is not installed by default
(
    with_interaction.vif()
    .plot.bar(
        y=Y("term").axis(title=("")),
        x=X("ci_increase_factor:Q", scale=Scale(domain=[.9, 2.7])).axis(
            title=("CI Increase by a Factor of"),
            format='~s',
            labelExpr="datum.label + 'x'"
        ),
        color=Color("vif:Q", scale=Scale(scheme="blues", domain=[1, 7])).title("VIF"),
    )
    .properties(height=100, width=400)
)

Visualizing our marginal predictions shows that the model with an interaction fits the data better, as the relationship between \(\text{Sales}\) and \(\text{TV Ad Spending}\) varies across levels of \(\text{Radio Ad Spending}\)

# Predict across the range of observed TV values at Radio values of 0, 20, and 40
marginal_predictions = with_interaction.empredict({"tv": "data", "radio": [0, 20, 40]})

Note

By default .empredict allows you to specify values on the original scale of the predictors, because this is often easier to think about with respect to your data. You can disable this by setting apply_transforms=False, in which case you should pass values on the transformed scale, e.g. the mean-centered values [0, 20, 40] - mean(radio)
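In other words, the equivalent call on the transformed scale would look something like this sketch (assuming the same dict format is accepted):

# Center the radio values yourself and tell empredict not to re-apply transforms
radio_mean = data["radio"].mean()
with_interaction.empredict(
    {"tv": "data", "radio": [v - radio_mean for v in [0, 20, 40]]},
    apply_transforms=False,
)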

ax = sns.scatterplot(
    data=with_interaction.data,
    x="tv",
    y="sales",
    hue="radio",
    size="radio",
)

ax.legend(title='Radio (centered)')
ax = sns.lineplot(
    x="tv", y="prediction", hue="radio", errorbar=None, data=marginal_predictions, ax=ax, legend=False
)
ax.set(xlabel='TV Ad Spending (centered)', ylabel='Sales ($100k)');
sns.despine();

Bootstrapped Confidence Intervals#

It’s easy to get a variety of bootstrapped confidence intervals for any model using .fit(conf_method='boot')

model.fit(conf_method="boot", nboot=500, summary=True)
Formula: lm(sales~tv+radio)
Number of observations: 200
Confidence intervals: boot
Bootstrap Iterations: 500
---------------------
R-squared: 0.8972
R-squared-adj: 0.8962
F(2, 197) = 859.618, p = <.001
Log-likelihood: -386
AIC: 780 | BIC: 793
Residual error: 1.681
Estimate SE CI-low CI-high T-stat df p
(Intercept) 2.921 0.294 2.196 3.520 9.919 197 <.001 ***
tv 0.046 0.001 0.042 0.050 32.909 197 <.001 ***
radio 0.188 0.008 0.169 0.206 23.382 197 <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

Individual bootstraps are stored in the .result_boots attribute of the model by default (you can control this with save_boots=False), which makes it easy to create visualizations:

grid = sns.FacetGrid(data=model.result_boots.unpivot(), col='variable', hue='variable', sharex=False, sharey=False, height=2.5)
grid.map(sns.kdeplot, 'value')
grid.set_xlabels('')
grid.set_titles("{col_name}")
grid.figure.suptitle("Bootstrapped Parameter Distributions")
grid.figure.tight_layout();
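The stored draws also let you recompute intervals yourself, e.g. 95% percentile intervals (a sketch; these may differ slightly from the summary's intervals depending on the bootstrap CI method used):

import polars as pl

# Quantiles of each parameter's bootstrap distribution
(
    model.result_boots.unpivot()
    .group_by("variable")
    .agg(
        pl.col("value").quantile(0.025).alias("ci_low"),
        pl.col("value").quantile(0.975).alias("ci_high"),
    )
)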

Simulating from the model#

Simulating new datasets, i.e. new \(Sales\) values, from a model is just as easy

# Simulate 3 datasets
new_data = model.simulate(n=3)
new_data.head()
shape: (5, 3)
sim_1      sim_2      sim_3
f64        f64        f64
19.50217   21.243817  22.361988
12.654133  15.184968  15.532642
10.932024  15.004646  11.323162
20.299359  17.060741  16.959926
13.77793   9.381602   12.524089
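One simple use of simulations is checking that simulated outcomes resemble the observed data (a sketch):

# Compare observed sales to the average across simulated datasets
sims = model.simulate(n=100)
print(data["sales"].mean(), sims.to_numpy().mean())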