Linear Regression#

The lm() class makes it easy to work with regression models estimated via Ordinary Least Squares (OLS).

Core Methods and Attributes

Method        Description
.fit()        Fit a model using lm() in R
.summary()    Get a nicely formatted table containing .result_fit
.empredict()  Compute marginal predictions at arbitrary values of predictors
.simulate()   Simulate new observations from the fitted model
.predict()    Make predictions given a new dataset

Models always store the results of their last estimation as polars DataFrames in the following attributes:

Attribute          Stores
.params            parameter estimates
.result_fit        parameter estimates, errors, & inferential statistics
.result_fit_stats  model fit statistics
.result_anova      ANOVA table (next tutorial)
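Putting these together, a typical workflow looks like the following sketch (df, x, and y are placeholder names, not part of the dataset we use below):

from pymer4.models import lm

# Sketch of the general pattern: initialize with formula + data, fit,
# then read results off the model's attributes
model = lm("y ~ x", data=df)
model.fit()
model.result_fit        # estimates, errors, & inferential statistics
model.result_fit_stats  # R-squared, AIC/BIC, etc.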

from pymer4.models import lm
from pymer4 import load_dataset
import seaborn as sns

data = load_dataset("advertising")
data.head()
shape: (5, 4)
tv     radio  newspaper  sales
f64    f64    f64        f64
230.1  37.8   69.2       22.1
44.5   39.3   45.1       10.4
17.2   45.9   69.3       9.3
151.5  41.3   58.5       18.5
180.8  10.8   58.4       12.9

Univariate regression#

Let’s fit a model to estimate the following relationship:

\[ Sales \sim \beta_0 + \beta_1 * \text{TV Ad Spending} \]
grid = sns.lmplot(x="tv", y="sales", data=data, height=4, aspect=1.25)
grid.set_axis_labels("TV Ad Spending ($100k)", "Sales ($100k)")
sns.despine();

We initialize a model with a formula and a dataframe and call its .fit method.

Using summary=True will return a nicely-formatted table

model = lm("sales ~ tv", data=data)
model.fit(summary=True)
Formula: lm(sales~tv)
Number of observations: 200
Confidence intervals: parametric
---------------------
R-squared: 0.6119
R-squared-adj: 0.6099
F(1, 198) = 312.145, p = <.001
Log-likelihood: -519
AIC: 1044 | BIC: 1053
Residual error: 3.259
Estimate SE CI-low CI-high T-stat df p
(Intercept) 7.033 0.458 6.130 7.935 15.360 198 <.001 ***
tv 0.048 0.003 0.042 0.053 17.668 198 <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
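Since OLS has a closed-form solution, we can sanity-check these estimates outside of pymer4 with the normal equations (a minimal sketch using numpy, not pymer4 API):

import numpy as np

# beta_hat = (X'X)^-1 X'y for a design matrix with an intercept column
X = np.column_stack([np.ones(len(data)), data["tv"].to_numpy()])
y = data["sales"].to_numpy()
print(np.linalg.solve(X.T @ X, X.T @ y))  # ~[7.033, 0.0475], matching above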

All estimation results are always stored within the model and accessible as a dataframe

model.result_fit
shape: (2, 8)
term           estimate  std_error  conf_low  conf_high  t_stat     df   p_value
str            f64       f64        f64       f64        f64        i64  f64
"(Intercept)"  7.032594  0.457843   6.129719  7.935468   15.360275  198  1.4063e-35
"tv"           0.047537  0.002691   0.042231  0.052843   17.667626  198  1.4674e-42

Result fit statistics (presented in the summary) are also stored within the model

model.result_fit_stats
shape: (1, 12)
r_squared  adj_r_squared  sigma     statistic   p_value     df   logLik       AIC          BIC         deviance     df_residual  nobs
f64        f64            f64       f64         f64         f64  f64          f64          f64         f64          i64          i64
0.611875   0.609915       3.258656  312.144994  1.4674e-42  1.0  -519.045664  1044.091328  1053.98628  2102.530583  198          200

Models conveniently keep track of the data used for estimation and augment this dataframe with additional columns of observation-level residuals, predictions, and fit statistics

model.data.head()
shape: (5, 10)
tv     radio  newspaper  sales  fitted     resid      hat       sigma     cooksd    std_resid
f64    f64    f64        f64    f64        f64        f64       f64       f64       f64
230.1  37.8   69.2       22.1   17.970775  4.129225   0.009703  3.253513  0.007943  1.273349
44.5   39.3   45.1       10.4   9.147974   1.252026   0.012169  3.265684  0.00092   0.386575
17.2   45.9   69.3       9.3    7.850224   1.449776   0.016494  3.265256  0.001688  0.448615
151.5  41.3   58.5       18.5   14.234395  4.265605   0.005014  3.252678  0.004339  1.312301
180.8  10.8   58.4       12.9   15.627218  -2.727218  0.005777  3.261099  0.002047  -0.839343
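These columns also make it easy to recompute summary statistics by hand, e.g. recovering the R-squared reported above (a quick sketch):

# R-squared = 1 - SS_residual / SS_total, using the stored columns
ss_res = (model.data["resid"] ** 2).sum()
ss_tot = ((model.data["sales"] - model.data["sales"].mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # ~0.6119, matching the summary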

This makes it easy to visually inspect fits

grid = sns.lmplot(x="fitted", y="sales", fit_reg=False, data=model.data, height=4, aspect=1.5)
grid.set_axis_labels("Sales Predicted ($100k)", "Sales ($100k)")
sns.despine();

And model residuals

grid = sns.displot(x="resid", data=model.data, height=4, aspect=1.5)
grid.set_axis_labels("Residuals", "Frequency")
grid.ax.axvline(0, color="k", linestyle="--", linewidth=2)
sns.despine();

Multiple Regression#

Let’s extend our model to estimate a multiple regression using another continuous predictor

\[ Sales \sim \beta_0 + \beta_1 * \text{TV Ad Spending} + \beta_2 * \text{Radio Ad Spending}\]
model = lm("sales ~ tv + radio", data=data)
model.fit(summary=True)
Formula: lm(sales~tv+radio)
Number of observations: 200
Confidence intervals: parametric
---------------------
R-squared: 0.8972
R-squared-adj: 0.8962
F(2, 197) = 859.618, p = <.001
Log-likelihood: -386
AIC: 780 | BIC: 793
Residual error: 1.681
Estimate SE CI-low CI-high T-stat df p
(Intercept) 2.921 0.294 2.340 3.502 9.919 197 <.001 ***
tv 0.046 0.001 0.043 0.048 32.909 197 <.001 ***
radio 0.188 0.008 0.172 0.204 23.382 197 <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

Marginal Predictions#

When working with multiple predictors it can be helpful to visualize marginal estimates to interpret parameters. These are model predictions across values of one predictor while holding another fixed at one or more values.

We can use .empredict() to generate predictions for specific value combinations of our tv and radio predictors. Let’s use it to explore the relationship between \(\text{Sales}\) and \(\text{TV Ad Spending}\) across different values of \(\text{Radio Ad Spending}\).

Using the "data" argument for tv tells the model to use all the observed values of that variable in the dataset, so we’re generating 3 predictions per data point: radio = 0, radio = 20, and radio = 40

# Predict across the range of observed TV values at Radio values of 0, 20, and 40
marginal_predictions = model.empredict({"tv": "data", "radio": [0, 20, 40]})

We can then visualize these marginal slopes

ax = sns.scatterplot(
    data=model.data,
    x="tv",
    y="sales",
    hue="radio",
    size="radio",
)

ax = sns.lineplot(
    x="tv", y="prediction", hue="radio", errorbar=None, data=marginal_predictions, ax=ax, legend=False
)
ax.set(xlabel='TV Ad Spending ($100k)', ylabel='Sales ($100k)');
sns.despine();

Hmm, it looks like there might be an interaction between the variables that we’re missing…

Let’s extend our model to estimate a multiple regression with an interaction

\[ Sales \sim \beta_0 + \beta_1 * \text{TV Ad Spending} + \beta_2 * \text{Radio Ad Spending} + \beta_3 * \text{TV Ad Spending} * \text{Radio Ad Spending} \]
with_interaction = lm("sales ~ tv * radio", data=data)
with_interaction.fit()

Model comparison#

Before interpreting the interaction estimate we can use model comparison to check whether this additional parameter results in a meaningful reduction in model error

Note

compare() is powered by R’s anova() which uses F-tests (by default) to compare lm/glm models and likelihood-ratio tests to compare lmer/glmer models

from pymer4.models import compare

compare(model, with_interaction)
Analysis of Deviance Table
Model 1: lm(sales~tv+radio)
Model 2: lm(sales~tv*radio)
      AIC    BIC    logLik  Res_Df  RSS     Df    Sum of Sq  F       Pr(>F)
1.00  780.4  793.6  −386.2  197.00  556.91
2.00  550.3  566.8  −270.1  196.00  174.48  1.00  382.43     429.59  <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
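The F-statistic in the table is easy to verify by hand from the two residual sums of squares (a quick sketch using the numbers above):

# F = (reduction in RSS / change in df) / (RSS_full / df_residual_full)
rss_reduced, rss_full = 556.91, 174.48
print((rss_reduced - rss_full) / 1 / (rss_full / 196))  # ~429.6, as reported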

In models with multiple predictors and interactions, it’s good practice to center or zscore predictors to improve interpretability and reduce multicollinearity.

If we inspect our model’s design matrix we can see the strong correlations between predictors, which can reduce the stability of our estimates

ax = sns.heatmap(
    with_interaction.design_matrix.drop("(Intercept)").corr(),
    cmap="vlag",
    center=0,
    vmin=-1,
    vmax=1,
    annot=True,
    xticklabels=with_interaction.design_matrix.columns[1:],
    yticklabels=with_interaction.design_matrix.columns[1:],
)
ax.set_title("Correlation matrix of model predictors");

And we can inspect the variance inflation factor (VIF) of each term to quantify the increased uncertainty of our estimates. Here, multi-collinearity would increase the width of our confidence intervals by a factor of roughly 2-2.5x

with_interaction.vif()
shape: (3, 3)
term        vif       ci_increase_factor
str         f64       f64
"tv"        3.727848  1.930764
"radio"     3.907651  1.976778
"tv:radio"  6.93786   2.633982
# Note this visualization requires the altair package
# which is not installed by default

from altair import X, Y, Color, Scale

(
    with_interaction.vif()
    .plot.bar(
        y=Y("term").axis(title=("")),
        x=X("ci_increase_factor:Q", scale=Scale(domain=[.9, 2.7])).axis(
            title=("CI Increase by a Factor of"),
            format='~s',
            labelExpr="datum.label + 'x'"
        ),
        color=Color("vif:Q", scale=Scale(scheme="blues", domain=[1, 7])).title("VIF"),
    )
    .properties(height=100, width=400)
)

Transforming Predictors#

We could manually center our variables by creating new columns in our dataframe, but the .set_transforms() method can handle this for us!
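For reference, the manual route would look something like this sketch (tv_c and radio_c are hypothetical column names):

import polars as pl

# Manual alternative: center each predictor yourself, then refit with a
# formula referencing the new columns, e.g. lm("sales ~ tv_c * radio_c", ...)
centered = data.with_columns(
    (pl.col("tv") - pl.col("tv").mean()).alias("tv_c"),
    (pl.col("radio") - pl.col("radio").mean()).alias("radio_c"),
)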

Now our estimates for \(\beta_0\), \(\beta_1\), and \(\beta_2\) are interpretable with the other terms held at their means

Note

You can use .unset_transforms() to undo any transformations and reset variables to their original values, and .show_transforms() to see which variables (if any) have been transformed

with_interaction.set_transforms({"tv": "center", "radio": "center"})
with_interaction.fit(summary=True)
Formula: lm(sales~tv*radio)
Number of observations: 200
Confidence intervals: parametric
---------------------
R-squared: 0.9678
R-squared-adj: 0.9673
F(3, 196) = 1963.057, p = <.001
Log-likelihood: -270
AIC: 550 | BIC: 566
Residual error: 0.944
Estimate SE CI-low CI-high T-stat df p
(Intercept) 13.947 0.067 13.815 14.079 208.737 196 <.001 ***
tv 0.044 0.001 0.043 0.046 56.673 196 <.001 ***
radio 0.189 0.005 0.180 0.198 41.806 196 <.001 ***
tv:radio 0.001 0.000 0.001 0.001 20.727 196 <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

If we inspect the model’s data we can see that it made a backup of the original data columns and replaced values with mean-centered values. This makes it easy to keep the same model formula and transform your variables as needed.

with_interaction.data.glimpse()
Rows: 200
Columns: 12
$ tv         <f64> 83.0575, -102.54249999999999, -129.8425, 4.45750000000001, 33.75750000000002, -138.3425, -89.54249999999999, -26.842499999999987, -138.4425, 52.75750000000002
$ radio      <f64> 14.536000000000001, 16.036, 22.636000000000003, 18.036, -12.463999999999995, 25.636000000000003, 9.536000000000001, -3.6639999999999944, -21.163999999999994, -20.663999999999994
$ newspaper  <f64> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2
$ sales      <f64> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6
$ fitted     <f64> 21.686389994098406, 10.634545599203337, 9.261214108544735, 17.63410792693333, 12.636919029730022, 8.789897605877398, 10.844280097486708, 12.171526528493542, 6.994718246057654, 11.206063903969167
$ resid      <f64> 0.41361000590159236, -0.23454559920333828, 0.03878589145526057, 0.8658920730666715, 0.2630809702699821, -1.5898976058774013, 0.9557199025132928, 1.0284734715064587, -2.1947182460576498, -0.6060639039691598
$ hat        <f64> 0.017369747127019233, 0.026436325056488314, 0.05425641179967553, 0.012417691773576945, 0.010414816585386587, 0.07087976521621937, 0.01492916315399197, 0.0057684281110012515, 0.0553012086193285, 0.021874324715662953
$ sigma      <f64> 0.9454595555094961, 0.9457784128013348, 0.9459272807185634, 0.9438714223516425, 0.9457419895847894, 0.938527958101928, 0.9434147988547761, 0.9430433221311386, 0.9320081079907699, 0.9449131143163492
$ cooksd     <f64> 0.0008642458743608146, 0.000430892510699134, 2.5626767031503242e-05, 0.002680796259339063, 0.00020671221250243943, 0.058285261875678805, 0.003946424590643695, 0.0017334470503899675, 0.08381996205406109, 0.002358436473146814
$ std_resid  <f64> 0.4422287472812875, -0.25193940977658613, 0.04227056711874558, 0.9234813152998302, 0.28029402654502694, -1.7481721440801545, 1.0205819867787846, 1.0932017507705427, -2.3932225922640367, -0.6494894322243936
$ tv_orig    <f64> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8
$ radio_orig <f64> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6

And we’ve reduced the multi-collinearity between our predictors

ax = sns.heatmap(
    with_interaction.design_matrix.drop("(Intercept)").corr(),
    cmap="vlag",
    center=0,
    vmin=-1,
    vmax=1,
    annot=True,
    xticklabels=with_interaction.design_matrix.columns[1:],
    yticklabels=with_interaction.design_matrix.columns[1:],
)
ax.set_title("Correlation matrix of centered model predictors");

And there is virtually no variance inflation for our predictors

with_interaction.vif()
shape: (3, 3)
term        vif       ci_increase_factor
str         f64       f64
"tv"        1.010291  1.005132
"radio"     1.003058  1.001528
"tv:radio"  1.007261  1.003624
# Note this visualization requires the altair package
# which is not installed by default
(
    with_interaction.vif()
    .plot.bar(
        y=Y("term").axis(title=("")),
        x=X("ci_increase_factor:Q", scale=Scale(domain=[.9, 2.7])).axis(
            title=("CI Increase by a Factor of"),
            format='~s',
            labelExpr="datum.label + 'x'"
        ),
        color=Color("vif:Q", scale=Scale(scheme="blues", domain=[1, 7])).title("VIF"),
    )
    .properties(height=100, width=400)
)

Visualizing our marginal predictions shows that the model with an interaction fits the data better, as the relationship between \(\text{Sales}\) and \(\text{TV Ad Spending}\) varies across levels of \(\text{Radio Ad Spending}\)

# Predict across the range of observed TV values at Radio values of 0, 20, and 40
marginal_predictions = with_interaction.empredict({"tv": "data", "radio": [0, 20, 40]})

Note

By default .empredict allows you to specify values on the original scale of the predictors, because this is often easier to think about with respect to your data. You can disable this by setting apply_transforms=False, in which case you should pass values on the transformed scale, e.g. the mean-centered values [0, 20, 40] - mean(radio)
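In other words, the equivalent call on the transformed scale would look something like this sketch (assuming the same dict format is accepted):

# Center the radio values yourself and tell empredict not to re-apply transforms
radio_mean = data["radio"].mean()
with_interaction.empredict(
    {"tv": "data", "radio": [v - radio_mean for v in [0, 20, 40]]},
    apply_transforms=False,
)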

ax = sns.scatterplot(
    data=with_interaction.data,
    x="tv",
    y="sales",
    hue="radio",
    size="radio",
)

ax.legend(title='Radio (centered)')
ax = sns.lineplot(
    x="tv", y="prediction", hue="radio", errorbar=None, data=marginal_predictions, ax=ax, legend=False
)
ax.set(xlabel='TV Ad Spending (centered)', ylabel='Sales ($100k)');
sns.despine();

Bootstrapped Confidence Intervals#

It’s easy to get a variety of bootstrapped confidence intervals for any model using .fit(conf_method='boot')

model.fit(conf_method="boot", nboot=500, summary=True)
Formula: lm(sales~tv+radio)
Number of observations: 200
Confidence intervals: boot
Bootstrap Iterations: 500
---------------------
R-squared: 0.8972
R-squared-adj: 0.8962
F(2, 197) = 859.618, p = <.001
Log-likelihood: -386
AIC: 780 | BIC: 793
Residual error: 1.681
Estimate SE CI-low CI-high T-stat df p
(Intercept) 2.921 0.294 2.196 3.520 9.919 197 <.001 ***
tv 0.046 0.001 0.042 0.050 32.909 197 <.001 ***
radio 0.188 0.008 0.169 0.206 23.382 197 <.001 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

Individual bootstraps are stored in the .result_boots attribute of the model by default (you can control this with save_boots=False), which makes it easy to create visualizations:

grid = sns.FacetGrid(data=model.result_boots.unpivot(), col='variable', hue='variable', sharex=False, sharey=False, height=2.5)
grid.map(sns.kdeplot, 'value')
grid.set_xlabels('')
grid.set_titles("{col_name}")
grid.figure.suptitle("Bootstrapped Parameter Distributions")
grid.figure.tight_layout();
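The stored draws also let you recompute intervals yourself, e.g. 95% percentile intervals (a sketch; these may differ slightly from the summary's intervals depending on the bootstrap CI method used):

import polars as pl

# Quantiles of each parameter's bootstrap distribution
(
    model.result_boots.unpivot()
    .group_by("variable")
    .agg(
        pl.col("value").quantile(0.025).alias("ci_low"),
        pl.col("value").quantile(0.975).alias("ci_high"),
    )
)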

Simulating from the model#

Simulating new datasets, i.e. new \(Sales\) values, from a model is just as easy

# Simulate 3 datasets
new_data = model.simulate(n=3)
new_data.head()
shape: (5, 3)
sim_1      sim_2      sim_3
f64        f64        f64
19.50217   21.243817  22.361988
12.654133  15.184968  15.532642
10.932024  15.004646  11.323162
20.299359  17.060741  16.959926
13.77793   9.381602   12.524089
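One simple use of simulations is checking that simulated outcomes resemble the observed data (a sketch):

# Compare observed sales to the average across simulated datasets
sims = model.simulate(n=100)
print(data["sales"].mean(), sims.to_numpy().mean())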