API Reference¶
pymer4.models.Lmer: Lmer¶
Model class for estimating lme4 multi-level models in Python
- class pymer4.models.Lmer(formula, data, family='gaussian')[source]¶
Model class that holds data output from fitting lmer in R and converts it to a Python object. This class stores as much information as it can about a merMod object computed using lmer and lmerTest in R. Most attributes will not be computed until the fit method is called.
- Parameters:
formula (str) – Complete lmer-style model formula
data (pandas.core.frame.DataFrame) – input data
family (string) – what distribution family, i.e. link function, to use for the generalized model; default is gaussian (linear model)
- Variables:
fitted (bool) – whether model has been fit
formula (str) – model formula
data (pd.DataFrame) – model copy of input data
grps (dict) – groups and number of observations per group recognized by lmer
design_matrix (pd.DataFrame) – model design matrix determined by lmer
AIC (float) – model Akaike information criterion
BIC (float) – model Bayesian information criterion
logLike (float) – model Log-likelihood
family (string) – model family
warnings (list) – warnings output from R or Python
ranef (pd.DataFrame/list) – cluster-level differences from population parameters, i.e. difference between coefs and fixefs; returns list if multiple cluster variables are used to specify random effects (e.g. subjects and items)
fixef (pd.DataFrame/list) – cluster-level parameters; returns list if multiple cluster variables are used to specify random effects (e.g. subjects and items)
coefs (pandas.core.frame.DataFrame/list) – model summary table of population parameters
ranef_var (pd.DataFrame) – random effects variances
ranef_corr (pd.DataFrame) – random effects correlations
residuals (numpy.ndarray) – model residuals
fits (numpy.ndarray) – model fits/predictions
model_obj (lmer model) – rpy2 lmer model object
factors (dict) – factors used to fit the model if any
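Example
A minimal construction-and-fit sketch; the dataframe df and the column names DV, IV1, and Group are hypothetical placeholders:
>>> from pymer4.models import Lmer
>>> model = Lmer('DV ~ IV1 + (IV1|Group)', data=df)
>>> model.fit()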
- anova(force_orthogonal=False)[source]¶
Return a type-3 ANOVA table from a fitted model. As in R, this method does not itself guarantee that contrasts are orthogonal, which is required for correct type-3 SS computation. However, the force_orthogonal flag will refit the model with orthogonal polynomial contrasts, automatically guaranteeing valid type-3 SS inferences. Note that this will overwrite factors specified in the last call to .fit()
- Parameters:
force_orthogonal (bool) – whether factors in the model should be recoded using polynomial contrasts to ensure valid type-3 SS calculations. If set to True, previous factor specifications will be saved in model.factors_prev_; default False
- Returns:
Type 3 ANOVA results
- Return type:
pd.DataFrame
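Example
Request orthogonal polynomial recoding to guarantee valid type-3 SS (note this refits the model):
>>> model.anova(force_orthogonal=True)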
- confint(parm=None, level=0.95, method='Wald', zeta=None, nsim=500, boot_type='perc', quiet=False, oldnames=False, seed=None)[source]¶
Compute confidence intervals on the parameters of a Lmer object (this is a wrapper for confint.merMod in lme4).
- Parameters:
parm (list) – parameter names for which intervals are sought, specified by an integer vector of positions (leave blank to compute CIs for all parameters)
level (float) – confidence level <1, typically above 0.90
method (str) – which method to compute confidence intervals; ‘profile’, ‘Wald’ (default), or ‘boot’ (parametric bootstrap)
zeta (float) – (for method = “profile” only:) likelihood cutoff (if not specified, as by default, computed from level in R).
nsim (int) – number of bootstrap samples if bootstrapped confidence intervals are requested; default 500
boot_type (str) – bootstrap confidence interval type (one of “perc”,”basic”,”norm”, as defined in boot_ci in R)
quiet (bool) – suppress messages about computationally intensive profiling
oldnames (bool) – use old-style names for variance-covariance parameters, e.g. “.sig01”, rather than newer (more informative) names such as “sd_(Intercept)|Subject”
seed (int) – seed to be passed to bootMer for repeatability.
- Returns:
confidence intervals for the parameters of interest
- Return type:
pd.DataFrame
Examples
The following examples demonstrate how to get different types of confidence intervals.
The default Wald estimates for all parameters
>>> model.confint()
Bootstrap estimates for the variance component of the random intercept from factor “Group” and the error variance. You will need more than 100 resamples for a real application
>>> model.confint(method="boot", nsim=100, parm=["sd_(Intercept)|Group","sigma"])
Same as above but using oldnames for those variances, which is still the default in lme4
>>> model.confint(method="boot", nsim=100, oldnames=True, parm=[".sig01",".sigma"])
- fit(conf_int='Wald', n_boot=500, factors=None, permute=False, ordered=False, verbose=False, REML=True, rank=False, rank_group='', rank_exclude_cols=[], no_warnings=False, control='', old_optimizer=False, **kwargs)[source]¶
Main method for fitting the model object. Will modify the model’s data attribute to add columns for residuals and fits for convenience. Factors should be specified as a dictionary with values as a list or themselves a dictionary of human-readable contrasts, not R-style contrast codes, as these will be auto-converted for you. See the factors docstring and examples below. After fitting, the .factors attribute will store a reference to the user-specified dictionary. The .contrast_codes model attribute will store the requested comparisons in converted R format. Note that Lmer estimate naming conventions differ a bit from R: Lmer.coefs = summary(model); Lmer.fixef = coef(model); Lmer.ranef = ranef(model)
- Parameters:
conf_int (str) – which method to compute confidence intervals; ‘profile’, ‘Wald’ (default), or ‘boot’ (parametric bootstrap)
n_boot (int) – number of bootstrap samples if bootstrapped confidence intervals are requested; default 500
factors (dict) – dictionary with column names specified as keys and values as a list for dummy/treatment/polynomial contrasts, or a dict with keys as factor levels and values as desired comparisons in human-readable format. See examples below
permute (int) – if non-zero, computes parameter significance tests by permuting test statistics rather than parametrically. Permutation is done by shuffling observations within clusters to respect the random effects structure of the data.
ordered (bool) – whether factors should be treated as ordered polynomial contrasts; this will parameterize a model with K-1 orthogonal polynomial regressors beginning with a linear contrast based on the factor order provided. Ordering applies to all contrasts!; default is False
summarize/summary (bool) – whether to print a model summary after fitting; default is True
verbose (bool) – whether to print when and which model and confidence interval are being fitted
REML (bool) – whether to fit using restricted maximum likelihood estimation instead of maximum likelihood estimation; default True
rank (bool) – convert predictors in the model formula to ranks by group prior to estimation. The model object will still contain the original data, not the ranked data; default False
rank_group (str) – column name to group data on prior to rank conversion
rank_exclude_cols (list/str) – columns in model formula to not apply rank conversion to
no_warnings (bool) – turn off auto-printing warnings messages; warnings are always stored in the .warnings attribute; default False
control (str) – string containing options to be passed to (g)lmer control. See https://bit.ly/2OQONTH for options
old_optimizer (bool) – use the old bobyqa optimizer that was the default in lme4 <= 1.1-20, i.e. prior to 02/04/2019. This is not compatible with the control setting as it’s meant to be a quick shorthand (e.g. to reproduce previous model results). However, the same setting can be manually requested using the control option if preferred. (For optimizer change discussions see: https://bit.ly/2MrP9Nq and https://bit.ly/2Vx5jte )
- Returns:
R/statsmodels style summary
- Return type:
pd.DataFrame
Examples
The following examples demonstrate how to treat variables as categorical factors.
Dummy-Coding: Treat Col1 as a factor with 3 levels: A, B, C. Use dummy-coding with A as the reference level. The model intercept will be the mean of A, and the parameters will be B-A and C-A.
>>> model.fit(factors = {"Col1": ['A','B','C']})
Orthogonal Polynomials: Treat Col1 as a factor with 3 levels: A, B, C. Estimate a linear contrast of C > B > A. The model intercept will be the grand mean of all levels, and the parameters will be the linear contrast and an orthogonal polynomial contrast (auto-computed).
>>> model.fit(factors = {"Col1": ['A','B','C']}, ordered=True)
Custom-contrast: Treat Col1 as a factor with 3 levels: A, B, C. Compare A to the mean of B and C. The model intercept will be the grand mean of all levels, and the parameters will be the desired contrast, as well as an automatically determined orthogonal contrast.
>>> model.fit(factors = {"Col1": {'A': 1, 'B': -.5, 'C': -.5}})
Here is an example specifying stricter deviance and parameter value stopping criteria.
>>> model.fit(control="optCtrl = list(ftol_abs=1e-8, xtol_abs=1e-8)")
Here is an example specifying a different optimizer in addition to stricter deviance and parameter value stopping criteria.
>>> model.fit(control="optimizer='Nelder_Mead', optCtrl = list(FtolAbs=1e-8, XtolRel=1e-8)")
Here is an example using the default optimization in previous versions of lme4 prior to the 2019 update.
>>> model.fit(old_optimizer=True)
- plot(param, figsize=(8, 6), xlabel='', ylabel='', plot_fixef=True, plot_ci=True, grps=[], ax=None)[source]¶
Plot random and group level parameters from a fitted model
- Parameters:
param (str) – model parameter (column name) to plot
figsize (tup) – matplotlib desired figsize
xlabel (str) – x-axis label
ylabel (str) – y-axis label
plot_fixef (bool) – plot population effect fit of param?; default True
plot_ci (bool) – plot computed ci’s of population effect?; default True
grps (list) – plot specific group fits only; must correspond to index values in model.fixef
ax (matplotlib.axes.Axes) – axis handle for an existing plot; if provided will ensure that random parameter plots appear behind all other plot objects.
- Returns:
matplotlib axis handle
- Return type:
plt.axis
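Example
A minimal sketch, where 'IV1' is a hypothetical predictor from the fitted model:
>>> model.plot('IV1', plot_ci=True)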
- plot_summary(figsize=(12, 6), error_bars='ci', ranef=True, axlim=None, plot_intercept=True, ranef_alpha=0.5, coef_fmt='o', orient='v', ranef_idx=0)[source]¶
Create a forest plot overlaying estimated coefficients with random effects (i.e. BLUPs). By default displays the 95% confidence intervals computed during fitting.
- Parameters:
error_bars (str) – one of ‘ci’ or ‘se’ to change which error bars are plotted; default ‘ci’
ranef (bool) – overlay BLUP estimates on figure; default True
axlim (tuple) – lower and upper limit of plot; default min and max of BLUPs
plot_intercept (bool) – plot the intercept estimate; default True
ranef_alpha (float) – opacity of random effect points; default .5
coef_fmt (str) – matplotlib marker style for population coefficients
ranef_idx (int) – if multiple random effects clusters were specified this value indicates which one should be plotted; uses 0-based indexing; default 0 (first)
- Returns:
matplotlib axis handle
- Return type:
plt.axis
- post_hoc(marginal_vars, grouping_vars=None, p_adjust='tukey', summarize=True, dof_asymptotic_at=99999, verbose=False)[source]¶
Post-hoc pairwise tests corrected for multiple comparisons (Tukey method) implemented using the emmeans package. This method provides both marginal means/trends and marginal pairwise differences. More info can be found at: https://cran.r-project.org/web/packages/emmeans/emmeans.pdf
- Parameters:
marginal_vars (str/list) – what variable(s) to compute marginal means/trends for; unique combinations of factor levels of these variable(s) will determine family-wise error correction
grouping_vars (str/list) – what variable(s) to group on. Trends/means/comparisons of other variable(s), will be computed at each level of these variable(s)
p_adjust (str) – multiple comparisons adjustment method. One of: tukey, bonf, fdr, hochberg, hommel, holm, dunnet, mvt (monte-carlo multi-variate T, aka exact tukey/dunnet). Default tukey
summarize (bool) – whether to output effects and contrasts (always stored in the model object as model.marginal_estimates and model.marginal_contrasts); default True
dof_asymptotic_at (int, optional) – at what number of observations to begin assuming a normal distribution and return z-stats instead of t-stats; default 100000, which differs from the emmeans default of 3000
verbose (bool) – whether to print R messages to the console
- Returns:
marginal_estimates (pd.DataFrame): unique factor level effects (e.g. means/coefs)
marginal_contrasts (pd.DataFrame): contrasts between factor levels
- Return type:
Multiple
Examples
Pairwise comparison of means of A at each level of B
>>> model.post_hoc(marginal_vars='A',grouping_vars='B')
Pairwise differences of slopes of C between levels of A at each level of B
>>> model.post_hoc(marginal_vars='C',grouping_vars=['A','B'])
Pairwise differences of each unique A,B cell
>>> model.post_hoc(marginal_vars=['A','B'])
- predict(data, use_rfx=True, pred_type='response', skip_data_checks=False, verify_predictions=True, verbose=False)[source]¶
Make predictions given new data. Input must be a dataframe that contains the same columns as the model.matrix excluding the intercept (i.e. all the predictor variables used to fit the model). If using random effects to make predictions, the input data must also contain a column for the group identifier that was used to fit the model random effects terms. Using random effects to make predictions only makes sense if predictions are being made about the same groups/clusters. If any predictors are categorical, you can skip verifying column names by setting skip_data_checks=True.
- Parameters:
data (pandas.core.frame.DataFrame) – input data to make predictions on
use_rfx (bool) – whether to condition on random effects when making predictions (to retrieve fits for the original data instead, access these directly from model.fits or model.data['fits']); default True
pred_type (str) – whether the prediction should be on the 'response' scale (default) or on the 'link' scale of the predictors passed through the link function (e.g. log-odds scale in a logit model instead of probability values)
skip_data_checks (bool) – whether to skip checks that the input data have the same columns as the original data the model was trained on. If predicting using a model trained with categorical variables it can be helpful to set this to True
verify_predictions (bool) – whether to ensure that the predicted data are not identical to the original model fits. Only useful to set this to False when making predictions on the same data the model was fit on, but it's faster; default True
verbose (bool) – whether to print R messages to console
- Returns:
prediction values
- Return type:
np.ndarray
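Example
A minimal sketch, where new_df is a hypothetical dataframe containing the model's predictor columns and the group identifier column:
>>> preds = model.predict(new_df, use_rfx=True)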
- simulate(num_datasets, use_rfx=True, verbose=False)[source]¶
Simulate new responses based upon estimates from a fitted model. By default group/cluster means for simulated data will match those of the original data. Unlike predict, this is a non-deterministic operation because lmer will sample random-effects values for all groups/clusters and then sample data points from their respective conditional distributions.
- Parameters:
num_datasets (int) – number of simulated datasets to generate. Each simulation always generates a dataset that matches the size of the original data
use_rfx (bool) – whether to match group/cluster means in simulated data
verbose (bool) – whether to print R messages to console
- Returns:
simulated data values
- Return type:
np.ndarray
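Example
Generate 5 simulated datasets, each matching the size of the original data:
>>> sims = model.simulate(5, use_rfx=True)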
pymer4.models.Lm: Lm¶
Model class for estimating standard regression models
- class pymer4.models.Lm(formula, data, family='gaussian')[source]¶
Model class to perform OLS regression. Formula specification works just like in R based on columns of a dataframe. Formulae are parsed by patsy, which makes it easy to utilize specific columns as factors. This is different from Lmer. See patsy for more information on the different use cases.
- Parameters:
formula (str) – Complete lm-style model formula
data (pandas.core.frame.DataFrame) – input data
family (string) – what distribution family, i.e. link function, to use for the generalized model; default is gaussian (linear model)
- Variables:
fitted (bool) – whether model has been fit
formula (str) – model formula
data (pd.DataFrame) – model copy of input data
design_matrix (pd.DataFrame) – model design matrix determined by patsy
AIC (float) – model akaike information criterion
logLike (float) – model Log-likelihood
family (string) – model family
warnings (list) – warnings output from Python
coefs (pd.DataFrame) – model summary table of parameters
residuals (numpy.ndarray) – model residuals
fits (numpy.ndarray) – model fits/predictions
estimator (string) – ‘OLS’ or ‘WLS’
se_type (string) – how standard errors are computed
sig_type (string) – how inference is performed
- fit(robust=False, conf_int='standard', permute=False, rank=False, verbose=False, n_boot=500, n_jobs=1, n_lags=1, cluster=None, weights=None, wls_dof_correction=True, **kwargs)[source]¶
Fit a variety of OLS models. By default will fit a model that makes parametric assumptions (under a t-distribution) replicating the output of software like R. 95% confidence intervals (CIs) are also estimated parametrically by default. However, empirical bootstrapping can also be used to compute CIs; this procedure resamples with replacement from the data themselves, not residuals or data generated from fitted parameters, and will be used for inference unless permutation tests are requested. Permutation testing will shuffle observations to generate a null distribution of t-statistics to perform inference on each regressor (permuted t-test).
Alternatively, OLS robust to heteroscedasticity can be fit by computing sandwich standard error estimates (good ref: https://bit.ly/2VRb7jK). This is similar to Stata’s robust routine. Of the choices below, ‘hc1’ or ‘hc3’ are amongst the more popular. Robust estimators include:
‘hc0’: Huber (1967) original sandwich estimator
‘hc1’: Hinkley (1977) DOF adjustment to ‘hc0’ to account for small sample sizes (default)
‘hc2’: different kind of small-sample adjustment of ‘hc0’ by leverage values in hat matrix
‘hc3’: MacKinnon and White (1985) HC3 sandwich estimator; provides more robustness in smaller samples than hc2, Long & Ervin (2000)
‘hac’: Newey-West (1987) estimator for robustness to heteroscedasticity as well as serial auto-correlation at given lags.
‘cluster’: cluster-robust standard errors (see Cameron & Miller 2015 for review). Provides robustness to errors that cluster according to specific groupings (e.g. repeated observations within a person/school/site). This acts as a post-modeling “correction” for what a multi-level model explicitly estimates, and is popular in the econometrics literature. The DOF correction differs slightly from Stata/statsmodels, which use num_clusters - 1, whereas pymer4 uses num_clusters - num_coefs
Finally, weighted least squares (WLS) can be computed as an alternative to heteroscedasticity-robust standard errors. This can be estimated by providing an array or series of weights (1 / variance of each group) with the same length as the number of observations, or a column to use to compute group variances (which can be the same as the predictor column). This is often useful if some predictor(s) is categorical (e.g. dummy-coded) and taking into account unequal group variances is desired (i.e. in the simplest case this would be equivalent to performing Welch’s t-test).
- Parameters:
robust (bool/str) – whether to use heteroscedasticity-robust s.e. and optionally which estimator type to use (‘hc0’, ‘hc1’, ‘hc2’, ‘hc3’, ‘hac’, ‘cluster’). If robust = True, the default robust estimator is ‘hc1’; default False
conf_int (str) – whether confidence intervals should be computed through bootstrap (‘boot’) or assuming a t-distribution (‘standard’); default ‘standard’
permute (int) – if non-zero, computes parameter significance tests by permuting t-statistics rather than parametrically; works with robust estimators
rank (bool) – convert all predictors and dependent variable to ranks before estimating model; default False
summarize (bool) – whether to print a model summary after fitting; default True
verbose (bool) – whether to print which model, standard error, confidence interval, and inference type are being fitted
n_boot (int) – how many bootstrap resamples to use for confidence intervals (ignored unless conf_int=’boot’)
n_jobs (int) – number of cores for parallelizing bootstrapping or permutations; default 1
n_lags (int) – number of lags for robust estimator type ‘hac’ (ignored unless robust=’hac’); default 1
cluster (str) – column name identifying clusters/groups for robust estimator type ‘cluster’ (ignored unless robust=’cluster’)
weights (string/pd.Series/np.ndarray) – weights to perform WLS instead of OLS. Pass in a column name in data to use to compute group variances and automatically adjust dof. Otherwise provide an array or series containing 1 / variance of each observation, in which case dof correction will not occur.
wls_dof_correction (bool) – whether to apply Welch-Satterthwaite approximate correction for dof when using weights based on an existing column in the data, ignored otherwise. Set to False to force standard dof calculation
- Returns:
R/statsmodels style summary
- Return type:
pd.DataFrame
Examples
Simple multiple regression model with parametric assumptions
>>> model = Lm('DV ~ IV1 + IV2', data=df)
>>> model.fit()
Same as above but with robust standard errors
>>> model.fit(robust='hc1')
Same as above but with cluster-robust standard errors. The cluster argument should refer to a column in the dataframe.
>>> model.fit(robust='cluster', cluster='Group')
Simple regression with categorical predictor, i.e. between groups t-test assuming equal variances
>>> model = Lm('DV ~ Group', data=df)
>>> model.fit()
Same as above but don’t assume equal variances and have pymer4 compute the between group variances automatically, i.e. WLS (preferred).
>>> model.fit(weights='Group')
Manually compute the variance of each group and use the inverse of that as the weights. In this case WLS is estimated but dof correction won’t be applied because it’s not trivial to compute.
>>> weights = 1 / df.groupby("Group")['DV'].transform(np.var, ddof=1)
>>> model.fit(weights=weights)
- predict(data, pred_type='response')[source]¶
Make predictions given new data. Input must be a dataframe that contains the same columns as the model.matrix excluding the intercept (i.e. all the predictor variables used to fit the model). Will automatically use/ignore intercept to make a prediction if it was/was not part of the original fitted model.
- Parameters:
data (pd.DataFrame) – input data to make predictions on
pred_type (str) – whether the prediction should be on the 'response' scale (default) or on the 'link' scale of the predictors passed through the link function (e.g. log-odds scale in a logit model instead of probability values), or 'probs' if self.family == 'binomial'
- Returns:
prediction values
- Return type:
np.ndarray
- summary()[source]¶
Summarize the output of a fitted model.
- Returns:
R/statsmodels style summary
- Return type:
pd.DataFrame
- to_corrs(corr_type='semi', ztrans_corrs=False)[source]¶
Transform fitted model coefficients (excluding the intercept) to partial or semi-partial correlations with the dependent variable. This is useful for rescaling coefficients to a correlation scale (-1 to 1) and does not change how inferences are performed. Semi-partial correlations are computed as the correlation between a DV and each predictor after the influence of all other predictors has been regressed out from that predictor. They are interpretable in the same way as the original coefficients. Partial correlations reflect the unique variance a predictor explains in the DV, accounting for correlations between predictors and what is not explained by other predictors; this value is always >= the semi-partial correlation. They are not interpretable in the same way as the original coefficients. Partial correlations are computed as the correlations between a DV and each predictor after the influence of all other predictors has been regressed out from both that predictor and the DV. Good ref: https://bit.ly/2GNwXh5
- Parameters:
corr_type (string) – ‘semi’ or ‘partial’
ztrans_corrs (bool) – whether to Fisher z-transform (arctanh) the correlations before reporting them; default False
- Returns:
partial or semi-partial correlations
- Return type:
pd.Series
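Example
Rescale fitted coefficients to semi-partial correlations with the DV:
>>> model.to_corrs(corr_type='semi')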
pymer4.models.Lm2: Lm2¶
Model class for estimating multi-level models in python using the summary-statistics approach
- class pymer4.models.Lm2(formula, data, group, family='gaussian')[source]¶
Model class to perform two-stage OLS regression. Practically, this class fits a separate Lm() model to each cluster/group in the data and performs inference on the coefficients of each model (i.e. a 1-sample t-test per coefficient). The results from this second-level regression are reported. This is an alternative to using Lmer, as it implicitly allows intercepts and slopes to vary by group, however with no prior/smoothing/regularization on the random effects. See https://bit.ly/2SwHhQU and Gelman (2005). This approach may be less preferable to Lmer if the number of observations per group is few but the number of groups is large, in which case the 1st-level estimates are much noisier and are not smoothed/regularized as in Lmer. It may be preferable when a “maximal” rfx Lmer model is not estimable. Formula specification works just like in R based on columns of a dataframe. Formulae are parsed by patsy, which makes it easy to utilize specific columns as factors. This is different from Lmer. See patsy for more information on the different use cases.
- Parameters:
formula (str) – Complete lm-style model formula
data (pd.DataFrame) – input data
family (string) – what distribution family, i.e. link function, to use for the generalized model; default is gaussian (linear model)
group (list/string) – the grouping variable to use to run the 1st-level regression; if a list is provided will run multiple levels feeding the coefficients from the previous level into the subsequent level
- Variables:
fitted (bool) – whether model has been fit
formula (str) – model formula
data (pandas.core.frame.DataFrame) – model copy of input data
grps (dict) – groups and number of observations per groups recognized by lmer
AIC (float) – model akaike information criterion
logLike (float) – model Log-likelihood
family (string) – model family
warnings (list) – warnings output from Python
fixef (pd.DataFrame) – cluster-level parameters
coefs (pd.DataFrame) – model summary table of population parameters
residuals (numpy.ndarray) – model residuals
fits (numpy.ndarray) – model fits/predictions
se_type (string) – how standard errors are computed
sig_type (string) – how inference is performed
- fit(robust=False, conf_int='standard', permute=False, perm_on='t-stat', rank=False, verbose=False, n_boot=500, n_jobs=1, n_lags=1, to_corrs=False, ztrans_corrs=True, cluster=None, **kwargs)[source]¶
Fit a variety of second-level OLS models; all 1st-level models are standard OLS. By default will fit a model that makes parametric assumptions (under a t-distribution) replicating the output of software like R. 95% confidence intervals (CIs) are also estimated parametrically by default. However, empirical bootstrapping can also be used to compute CIs; this resamples with replacement from the first-level regression estimates and uses these CIs to perform inference unless permutation tests are requested. Permutation testing will perform a one-sample sign-flipped permutation test on the estimates directly (perm_on='coef') or the t-statistic (perm_on='t-stat'). Permutation is a bit different from Lm, which always permutes based on the t-stat.
Heteroscedasticity-robust standard errors can also be computed, but these are applied at the second level, not at the first level. See the Lm() documentation for more information about robust standard errors.
- Parameters:
robust (bool/str) – whether to use heteroscedasticity robust s.e. and optionally which estimator type to use (‘hc0’,’hc3’,’hac’,’cluster’). If robust = True, default robust estimator is ‘hc0’; default False
conf_int (str) – whether confidence intervals should be computed through bootstrap (‘boot’) or assuming a t-distribution (‘standard’); default ‘standard’
permute (int) – if non-zero, computes parameter significance tests by permuting t-statistics rather than parametrically; works with robust estimators
perm_on (str) – permute based on a null distribution of the ‘coef’ of first-level estimates or the ‘t-stat’ of first-level estimates; default ‘t-stat’
rank (bool) – convert all predictors and dependent variable to ranks before estimating model; default False
to_corrs (bool/string) – for each first-level model, estimate semi-partial or partial correlations instead of betas and perform inference over these partial correlation coefficients. Note this is different from Lm(); default False
ztrans_corrs (bool) – whether to Fisher z-transform (arctanh) first-level correlations before running the second-level model. Ignored if to_corrs is False; default True
summarize (bool) – whether to print a model summary after fitting; default True
verbose (bool) – whether to print which model, standard error, confidence interval, and inference type are being fitted
n_boot (int) – how many bootstrap resamples to use for confidence intervals (ignored unless conf_int=’boot’)
n_jobs (int) – number of cores for parallelizing bootstrapping or permutations; default 1
n_lags (int) – number of lags for robust estimator type ‘hac’ (ignored unless robust=’hac’); default 1
cluster (str) – column name identifying clusters/groups for robust estimator type ‘cluster’ (ignored unless robust=’cluster’)
- Returns:
R style summary() table
- Return type:
DataFrame
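Examples
A minimal two-stage sketch; the dataframe df and the column names DV, IV1, and Group are hypothetical placeholders:
>>> model = Lm2('DV ~ IV1', group='Group', data=df)
>>> model.fit()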
- plot_summary(figsize=(12, 6), error_bars='ci', ranef=True, axlim=None, ranef_alpha=0.5, coef_fmt='o', orient='v', **kwargs)[source]¶
Create a forestplot overlaying estimated coefficients with first-level effects. By default display the 95% confidence intervals computed during fitting.
- Parameters:
error_bars (str) – one of ‘ci’ or ‘se’ to change which error bars are plotted; default ‘ci’
ranef (bool) – overlay BLUP estimates on figure; default True
axlim (tuple) – lower and upper limit of plot; default min and max of BLUPs
ranef_alpha (float) – opacity of random effect points; default .5
coef_fmt (str) – matplotlib marker style for population coefficients
- Returns:
matplotlib axis handle
- Return type:
plt.axis
pymer4.simulate: Simulation Functions¶
Functions for generating data for use with various model types
- pymer4.simulate.easy_multivariate_normal(num_obs, num_features, corrs, mu=0.0, sigma=1.0, seed=None, forcePSD=True, return_new_corrs=False, nit=100)[source]¶
Function to more easily generate multivariate normal samples provided a correlation matrix or list of correlations (upper triangle of the correlation matrix) instead of a covariance matrix. Defaults to returning approximately standard normal (mu = 0; sigma = 1) variates. Unlike numpy, if the desired correlation matrix is not positive-semi-definite, this will by default issue a warning, find the nearest PSD correlation matrix, and generate data with this matrix. This new matrix can optionally be returned using the return_new_corrs argument.
- Parameters:
num_obs (int) – number of observations/samples to generate (rows)
corrs (ndarray/list/float) – num_features x num_features 2d array, flattened numpy array of length (num_features * (num_features-1)) / 2, or a scalar for the same correlation on all off-diagonals
num_features (int) – number of features/variables/dimensions to generate (columns)
mu (float/list) – mean of each feature across observations; default 0.0
sigma (float/list) – sd of each feature across observations; default 1.0
forcePSD (bool) – whether to find and use a new correlation matrix if the requested one is not positive semi-definite; default True
return_new_corrs (bool) – return the nearest correlation matrix that is positive semi-definite used to generate data; default False
nit (int) – number of iterations to search for the nearest positive-semi-definite correlation matrix if the requested correlation matrix is not PSD; default 100
- Returns:
2d numpy array of correlated data organized as num_obs x num_features
- Return type:
np.ndarray
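Example
Generate 100 observations of 3 features that each correlate about .3 with one another:
>>> from pymer4.simulate import easy_multivariate_normal
>>> X = easy_multivariate_normal(num_obs=100, num_features=3, corrs=.3, seed=1)
>>> X.shape
(100, 3)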
- pymer4.simulate.simulate_lm(num_obs, num_coef, coef_vals=None, corrs=None, mus=0.0, sigmas=1.0, noise_params=(0, 1), family='gaussian', seed=None)[source]¶
Function to quickly simulate a regression model dataset, with continuous predictors. Provided a number of observations, number of coefficients, and optionally correlations between predictors, means, and standard deviations of predictors, returns a pandas dataframe with simulated data that can be used to estimate a linear regression using Lm(). Using the family=’binomial’ argument can generate discrete dependent variable values for use with logistic regression.
Defaults to returning standard normal (mu = 0; sigma = 1) predictors with no explicit correlations.
- Parameters:
num_obs (int) – number of total observations, i.e. rows of data
num_coef (int) – number of coefficients/regressors, i.e. columns of data
coef_vals (list,optional) – “true” values of coefficients to generate data. If not provided will be randomly generated. Must include a coefficient for the intercept as well (i.e. mean of data)
corrs (ndarray,list,float) – correlations between coefficients provided as a 2d num_coef x num_coef array, a 1d flattened numpy array/list of length (num_features * (num_features-1)) / 2, or a float to be treated as the same correlation between all coefficients
mus (float/list/ndarray) – means of columns of predictors
sigmas (float/list/ndarray) – stds of columns of predictors
noise_params (tup, optional) – mean and std of noise added to simulated data
family (str) – distribution family for the dependent variable. Currently only ‘gaussian’ (continuous DV) or ‘binomial’ (discrete DV) are available.
seed (int) – seed for reproducible random number generation
- Returns:
data (pd.DataFrame): dataframe organized as num_obs x num_coef
coefs (np.array): ground-truth coefficient values
- Return type:
Multiple
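Example
Simulate and recover a 3-predictor regression; the generated column names (DV, IV1-IV3) follow the pymer4 tutorials:
>>> from pymer4.simulate import simulate_lm
>>> df, coefs = simulate_lm(num_obs=500, num_coef=3, seed=1)
>>> model = Lm('DV ~ IV1 + IV2 + IV3', data=df)
>>> model.fit()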
- pymer4.simulate.simulate_lmm(num_obs, num_coef, num_grps, coef_vals=None, corrs=None, grp_sigmas=0.25, mus=0.0, sigmas=1.0, noise_params=(0, 1), family='gaussian', seed=None)[source]¶
Function to quickly simulate a multi-level regression model dataset, with continuous predictors. Provided a number of observations, number of coefficients, number of groups/clusters, and optionally correlations between predictors, means, and standard deviations of predictors, returns a pandas dataframe with simulated data that can be used to estimate a multi-level model using Lmer(). Using the family=’binomial’ argument can generate discrete dependent variable values for use with logistic multi-level models.
Defaults to returning standard normal (mu = 0; sigma = 1) predictors with no explicit correlations and low variance between groups (sigma = .25).
- Parameters:
num_obs (int) – number of observations per cluster/stratum/group
num_coef (int) – number of coefficients/regressors, i.e. columns of data
num_grps (int) – number of cluster/stratums/groups
coef_vals (list,optional) – “true” values of coefficients to generate data. If not provided will be randomly generated. Must include a coefficient for the intercept as well (i.e. mean of data)
corrs (ndarray,list,float) – correlations between coefficients provided as a 2d num_coef x num_coef array, a 1d flattened numpy array/list of length (num_features * (num_features-1)) / 2, or a float to be treated as the same correlation between all coefficients
grp_sigmas (int or list) – group-level std around population coefficient values; can be a single value, in which case the same std is applied around all coefficients, or a list for different stds; default .25
mus (float/list/ndarray) – means of columns of predictors
sigmas (float/list/ndarray) – stds of columns of predictors
noise_params (tup, optional) – mean and std of noise added to each group’s simulated data
family (str) – distribution family for the dependent variable. Currently only ‘gaussian’ (continuous DV) or ‘binomial’ (discrete DV) are available.
seed (int) – seed for reproducible random number generation
- Returns:
data (pd.DataFrame): dataframe organized as num_obs x num_coef
blups (pd.DataFrame): ground-truth group/cluster level coefficients, organized as num_grps x num_coef (i.e. BLUPs)
coefs (np.array): ground-truth population-level coefficients
- Return type:
Multiple
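Example
Simulate 50 observations per group for 10 groups with 2 predictors; the generated column names (DV, IV1, IV2, Group) follow the pymer4 tutorials:
>>> from pymer4.simulate import simulate_lmm
>>> df, blups, coefs = simulate_lmm(num_obs=50, num_coef=2, num_grps=10, seed=1)
>>> model = Lmer('DV ~ IV1 + IV2 + (IV1 + IV2 | Group)', data=df)
>>> model.fit()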
pymer4.stats: Statistics Functions¶
General purpose functions for various parametric and non-parametric statistical routines
- pymer4.stats.boot_func(x, y=None, func=None, func_args={}, paired=False, n_boot=500, n_jobs=1, seed=None)[source]¶
Bootstrap an arbitrary function by resampling from x and y independently or jointly.
- Parameters:
x (list-like) – list of values for first group
y (list-like) – list of values for second group; optional
func (callable) – function that accepts x (and y if provided)
paired (bool) – whether to resample x and y jointly or independently
n_boot (int) – number of bootstrap iterations
n_jobs (int) – number of parallel cores; default 1
- Returns:
original_stat (float): function result with given data
ci (np.array): lower and upper bounds of 95% confidence intervals
- Return type:
Multiple
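Example
A minimal sketch bootstrapping a difference in means, where x and y are arrays of observations and mean_diff is a hypothetical user-defined statistic:
>>> import numpy as np
>>> from pymer4.stats import boot_func
>>> def mean_diff(x, y):
...     return np.mean(x) - np.mean(y)
>>> stat, ci = boot_func(x, y, func=mean_diff, n_boot=500)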
- pymer4.stats.cohens_d(x, y=None, paired=False, n_boot=1000, equal_var=False, value=0, n_jobs=1, seed=None)[source]¶
Compute Cohen’s d for one or two samples (paired or independent). For paired samples Cohen’s Dz is computed (ref: https://bit.ly/2J54P61). If x and y are not the same size this will use the same pooled SD calculation as in Welch’s t-test to account for unequal variances. The unequal variance calculation will almost always produce a smaller estimate than the standard formula, except as the variance of the group with fewer observations increases, in which case this estimate can be larger than the standard formula. This can be turned off with the equal_var=True argument. Percentile bootstrapped confidence intervals can also be returned.
- Parameters:
x (list-like) – array or list of observations from first group
y (list-like) – array or list of observations from second group or second set of observations from the same group; optional
paired (bool) – whether to treat x and y (if provided) as paired or independent
n_boot (int) – number of bootstrap samples to run; set to 0 to skip computing
equal_var (bool) – whether to assume equal variances; if False (default), uses the Welch-style pooled SD calculation described above
value (float) – a value to test the effect size against; effect size minus value will be computed; default 0
n_jobs (int) – number of parallel cores to use for bootstrapping; default 1
seed (int or None) – numerical seed for reproducibility of bootstrapping
- Returns:
effect_size (float): cohen’s d
ci (np.array): lower and upper bounds of 95% bootstrapped confidence intervals; optional
- Return type:
Multiple
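Example
Cohen's d with percentile bootstrapped CIs for two independent samples x and y:
>>> from pymer4.stats import cohens_d
>>> d, ci = cohens_d(x, y, paired=False, n_boot=1000, seed=1)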
- pymer4.stats.correct_pvals(ps, method='holm')[source]¶
Provided a list of p-values, return a corrected list according to the specified method. You can then compare this list against a chosen alpha (e.g. .05). Closely follows the implementation in statsmodels.stats.multitest.multipletests
- Parameters:
ps (list/np.ndarray) – list of p-values
method (str, optional) – correction method; defaults to ‘holm’ for the Holm-Bonferroni step-down procedure
- Returns:
adjusted p-values
- Return type:
np.ndarray
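Example
Adjust a list of p-values and compare against a chosen alpha:
>>> from pymer4.stats import correct_pvals
>>> adjusted = correct_pvals([0.01, 0.03, 0.05], method='holm')
>>> adjusted < 0.05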
- pymer4.stats.discrete_inverse_logit(arr)[source]¶
Apply a discretized inverse logit transform to an array of values. Useful for converting normally distributed values to binomial classes.
- Parameters:
arr (np.array) – 1d numpy array
- Returns:
transformed values
- Return type:
np.array
- pymer4.stats.lrt(models, refit=True)[source]¶
Compute a likelihood ratio test between two Lmer models. This produces identical results to R’s anova() function when comparing models. Will automatically determine the model order based on comparing all models to the one that has the fewest parameters.
Possible additions: (1) generalize the function to perform LRT or Vuong tests; (2) offer nested and non-nested Vuong tests, as well as AIC/BIC correction; (3) given a single model, expand out to all separate term tests.
- Args:
models (list): a list of two Lmer models to be compared
refit (bool): should REML models be refitted as ML before comparison (defaults to True)
- Returns:
df (pandas.DataFrame): dataframe of the anova results
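Example
Compare two nested models, where model1 and model2 are hypothetical fitted Lmer models:
>>> from pymer4.stats import lrt
>>> results = lrt([model1, model2], refit=True)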
- pymer4.stats.perm_test(x, y=None, stat='tstat', n_perm=1000, equal_var=False, tails=2, return_dist=False, n_jobs=1, seed=None)[source]¶
General purpose permutation test between two samples. Can handle a wide variety of permutation tests including t-test, paired t-test, mean difference test, Cohen’s d, Pearson r, and Spearman r.
- Parameters:
x (list-like) – array or list of observations from first group
y (list-like) – array or list of observations from second group
stat (string) – one of [‘tstat’, ‘tstat-paired’, ‘mean’, ‘cohensd’, ‘pearsonr’, ‘spearmanr’]; ‘mean’ will just compute permutations on the difference between the mean of x and mean of y. Useful if statistics are precomputed (e.g. x and y contain correlation values, or t-stats).
n_perm (int) – number of permutations; set to 0 to return parametric results
equal_var (bool) – whether to assume equal variances for ‘tstat’ and ‘cohensd’
tails (int) – perform one or two-tailed p-value computations; default 2
return_dist (bool) – return the permutation distribution
n_jobs (int) – number of parallel cores to use for permutations; default 1
seed (int) – for reproducing results
- Returns:
original_stat (float): the original statistic
perm_p_val (float): the permuted p-value
perm_dist (np.array): array of permuted statistic; optional
- Return type:
Multiple
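Example
Permutation test of the difference between two samples x and y:
>>> from pymer4.stats import perm_test
>>> stat, p = perm_test(x, y, stat='tstat', n_perm=1000, seed=1)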
- pymer4.stats.rsquared(y, res, has_constant=True)[source]¶
Compute the R^2, coefficient of determination. This statistic is a ratio of “explained variance” to “total variance”
- Parameters:
y (np.ndarray) – 1d array of dependent variable
res (np.ndarray) – 1d array of residuals
has_constant (bool) – whether the fitted model included a constant (intercept)
- Returns:
coefficient of determination
- Return type:
float
- pymer4.stats.rsquared_adj(r, nobs, df_res, has_constant=True)[source]¶
Compute the adjusted R^2, coefficient of determination.
- Parameters:
r (float) – rsquared value
nobs (int) – number of observations the model was fit on
df_res (int) – degrees of freedom of the residuals (nobs - number of model params)
has_constant (bool) – whether the fitted model included a constant (intercept)
- Returns:
adjusted coefficient of determination
- Return type:
float
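Example
A minimal sketch computing both statistics from residuals; y and fits are hypothetical arrays, and the model is assumed to have 3 parameters:
>>> from pymer4.stats import rsquared, rsquared_adj
>>> res = y - fits
>>> r2 = rsquared(y, res)
>>> r2_adj = rsquared_adj(r2, nobs=len(y), df_res=len(y) - 3)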
- pymer4.stats.tost_equivalence(x, y, lower, upper, paired=False, equal_var=False, n_perm=1000, n_boot=5000, plot=False, seed=None)[source]¶
Function to perform equivalence testing using TOST: two one-sided tests (Lakens et al., 2018). This works by defining a lower and upper bound of an “equivalence” range for the mean difference between x and y. This is a user-defined range for a mean difference that one might not consider particularly meaningful; conceptually similar to the Bayesian “region of practical equivalence (ROPE).” Specifically, this uses two one-sided t-tests against the lower and upper bounds separately to find out whether lower < mean diff < upper. n_perm only controls the permutation for the original two-sided test.
- Parameters:
x (list-like) – array or list of observations from first group
y (list-like) – array or list of observations from second group
lower (float) – lower bound of equivalence region
upper (float) – upper bound of equivalence region
paired (bool) – whether to treat x and y as paired samples; default False
equal_var (bool) – should assume equal variances for t-stat and effect size calcs
n_perm (int) – number of times to permute groups; set to 0 to turn off
n_boot (int) – number of bootstrap samples for confidence intervals
plot (bool) – return an equivalence plot depicting where the mean difference and 95% CIs fall relative to the equivalence range
return_dists (bool) – optionally return the permuted distributions
- Returns:
a dictionary of TOST results
- Return type:
dict
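Example
Test whether the mean difference between x and y falls within a +/- 0.5 equivalence range:
>>> from pymer4.stats import tost_equivalence
>>> results = tost_equivalence(x, y, lower=-0.5, upper=0.5, n_boot=1000)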
- pymer4.stats.vif(df, has_intercept=True, exclude_intercept=True, tol=5.0, check_only=False)[source]¶
Compute variance inflation factors amongst columns of a dataframe to be used as a design matrix. Uses the same method as Matlab and R: the diagonal elements of the inverted correlation matrix. Prints a warning if any vifs are >= tol. If check_only is True it will only return a 1 if any vifs are higher than tol.
- Parameters:
df (pandas.DataFrame) – dataframe of design matrix output from patsy
has_intercept (bool) – whether the first column of the dataframe is the intercept
exclude_intercept (bool) – exclude the intercept from the computation, assuming the intercept is the first column; default True
tol (float) – tolerance check to print warning if any vifs exceed this value
check_only (bool) – restrict return to a dictionary of vifs that exceed tol only rather than all; default False
- Returns:
dictionary with keys as column names and values as vifs
- Return type:
dict
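Example
Check a fitted model's design matrix for multicollinearity (design_matrix is stored on fitted Lm/Lmer models):
>>> from pymer4.stats import vif
>>> vifs = vif(model.design_matrix, has_intercept=True)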
pymer4.utils: Utility Functions¶
Miscellaneous helper functions
- pymer4.utils.isPSD(mat, tol=1e-08)[source]¶
Check if a matrix is positive-semi-definite by virtue of all its eigenvalues being >= 0. The Cholesky decomposition is not used because np.linalg.cholesky fails on matrices with exactly 0-valued eigenvalues (whereas in Matlab this is not true), so that method is not appropriate for edge cases. Ref: https://goo.gl/qKWWzJ
- Parameters:
mat (np.ndarray) – 2d numpy array
- Returns:
whether the matrix is positive-semi-definite
- Return type:
bool
- pymer4.utils.nearestPSD(mat, nit=100)[source]¶
Higham (2000) algorithm to find the nearest positive semi-definite matrix that minimizes the Frobenius distance/norm. Statsmodels uses something very similar in corr_nearest(), but with spectral SGD to search for a local minimum. Reference: https://goo.gl/Eut7UU
- Parameters:
mat (np.ndarray) – 2d numpy array
nit (int) – number of iterations to run algorithm; more iterations improves accuracy but increases computation time.
- Returns:
closest positive-semi-definite 2d numpy array
- Return type:
np.ndarray
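Example
Repair a correlation matrix that is not positive-semi-definite, where mat is a hypothetical 2d numpy array:
>>> from pymer4.utils import isPSD, nearestPSD
>>> if not isPSD(mat):
...     mat = nearestPSD(mat, nit=100)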
- pymer4.utils.result_to_table(to_format, drop_intercept=True, iv_name='Predictor', comparison_name='b', ci_name='ci', round=True, pval_text='< .001', pval_thresh=0.001, fetch_name_col='index')[source]¶
Nicely format the .coefs attribute of a fitted model. The intended use of this function is to format the .coefs of a fitted model such that the resulting dataframe can be copied outside of python/jupyter or saved to another file (e.g. a Google Sheet). It’s particularly well suited for use with gspread_pandas.
- Parameters:
to_format (pymer4 model) – a pymer4 model object that has already been fit
drop_intercept (bool, optional) – remove the model intercept results from the table; Default True
iv_name (str, optional) – column name of the model’s independent variables. Defaults to “Predictor”.
round (bool, optional) – round all numeric values to 3 decimal places. Defaults to True.
pval_text (str, optional) – what to replace p-values with when they are < pval_thresh. Defaults to “< .001”.
pval_thresh (float, optional) – threshold to replace p-values with. Primarily intended to be used for very small p-values (e.g. .0001), where the tradition is to display ‘< .001’ instead of the exact p-values. Defaults to 0.001.
- Returns:
formatted dataframe of results
- Return type:
pd.DataFrame
Example
Send model results to a google sheet, assuming model.fit() has already been called:
>>> from gspread_pandas import Spread
>>> spread = Spread('My_Results_Sheet')
>>> formatted_results = result_to_table(model)
>>> spread.df_to_sheet(formatted_results, replace=True, index=False)
Now ‘My_Results_Sheet’ will have a copy of formatted_results, which can be copied and pasted into a Google Doc as a nice auto-updating table. On new model fits, simply repeat the steps above to replace the values in the Google Sheet, thus triggering an update of the linked table in the Google Doc.
pymer4.io: Save/Load Functions¶
Functions for persisting models to disk
- pymer4.io.load_model(filepath)[source]¶
Function for loading pymer4 models. A file path ending in .joblib should be provided. For Lmer models an additional filepath.robj should be located in the same directory.
- Parameters:
filepath (str) – full filepath string ending with .joblib
- pymer4.io.save_model(model, filepath, **kwargs)[source]¶
Function for saving pymer4 models. All models are saved using joblib.dump files so filepath extensions should end with .joblib. For Lmer models an additional filepath.robj file will be created to retain all R objects.
- Parameters:
model (pymer4.models) – an instance of a pymer4 model
filepath (str) – full filepath string ending with .joblib
kwargs – optional keyword arguments to joblib.dump
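Example
Persist a fitted model to disk and load it back:
>>> from pymer4.io import save_model, load_model
>>> save_model(model, 'mymodel.joblib')
>>> model = load_model('mymodel.joblib')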