API Reference¶
pymer4.models.Lmer: Lmer¶
Model class for estimating lme4
multilevel models in python

class pymer4.models.Lmer(formula, data, family='gaussian')[source]¶
Model class to hold data outputted from fitting lmer in R and converting it to a Python object. This class stores as much information as it can about a merMod object computed using lmer and lmerTest in R. Most attributes will not be computed until the fit method is called.
 Parameters
formula (str) – Complete lmer-style model formula
data (pandas.core.frame.DataFrame) – input data
family (string) – what distribution family, i.e. link function, to use for the generalized model; default is gaussian (linear model)
 Variables
fitted (bool) – whether model has been fit
formula (str) – model formula
data (pd.DataFrame) – model copy of input data
grps (dict) – groups and number of observations per group recognized by lmer
design_matrix (pd.DataFrame) – model design matrix determined by lmer
AIC (float) – model Akaike information criterion
logLike (float) – model log-likelihood
family (string) – model family
warnings (list) – warnings output from R or Python
ranef (pd.DataFrame/list) – cluster-level differences from population parameters, i.e. difference between coefs and fixefs; returns list if multiple cluster variables are used to specify random effects (e.g. subjects and items)
fixef (pd.DataFrame/list) – cluster-level parameters; returns list if multiple cluster variables are used to specify random effects (e.g. subjects and items)
coefs (pandas.core.frame.DataFrame/list) – model summary table of population parameters
ranef_var (pd.DataFrame) – random effects variances
ranef_corr (pd.DataFrame) – random effects correlations
residuals (numpy.ndarray) – model residuals
fits (numpy.ndarray) – model fits/predictions
model_obj (lmer model) – rpy2 lmer model object
factors (dict) – factors used to fit the model if any
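The pieces above can be put together in a minimal end-to-end sketch. All data, column names (`DV`, `IV`, `Group`), and numbers below are hypothetical, and fitting requires a working R installation with lme4/lmerTest reachable through rpy2, so the model-fitting step is guarded:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: 10 groups x 20 observations each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Group": np.repeat(np.arange(10), 20),
    "IV": rng.normal(size=200),
})
# True slope of 2.0 plus a small random intercept per group
df["DV"] = 2.0 * df["IV"] + np.repeat(rng.normal(0, 0.5, 10), 20) + rng.normal(size=200)

try:
    from pymer4.models import Lmer  # needs R + lme4/lmerTest via rpy2
    model = Lmer("DV ~ IV + (IV | Group)", data=df)
    model.fit(summarize=False)
    print(model.coefs)       # population-level summary table
    print(model.ranef_var)   # random-effect variances
except ImportError:
    pass  # pymer4's R dependencies are not available in this environment
```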

anova(force_orthogonal=False)[source]¶
Return a Type III ANOVA table from a fitted model. Like R, this method does not ensure that contrasts are orthogonal, which is required for correct Type III SS computation. However, the force_orthogonal flag can refit the regression model with orthogonal polynomial contrasts, automatically guaranteeing valid Type III SS inferences. Note that this will overwrite factors specified in the last call to .fit()
 Parameters
force_orthogonal (bool) – whether factors in the model should be recoded using polynomial contrasts to ensure valid Type III SS calculations. If set to True, previous factor specifications will be saved in model.factors_prev_; default False
 Returns
Type 3 ANOVA results
 Return type
pd.DataFrame

fit(conf_int='Wald', n_boot=500, factors=None, permute=False, ordered=False, verbose=False, REML=True, rank=False, rank_group='', rank_exclude_cols=[], no_warnings=False, control='', old_optimizer=False, **kwargs)[source]¶
Main method for fitting the model object. Will modify the model's data attribute to add columns for residuals and fits for convenience. Factors should be specified as a dictionary with values as a list or themselves a dictionary of human-readable contrasts, not R-style contrast codes, as these will be auto-converted for you. See the factors docstring and examples below. After fitting, the .factors attribute will store a reference to the user-specified dictionary. The .contrast_codes model attribute will store the requested comparisons in converted R format. Note that Lmer estimate naming conventions differ a bit from R: Lmer.coefs = summary(model); Lmer.fixef = coef(model); Lmer.ranef = ranef(model)
 Parameters
conf_int (str) – which method to compute confidence intervals; ‘profile’, ‘Wald’ (default), or ‘boot’ (parametric bootstrap)
n_boot (int) – number of bootstrap intervals if bootstrapped confidence intervals are requested; default 500
factors (dict) – dictionary with column names specified as keys and values as either a list for dummy/treatment/polynomial contrasts or a dict with keys as factor levels and values as desired comparisons in human-readable format. See examples below
permute (int) – if non-zero, computes parameter significance tests by permuting test statistics rather than parametrically. Permutation is done by shuffling observations within clusters to respect the random effects structure of the data.
ordered (bool) – whether factors should be treated as ordered polynomial contrasts; this will parameterize a model with K-1 orthogonal polynomial regressors beginning with a linear contrast based on the factor order provided; default is False
summarize/summary (bool) – whether to print a model summary after fitting; default is True
verbose (bool) – whether to print when and which model and confidence interval are being fitted
REML (bool) – whether to fit using restricted maximum likelihood estimation instead of maximum likelihood estimation; default True
rank (bool) – convert predictors in the model formula to ranks by group prior to estimation. The model object will still contain the original data, not the ranked data; default False
rank_group (str) – column name to group data on prior to rank conversion
rank_exclude_cols (list/str) – columns in model formula to not apply rank conversion to
no_warnings (bool) – turn off auto-printing of warning messages; warnings are always stored in the .warnings attribute; default False
control (str) – string containing options to be passed to (g)lmer control. See https://bit.ly/2OQONTH for options
old_optimizer (bool) – use the old bobyqa optimizer that was the default in lme4 <= 1.1-20, i.e. prior to 02/04/2019. This is not compatible with the control setting as it's meant to be a quick shorthand (e.g. to reproduce previous model results). However, the same setting can be manually requested using the control option if preferred. (For optimizer change discussions see: https://bit.ly/2MrP9Nq and https://bit.ly/2Vx5jte )
 Returns
R/statsmodels style summary
 Return type
pd.DataFrame
Examples
The following examples demonstrate how to treat variables as categorical factors.
Dummy coding: Treat Col1 as a factor with 3 levels: A, B, C. Use dummy coding with A as the reference level. The model intercept will be the mean of A, and parameters will be B-A and C-A.
>>> model.fit(factors = {"Col1": ['A','B','C']})
Orthogonal polynomials: Treat Col1 as a factor with 3 levels: A, B, C. Estimate a linear contrast of C > B > A. The model intercept will be the grand mean of all levels, and parameters will be the linear contrast and an orthogonal polynomial contrast (auto-computed).
>>> model.fit(factors = {"Col1": ['A','B','C']}, ordered=True)
Custom contrast: Treat Col1 as a factor with 3 levels: A, B, C. Compare A to the mean of B and C. The model intercept will be the grand mean of all levels, and parameters will be the desired contrast as well as an automatically determined orthogonal contrast.
>>> model.fit(factors = {"Col1": {'A': 1, 'B': -.5, 'C': -.5}})
Here is an example specifying stricter deviance and parameter-value stopping criteria.
>>> model.fit(control="optCtrl = list(ftol_abs=1e-8, xtol_abs=1e-8)")
Here is an example specifying a different optimizer in addition to stricter deviance and parameter-value stopping criteria.
>>> model.fit(control="optimizer='Nelder_Mead', optCtrl = list(FtolAbs=1e-8, XtolRel=1e-8)")
Here is an example using the default optimization in previous versions of lme4 prior to the 2019 update.
>>> model.fit(old_optimizer=True)

plot(param, figsize=(8, 6), xlabel='', ylabel='', plot_fixef=True, plot_ci=True, grps=[], ax=None)[source]¶
Plot random and group-level parameters from a fitted model
 Parameters
param (str) – model parameter (column name) to plot
figsize (tup) – matplotlib desired figsize
xlabel (str) – xaxis label
ylabel (str) – yaxis label
plot_fixef (bool) – plot population effect fit of param?; default True
plot_ci (bool) – plot computed ci’s of population effect?; default True
grps (list) – plot specific group fits only; must correspond to index values in model.fixef
ax (matplotlib.axes.Axes) – axis handle for an existing plot; if provided will ensure that random parameter plots appear behind all other plot objects.
 Returns
matplotlib axis handle
 Return type
plt.axis

plot_summary(figsize=(12, 6), error_bars='ci', ranef=True, axlim=None, plot_intercept=True, ranef_alpha=0.5, coef_fmt='o', orient='v', ranef_idx=0)[source]¶
Create a forest plot overlaying estimated coefficients with random effects (i.e. BLUPs). By default displays the 95% confidence intervals computed during fitting.
 Parameters
error_bars (str) – one of ‘ci’ or ‘se’ to change which error bars are plotted; default ‘ci’
ranef (bool) – overlay BLUP estimates on figure; default True
axlim (tuple) – lower and upper limit of plot; default min and max of BLUPs
plot_intercept (bool) – plot the intercept estimate; default True
ranef_alpha (float) – opacity of random effect points; default .5
coef_fmt (str) – matplotlib marker style for population coefficients
ranef_idx (int) – if multiple random effects clusters were specified, this value indicates which one should be plotted; uses 0-based indexing; default 0 (first)
 Returns
matplotlib axis handle
 Return type
plt.axis

post_hoc(marginal_vars, grouping_vars=None, p_adjust='tukey', summarize=True, verbose=False)[source]¶
Post-hoc pairwise tests corrected for multiple comparisons (Tukey method) implemented using the emmeans package. This method provides both marginal means/trends along with marginal pairwise differences. More info can be found at: https://cran.r-project.org/web/packages/emmeans/emmeans.pdf
 Parameters
marginal_vars (str/list) – what variable(s) to compute marginal means/trends for; unique combinations of factor levels of these variable(s) will determine the family-wise error correction
grouping_vars (str/list) – what variable(s) to group on. Trends/means/comparisons of other variable(s) will be computed at each level of these variable(s)
p_adjust (str) – multiple comparisons adjustment method. One of: tukey, bonf, fdr, hochberg, hommel, holm, dunnett, mvt (Monte Carlo multivariate-t, aka exact tukey/dunnett). Default tukey
summarize (bool) – output effects and contrasts or don’t (always stored in model object as model.marginal_estimates and model.marginal_contrasts); default True
verbose (bool) – whether to print R messages to the console
 Returns
marginal_estimates (pd.DataFrame): unique factor level effects (e.g. means/coefs)
marginal_contrasts (pd.DataFrame): contrasts between factor levels
 Return type
Multiple
Examples
Pairwise comparison of means of A at each level of B
>>> model.post_hoc(marginal_vars='A',grouping_vars='B')
Pairwise differences of slopes of C between levels of A at each level of B
>>> model.post_hoc(marginal_vars='C',grouping_vars=['A','B'])
Pairwise differences of each unique A,B cell
>>> model.post_hoc(marginal_vars=['A','B'])

predict(data, use_rfx=False, pred_type='response', verbose=False)[source]¶
Make predictions given new data. Input must be a dataframe that contains the same columns as the model.matrix excluding the intercept (i.e. all the predictor variables used to fit the model). If using random effects to make predictions, input data must also contain a column for the group identifier(s) that were used to fit the model's random effects terms. Using random effects to make predictions only makes sense if predictions are being made about the same groups/clusters.
 Parameters
data (pandas.core.frame.DataFrame) – input data to make predictions on
use_rfx (bool) – whether to condition on random effects when making predictions
pred_type (str) – whether the prediction should be on the ‘response’ scale (default) or on the ‘link’ scale, i.e. the scale of the predictors passed through the link function (e.g. the log-odds scale in a logit model instead of probability values)
verbose (bool) – whether to print R messages to console
 Returns
prediction values
 Return type
np.ndarray

simulate(num_datasets, use_rfx=True, verbose=False)[source]¶
Simulate new responses based upon estimates from a fitted model. By default group/cluster means for simulated data will match those of the original data. Unlike predict, this is a non-deterministic operation because lmer will sample random-effects values for all groups/clusters and then sample data points from their respective conditional distributions.
 Parameters
num_datasets (int) – number of simulated datasets to generate. Each simulation always generates a dataset that matches the size of the original data
use_rfx (bool) – whether to match group/cluster means in simulated data
verbose (bool) – whether to print R messages to console
 Returns
simulated data values
 Return type
np.ndarray
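The sampling scheme described above can be sketched in plain numpy to build intuition. All estimates below are made up; the real method pulls them from the fitted lmer object:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical estimates from a fitted two-level gaussian model
beta = np.array([1.0, 0.5])       # population intercept and slope
ranef_sd = np.array([0.3, 0.1])   # random intercept/slope standard deviations
resid_sd = 1.0
n_grps, n_per = 5, 20

X = np.column_stack([np.ones(n_per), rng.normal(size=n_per)])
sims = []
for _ in range(n_grps):
    # Step 1: sample fresh random-effect values for this group...
    b = beta + rng.normal(0.0, ranef_sd)
    # Step 2: ...then sample responses from the group's conditional distribution
    sims.append(X @ b + rng.normal(0.0, resid_sd, size=n_per))
y_sim = np.concatenate(sims)  # one simulated dataset, same size as the original
```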
pymer4.models.Lm: Lm¶
Model class for estimating standard regression models

class pymer4.models.Lm(formula, data, family='gaussian')[source]¶
Model class to perform OLS regression. Formula specification works just like in R based on columns of a dataframe. Formulae are parsed by patsy, which makes it easy to utilize specific columns as factors. This is different from Lmer. See the patsy documentation for more information on the different use cases.
 Parameters
formula (str) – Complete lm-style model formula
data (pandas.core.frame.DataFrame) – input data
family (string) – what distribution family, i.e. link function, to use for the generalized model; default is gaussian (linear model)
 Variables
fitted (bool) – whether model has been fit
formula (str) – model formula
data (pd.DataFrame) – model copy of input data
design_matrix (pd.DataFrame) – model design matrix determined by patsy
AIC (float) – model Akaike information criterion
logLike (float) – model log-likelihood
family (string) – model family
warnings (list) – warnings output from Python
coefs (pd.DataFrame) – model summary table of parameters
residuals (numpy.ndarray) – model residuals
fits (numpy.ndarray) – model fits/predictions
estimator (string) – ‘OLS’ or ‘WLS’
se_type (string) – how standard errors are computed
sig_type (string) – how inference is performed

fit(robust=False, conf_int='standard', permute=False, rank=False, verbose=False, n_boot=500, n_jobs=1, n_lags=1, cluster=None, weights=None, wls_dof_correction=True, **kwargs)[source]¶
Fit a variety of OLS models. By default will fit a model that makes parametric assumptions (under a t-distribution) replicating the output of software like R. 95% confidence intervals (CIs) are also estimated parametrically by default. However, empirical bootstrapping can also be used to compute CIs; this procedure resamples with replacement from the data themselves, not residuals or data generated from fitted parameters, and will be used for inference unless permutation tests are requested. Permutation testing will shuffle observations to generate a null distribution of t-statistics to perform inference on each regressor (permuted t-test).
Alternatively, OLS robust to heteroscedasticity can be fit by computing sandwich standard error estimates (good ref: https://bit.ly/2VRb7jK). This is similar to Stata’s robust routine. Of the choices below, ‘hc1’ or ‘hc3’ are amongst the more popular. Robust estimators include:
‘hc0’: Huber (1967) original sandwich estimator
‘hc1’: Hinkley (1977) DOF adjustment to ‘hc0’ to account for small sample sizes (default)
‘hc2’: different kind of small-sample adjustment of ‘hc0’ by leverage values in the hat matrix
‘hc3’: MacKinnon and White (1985) HC3 sandwich estimator; provides more robustness in smaller samples than hc2 (Long & Ervin, 2000)
‘hac’: Newey-West (1987) estimator for robustness to heteroscedasticity as well as serial autocorrelation at given lags.
‘cluster’: cluster-robust standard errors (see Cameron & Miller 2015 for review). Provides robustness to errors that cluster according to specific groupings (e.g. repeated observations within a person/school/site). This acts as a post-modeling “correction” for what a multilevel model explicitly estimates and is popular in the econometrics literature. DOF correction differs slightly from stata/statsmodels, which use num_clusters - 1, whereas pymer4 uses num_clusters - num_coefs
Finally, weighted least squares (WLS) can be computed as an alternative to heteroscedasticity-robust standard errors. This can be estimated by providing an array or series of weights (1 / variance of each group) with the same length as the number of observations, or a column to use to compute group variances (which can be the same as the predictor column). This is often useful if some predictor(s) is categorical (e.g. dummy-coded) and taking into account unequal group variances is desired (i.e. in the simplest case this would be equivalent to performing Welch’s t-test).
 Parameters
robust (bool/str) – whether to use heteroscedasticity-robust s.e. and optionally which estimator type to use (‘hc0’, ‘hc1’, ‘hc2’, ‘hc3’, ‘hac’, ‘cluster’). If robust = True, default robust estimator is ‘hc1’; default False
conf_int (str) – whether confidence intervals should be computed through bootstrap (‘boot’) or assuming a tdistribution (‘standard’); default ‘standard’
permute (int) – if non-zero, computes parameter significance tests by permuting t-statistics rather than parametrically; works with robust estimators
rank (bool) – convert all predictors and dependent variable to ranks before estimating model; default False
summarize (bool) – whether to print a model summary after fitting; default True
verbose (bool) – whether to print which model, standard error, confidence interval, and inference type are being fitted
n_boot (int) – how many bootstrap resamples to use for confidence intervals (ignored unless conf_int=’boot’)
n_jobs (int) – number of cores for parallelizing bootstrapping or permutations; default 1
n_lags (int) – number of lags for robust estimator type ‘hac’ (ignored unless robust=’hac’); default 1
cluster (str) – column name identifying clusters/groups for robust estimator type ‘cluster’ (ignored unless robust=’cluster’)
weights (string/pd.Series/np.ndarray) – weights to perform WLS instead of OLS. Pass in a column name in data to use to compute group variances and automatically adjust dof. Otherwise provide an array or series containing 1 / variance of each observation, in which case dof correction will not occur.
wls_dof_correction (bool) – whether to apply WelchSatterthwaite approximate correction for dof when using weights based on an existing column in the data, ignored otherwise. Set to False to force standard dof calculation
 Returns
R/statsmodels style summary
 Return type
pd.DataFrame
Examples
Simple multiple regression model with parametric assumptions
>>> model = Lm('DV ~ IV1 + IV2', data=df)
>>> model.fit()
Same as above but with robust standard errors
>>> model.fit(robust='hc1')
Same as above but with clusterrobust standard errors. The cluster argument should refer to a column in the dataframe.
>>> model.fit(robust='cluster', cluster='Group')
Simple regression with a categorical predictor, i.e. a between-groups t-test assuming equal variances
>>> model = Lm('DV ~ Group', data=df)
>>> model.fit()
Same as above but don’t assume equal variances and have pymer4 compute the between group variances automatically, i.e. WLS (preferred).
>>> model.fit(weights='Group')
Manually compute the variance of each group and use the inverse of that as the weights. In this case WLS is estimated but dof correction won’t be applied because it’s not trivial to compute.
>>> weights = 1 / df.groupby("Group")['DV'].transform(np.var, ddof=1)
>>> model.fit(weights=weights)

predict(data)[source]¶
Make predictions given new data. Input must be a dataframe that contains the same columns as the model.matrix excluding the intercept (i.e. all the predictor variables used to fit the model). Will automatically use/ignore the intercept to make a prediction depending on whether it was/was not part of the original fitted model.
 Parameters
data (pd.DataFrame) – input data to make predictions on
 Returns
prediction values
 Return type
np.ndarray

summary()[source]¶
Summarize the output of a fitted model.
 Returns
R/statsmodels style summary
 Return type
pd.DataFrame

to_corrs(corr_type='semi', ztrans_corrs=False)[source]¶
Transform fitted model coefficients (excluding the intercept) to partial or semi-partial correlations with the dependent variable. This is useful for rescaling coefficients to a correlation scale (-1 to 1) and does not change how inferences are performed. Semi-partial correlations are computed as the correlation between a DV and each predictor after the influence of all other predictors has been regressed out from that predictor. They are interpretable in the same way as the original coefficients. Partial correlations reflect the unique variance a predictor explains in the DV, accounting for correlations between predictors and what is not explained by other predictors; this value is always >= the semi-partial correlation. They are not interpretable in the same way as the original coefficients. Partial correlations are computed as the correlations between a DV and each predictor after the influence of all other predictors has been regressed out from that predictor and the DV. Good ref: https://bit.ly/2GNwXh5
 Parameters
corr_type (string) – ‘semi’ or ‘partial’
ztrans_corrs (bool) – whether to Fisher z-transform (arctanh) correlations before reporting them; default False
 Returns
partial or semipartial correlations
 Return type
pd.Series
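The definitions above can be checked numerically. Below is a small numpy illustration of semi-partial vs. partial correlation on simulated data; it demonstrates the statistical idea, not pymer4's internal implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)        # correlated predictors
y = x1 + 0.5 * x2 + rng.normal(size=n)

# Semi-partial r for x1: correlate y with the part of x1 not explained by x2
X2 = np.column_stack([np.ones(n), x2])
resid_x1 = x1 - X2 @ np.linalg.lstsq(X2, x1, rcond=None)[0]
semi_r = np.corrcoef(y, resid_x1)[0, 1]

# Partial r additionally residualizes y on x2 as well
resid_y = y - X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
partial_r = np.corrcoef(resid_y, resid_x1)[0, 1]
# |partial_r| is always >= |semi_r|, as the docstring notes
```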
pymer4.models.Lm2: Lm2¶
Model class for estimating multilevel models in python using the summary-statistics approach

class pymer4.models.Lm2(formula, data, group, family='gaussian')[source]¶
Model class to perform two-stage OLS regression. Practically, this class fits a separate Lm() model to each cluster/group in the data and performs inference on the coefficients of each model (i.e. a 1-sample t-test per coefficient). The results from this second-level regression are reported. This is an alternative to using Lmer, as it implicitly allows intercepts and slopes to vary by group, however with no prior/smoothing/regularization on the random effects. See https://bit.ly/2SwHhQU and Gelman (2005). This approach may be less preferable to Lmer if the number of observations per group is few, but the number of groups is large, in which case the first-level estimates are much noisier and are not smoothed/regularized as in Lmer. It may be preferable when a “maximal” rfx Lmer model is not estimable. Formula specification works just like in R based on columns of a dataframe. Formulae are parsed by patsy, which makes it easy to utilize specific columns as factors. This is different from Lmer. See the patsy documentation for more information on the different use cases.
 Parameters
formula (str) – Complete lm-style model formula
data (pd.DataFrame) – input data
family (string) – what distribution family, i.e. link function, to use for the generalized model; default is gaussian (linear model)
group (list/string) – the grouping variable to use to run the 1stlevel regression; if a list is provided will run multiple levels feeding the coefficients from the previous level into the subsequent level
 Variables
fitted (bool) – whether model has been fit
formula (str) – model formula
data (pandas.core.frame.DataFrame) – model copy of input data
grps (dict) – groups and number of observations per group recognized by lmer
AIC (float) – model Akaike information criterion
logLike (float) – model log-likelihood
family (string) – model family
warnings (list) – warnings output from Python
fixef (pd.DataFrame) – cluster-level parameters
coefs (pd.DataFrame) – model summary table of population parameters
residuals (numpy.ndarray) – model residuals
fits (numpy.ndarray) – model fits/predictions
se_type (string) – how standard errors are computed
sig_type (string) – how inference is performed

fit(robust=False, conf_int='standard', permute=False, perm_on='tstat', rank=False, verbose=False, n_boot=500, n_jobs=1, n_lags=1, to_corrs=False, ztrans_corrs=True, cluster=None, **kwargs)[source]¶
Fit a variety of second-level OLS models; all first-level models are standard OLS. By default will fit a model that makes parametric assumptions (under a t-distribution) replicating the output of software like R. 95% confidence intervals (CIs) are also estimated parametrically by default. However, empirical bootstrapping can also be used to compute CIs, which will resample with replacement from the first-level regression estimates and use these CIs to perform inference unless permutation tests are requested. Permutation testing will perform a one-sample sign-flipped permutation test on the estimates directly (perm_on='coef') or the t-statistic (perm_on='tstat'). Permutation is a bit different from Lm, which always permutes based on the t-stat.
Heteroscedasticity-robust standard errors can also be computed, but these are applied at the second level, not at the first level. See the Lm() documentation for more information about robust standard errors.
 Parameters
robust (bool/str) – whether to use heteroscedasticity robust s.e. and optionally which estimator type to use (‘hc0’,’hc3’,’hac’,’cluster’). If robust = True, default robust estimator is ‘hc0’; default False
conf_int (str) – whether confidence intervals should be computed through bootstrap (‘boot’) or assuming a tdistribution (‘standard’); default ‘standard’
permute (int) – if non-zero, computes parameter significance tests by permuting t-statistics rather than parametrically; works with robust estimators
perm_on (str) – permute based on a null distribution of the ‘coef’ of firstlevel estimates or the ‘tstat’ of firstlevel estimates; default ‘tstat’
rank (bool) – convert all predictors and dependent variable to ranks before estimating model; default False
to_corrs (bool/string) – for each first-level model, estimate semi-partial or partial correlations instead of betas and perform inference over these correlation coefficients. Note this is different from Lm(); default False
ztrans_corrs (bool) – whether to Fisher z-transform (arctanh) first-level correlations before running the second-level model. Ignored if to_corrs is False; default True
summarize (bool) – whether to print a model summary after fitting; default True
verbose (bool) – whether to print which model, standard error, confidence interval, and inference type are being fitted
n_boot (int) – how many bootstrap resamples to use for confidence intervals (ignored unless conf_int=’boot’)
n_jobs (int) – number of cores for parallelizing bootstrapping or permutations; default 1
n_lags (int) – number of lags for robust estimator type ‘hac’ (ignored unless robust=’hac’); default 1
cluster (str) – column name identifying clusters/groups for robust estimator type ‘cluster’ (ignored unless robust=’cluster’)
 Returns
R style summary() table
 Return type
DataFrame
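The two-stage (summary-statistics) logic described above can be sketched without pymer4 in a few lines of numpy. The data are hypothetical, and the real class additionally supports robust SEs, bootstrapping, and permutation at the second level:

```python
import numpy as np

rng = np.random.default_rng(3)
n_grps, n_per = 12, 30

# Stage 1: fit an OLS slope separately within each group
slopes = []
for _ in range(n_grps):
    x = rng.normal(size=n_per)
    y = 0.8 * x + rng.normal(size=n_per)   # true slope of 0.8
    slopes.append(np.polyfit(x, y, 1)[0])
slopes = np.array(slopes)

# Stage 2: a one-sample t-test on the first-level slopes
t_stat = slopes.mean() / (slopes.std(ddof=1) / np.sqrt(n_grps))
```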

plot_summary(figsize=(12, 6), error_bars='ci', ranef=True, axlim=None, ranef_alpha=0.5, coef_fmt='o', orient='v', **kwargs)[source]¶
Create a forest plot overlaying estimated coefficients with first-level effects. By default displays the 95% confidence intervals computed during fitting.
 Parameters
error_bars (str) – one of ‘ci’ or ‘se’ to change which error bars are plotted; default ‘ci’
ranef (bool) – overlay BLUP estimates on figure; default True
axlim (tuple) – lower and upper limit of plot; default min and max of BLUPs
ranef_alpha (float) – opacity of random effect points; default .5
coef_fmt (str) – matplotlib marker style for population coefficients
 Returns
matplotlib axis handle
 Return type
plt.axis
pymer4.simulate: Simulation Functions¶
Functions for generating data for use with various model types

pymer4.simulate.easy_multivariate_normal(num_obs, num_features, corrs, mu=0.0, sigma=1.0, seed=None, forcePSD=True, return_new_corrs=False, nit=100)[source]¶
Function to more easily generate multivariate normal samples provided a correlation matrix or list of correlations (upper triangle of the correlation matrix) instead of a covariance matrix. Defaults to returning approximately standard normal (mu = 0; sigma = 1) variates. Unlike numpy, if the desired correlation matrix is not positive semi-definite, this function will by default issue a warning, find the nearest PSD correlation matrix, and generate data with this matrix. This new matrix can optionally be returned using the return_new_corrs argument.
 Parameters
num_obs (int) – number of observations/samples to generate (rows)
corrs (ndarray/list/float) – num_features x num_features 2d array, flattened numpy array of length (num_features * (num_features - 1)) / 2, or scalar for the same correlation on all off-diagonals
num_features (int) – number of features/variables/dimensions to generate (columns)
mu (float/list) – mean of each feature across observations; default 0.0
sigma (float/list) – sd of each feature across observations; default 1.0
forcePSD (bool) – whether to find and use the nearest positive semi-definite correlation matrix if the requested one is not PSD; default True
return_new_corrs (bool) – return the nearest correlation matrix that is positive semidefinite used to generate data; default False
nit (int) – number of iterations to search for the nearest positive semi-definite correlation matrix if the requested correlation matrix is not PSD; default 100
 Returns
2d numpy array of correlated data organized as num_obs x num_features
 Return type
np.ndarray
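The relationship between the flattened corrs argument and the full correlation matrix can be illustrated with plain numpy. This is a sketch of the idea, not the function's internals:

```python
import numpy as np

num_features = 3
corrs = [0.2, 0.4, 0.3]   # upper triangle: r12, r13, r23

# Rebuild the full symmetric correlation matrix from the flattened triangle
R = np.eye(num_features)
R[np.triu_indices(num_features, k=1)] = corrs
R = R + R.T - np.eye(num_features)

# With unit sigmas the correlation matrix doubles as the covariance matrix
rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=np.zeros(num_features), cov=R, size=1000)
```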

pymer4.simulate.simulate_lm(num_obs, num_coef, coef_vals=None, corrs=None, mus=0.0, sigmas=1.0, noise_params=(0, 1), family='gaussian', seed=None)[source]¶
Function to quickly simulate a regression model dataset with continuous predictors. Provided a number of observations, number of coefficients, and optionally correlations between predictors, means, and standard deviations of predictors, returns a pandas dataframe with simulated data that can be used to estimate a linear regression using Lm(). Using the family='binomial' argument can generate discrete dependent variable values for use with logistic regression.
Defaults to returning standard normal (mu = 0; sigma = 1) predictors with no explicit correlations.
 Parameters
num_obs (int) – number of total observations, i.e. rows of data
num_coef (int) – number of coefficients/regressors, i.e. columns of data
coef_vals (list,optional) – “true” values of coefficients to generate data. If not provided will be randomly generated. Must include a coefficient for the intercept as well (i.e. mean of data)
corrs (ndarray/list/float) – correlations between coefficients provided as a 2d num_coef x num_coef array, a 1d flattened numpy array/list of length (num_coef * (num_coef - 1)) / 2, or a float to be treated as the same correlation between all coefficients
mus (float/list/ndarray) – means of columns of predictors
sigmas (float/list/ndarray) – stds of columns of predictors
noise_params (tup, optional) – mean and std of noise added to simulated data
family (str) – distribution family for the dependent variable. Currently only ‘gaussian’ (continuous DV) or ‘binomial’ (discrete DV) are available.
seed (int) – seed for reproducible random number generation
 Returns
data (pd.DataFrame): dataframe organized as num_obs x num_coef
coefs (np.array): ground-truth coefficient values
 Return type
Multiple
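What simulate_lm produces can be approximated by hand, which also shows how the returned coefs relate to the data. This is a numpy sketch with made-up values, not the function itself:

```python
import numpy as np

rng = np.random.default_rng(5)
num_obs, num_coef = 250, 2
coefs = np.array([0.5, 1.0, -1.0])   # intercept + one value per predictor

# Standard-normal predictors, linear signal, additive gaussian noise
X = rng.normal(size=(num_obs, num_coef))
noise = rng.normal(0, 1, size=num_obs)   # noise_params=(0, 1)
y = coefs[0] + X @ coefs[1:] + noise
```

Refitting an OLS model to (X, y) should recover values close to coefs, which is exactly how the function's output is intended to be used with Lm().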

pymer4.simulate.simulate_lmm(num_obs, num_coef, num_grps, coef_vals=None, corrs=None, grp_sigmas=0.25, mus=0.0, sigmas=1.0, noise_params=(0, 1), family='gaussian', seed=None)[source]¶
Function to quickly simulate a multilevel regression model dataset with continuous predictors. Provided a number of observations, number of coefficients, number of groups/clusters, and optionally correlations between predictors, means, and standard deviations of predictors, returns a pandas dataframe with simulated data that can be used to estimate a multilevel model using Lmer(). Using the family='binomial' argument can generate discrete dependent variable values for use with logistic multilevel models.
Defaults to returning standard normal (mu = 0; sigma = 1) predictors with no explicit correlations and low variance between groups (sigma = .25).
 Parameters
num_obs (int) – number of observations per cluster/stratum/group
num_coef (int) – number of coefficients/regressors, i.e. columns of data
num_grps (int) – number of clusters/strata/groups
coef_vals (list, optional) – “true” values of coefficients used to generate the data. If not provided, they will be randomly generated. Must include a coefficient for the intercept as well (i.e. the mean of the data)
corrs (ndarray/list/float) – correlations between coefficients, provided as a 2d num_coef x num_coef array, a 1d flattened numpy array/list of length (num_features * (num_features - 1)) / 2, or a single float treated as the same correlation between all coefficients
grp_sigmas (int or list) – group-level std around population coefficient values; either a single value, in which case the same std is applied around all coefficients, or a list of per-coefficient stds; default .25
mus (float/list/ndarray) – means of columns of predictors
sigmas (float/list/ndarray) – stds of columns of predictors
noise_params (tup, optional) – mean and std of noise added to each group’s simulated data
family (str) – distribution family for the dependent variable. Currently only ‘gaussian’ (continuous DV) or ‘binomial’ (discrete DV) are available.
seed (int) – seed for reproducible random number generation
 Returns
data (pd.DataFrame): dataframe organized as num_obs x num_coef
blups (pd.DataFrame): ground-truth group/cluster level coefficients, organized as num_grps x num_coef (i.e. BLUPs)
coefs (np.array): ground-truth population-level coefficients
 Return type
Multiple
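The multilevel data-generating process can be sketched in pure Python: each group's coefficients (the BLUPs) are the population coefficients plus group-level Gaussian deviations. This is a stdlib-only illustration under assumed conventions, not pymer4's implementation:

```python
import random

def simulate_lmm_sketch(num_obs, num_coef, num_grps, grp_sigma=0.25,
                        noise_params=(0, 1), seed=None):
    """Sketch of the multilevel data-generating process (stdlib only)."""
    rng = random.Random(seed)
    pop_coefs = [rng.uniform(-1, 1) for _ in range(num_coef + 1)]  # intercept first
    blups, data = [], []
    for g in range(num_grps):
        # group-level deviation around each population coefficient
        g_coefs = [b + rng.gauss(0, grp_sigma) for b in pop_coefs]
        blups.append(g_coefs)
        for _ in range(num_obs):
            x = [rng.gauss(0, 1) for _ in range(num_coef)]
            y = (g_coefs[0]
                 + sum(b * xi for b, xi in zip(g_coefs[1:], x))
                 + rng.gauss(*noise_params))
            data.append((g, x, y))  # (group id, predictors, DV)
    return data, blups, pop_coefs
```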
pymer4.stats
: Statistics Functions¶
General purpose functions for various parametric and nonparametric statistical routines

pymer4.stats.
boot_func
(x, y=None, func=None, func_args={}, paired=False, n_boot=500, n_jobs=1, seed=None)[source]¶ Bootstrap an arbitrary function by resampling from x and y independently or jointly.
 Parameters
x (listlike) – list of values for first group
y (listlike) – list of values for second group; optional
func (callable) – function to apply to x (and y, if provided)
func_args (dict) – optional keyword arguments passed to func
paired (bool) – whether to resample x and y jointly or independently
n_boot (int) – number of bootstrap iterations
n_jobs (int) – number of parallel cores; default 1
seed (int) – seed for reproducible resampling
 Returns
original_stat (float): function result with given data
ci (np.array): lower and upper bounds of 95% confidence intervals
 Return type
Multiple
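A percentile bootstrap of an arbitrary statistic can be sketched with the standard library alone (the helper name and single-sample, serial design are illustrative assumptions; boot_func itself also handles a second sample and parallel cores):

```python
import random

def boot_sketch(x, func, n_boot=500, seed=None):
    """Percentile bootstrap of an arbitrary statistic (stdlib sketch)."""
    rng = random.Random(seed)
    original = func(x)
    # resample with replacement and recompute the statistic each time
    boots = sorted(func(rng.choices(x, k=len(x))) for _ in range(n_boot))
    # 95% percentile confidence interval from the bootstrap distribution
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot)]
    return original, (lo, hi)
```

For example, bootstrapping the mean: `boot_sketch(data, lambda v: sum(v) / len(v), seed=0)`.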

pymer4.stats.
cohens_d
(x, y=None, paired=False, n_boot=1000, equal_var=False, value=0, n_jobs=1, seed=None)[source]¶ Compute Cohen’s d for one or two samples (paired or independent). For paired samples Cohen’s Dz is computed (ref: https://bit.ly/2J54P61). If x and y are not the same size, this will use the same pooled SD calculation as Welch’s t-test to account for unequal variances. The unequal-variance calculation will almost always produce a smaller estimate than the standard formula, except as the variance of the group with fewer observations increases, in which case the estimate can be larger than the standard formula. This can be turned off with the equal_var=True argument. Percentile bootstrapped confidence intervals can also be returned.
 Parameters
x (listlike) – array or list of observations from first group
y (listlike) – array or list of observations from second group or second set of observations from the same group; optional
paired (bool) – whether to treat x and y (if provided) as paired or independent
n_boot (int) – number of bootstrap samples to run; set to 0 to skip computing
equal_var (bool) – whether to pool the standard deviation as in Welch’s t-test
value (float) – a value to test whether the effect size exceeds; effect size - value will be computed; default 0
n_jobs (int) – number of parallel cores to use for bootstrapping; default 1
seed (int or None) – numerical seed for reproducibility of bootstrapping
 Returns
effect_size (float): cohen’s d
ci (np.array): lower and upper bounds of 95% bootstrapped confidence intervals; optional
 Return type
Multiple
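The classic equal-variance (pooled-SD) form of Cohen's d can be written out directly; this is a stdlib sketch of the standard textbook formula for two independent samples, not pymer4's implementation (which also covers the paired and Welch-style cases):

```python
import math

def cohens_d_sketch(x, y):
    """Classic pooled-SD Cohen's d for two independent samples."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # unbiased sample variances
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    # pooled standard deviation, weighted by degrees of freedom
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd
```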

pymer4.stats.
discrete_inverse_logit
(arr)[source]¶ Apply a discretized inverse logit transform to an array of values. Useful for converting normally distributed values to binomial classes.
 Parameters
arr (np.array) – 1d numpy array
 Returns
transformed values
 Return type
np.array
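The transform amounts to mapping each value through the inverse logit (sigmoid) to get a probability, then drawing a 0/1 outcome; a stdlib sketch (the real function operates on numpy arrays, and the seeding parameter here is an added assumption):

```python
import math
import random

def discrete_inverse_logit_sketch(arr, seed=None):
    """Map real values to probabilities via the inverse logit, then draw 0/1 outcomes."""
    rng = random.Random(seed)
    # inverse logit: p = 1 / (1 + exp(-v))
    probs = [1.0 / (1.0 + math.exp(-v)) for v in arr]
    # one Bernoulli draw per probability yields a discrete (binomial) DV
    return [1 if rng.random() < p else 0 for p in probs]
```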

pymer4.stats.
perm_test
(x, y=None, stat='tstat', n_perm=1000, equal_var=False, tails=2, return_dist=False, n_jobs=1, seed=None)[source]¶ General purpose permutation test between two samples. Can handle a wide variety of permutation tests including t-test, paired t-test, mean difference test, Cohen’s d, Pearson r, and Spearman r.
 Parameters
x (listlike) – array or list of observations from first group
y (listlike) – array or list of observations from second group
stat (string) – one of [‘tstat’, ‘tstat-paired’, ‘mean’, ‘cohensd’, ‘pearsonr’, ‘spearmanr’]; ‘mean’ will just compute permutations on the difference between the mean of x and the mean of y. Useful if statistics are precomputed (e.g. x and y contain correlation values or t-stats).
n_perm (int) – number of permutations; set to 0 to return parametric results
equal_var (bool) – whether to assume equal variances for ‘tstat’ and ‘cohensd’
tails (int) – perform one- or two-tailed p-value computations; default 2
return_dist (bool) – return the permutation distribution
n_jobs (int) – number of parallel cores to use; default 1
seed (int) – for reproducing results
 Returns
original_stat (float): the original statistic
perm_p_val (float): the permuted p-value
perm_dist (np.array): array of permuted statistic; optional
 Return type
Multiple
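The 'mean' variant of the test has a simple core: shuffle group labels many times and count how often the permuted mean difference is at least as extreme as the observed one. A stdlib sketch under that assumption (pymer4's version also handles the other statistics, tails, and parallelism):

```python
import random

def perm_mean_diff_sketch(x, y, n_perm=1000, seed=None):
    """Two-sided permutation test on the difference of means (stdlib sketch)."""
    rng = random.Random(seed)
    observed = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute group membership
        perm = sum(pooled[:len(x)]) / len(x) - sum(pooled[len(x):]) / len(y)
        if abs(perm) >= abs(observed):
            hits += 1
    # add-one smoothing so the p-value is never exactly 0
    return observed, (hits + 1) / (n_perm + 1)
```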

pymer4.stats.
rsquared
(y, res, has_constant=True)[source]¶ Compute the R^2, coefficient of determination. This statistic is a ratio of “explained variance” to “total variance”
 Parameters
y (np.ndarray) – 1d array of dependent variable
res (np.ndarray) – 1d array of residuals
has_constant (bool) – whether the fitted model included a constant (intercept)
 Returns
coefficient of determination
 Return type
float
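Since R^2 = 1 - SS_res / SS_tot, the computation is short enough to sketch in full; when the model has no constant, the total sum of squares is conventionally uncentered (this sketch assumes that convention):

```python
def rsquared_sketch(y, res, has_constant=True):
    """R^2 = 1 - SS_res / SS_tot; without a constant, SS_tot is uncentered."""
    ss_res = sum(r ** 2 for r in res)
    if has_constant:
        mean_y = sum(y) / len(y)
        ss_tot = sum((v - mean_y) ** 2 for v in y)
    else:
        ss_tot = sum(v ** 2 for v in y)  # uncentered total sum of squares
    return 1.0 - ss_res / ss_tot
```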

pymer4.stats.
rsquared_adj
(r, nobs, df_res, has_constant=True)[source]¶ Compute the adjusted R^2, coefficient of determination.
 Parameters
r (float) – rsquared value
nobs (int) – number of observations the model was fit on
df_res (int) – degrees of freedom of the residuals (nobs - number of model parameters)
has_constant (bool) – whether the fitted model included a constant (intercept)
 Returns
adjusted coefficient of determination
 Return type
float
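One common formulation of the adjustment, penalizing by residual degrees of freedom, can be sketched as follows (this is an assumption about the exact formula used, standard but not confirmed by the source):

```python
def rsquared_adj_sketch(r, nobs, df_res, has_constant=True):
    """Adjusted R^2: penalize complexity via residual degrees of freedom.

    Standard textbook formula; may differ in detail from pymer4's version.
    """
    return 1.0 - (1.0 - r) * (nobs - int(has_constant)) / df_res
```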

pymer4.stats.
tost_equivalence
(x, y, lower, upper, paired=False, equal_var=False, n_perm=1000, n_boot=5000, plot=False, seed=None)[source]¶ Function to perform equivalence testing using TOST: two one-sided tests (Lakens et al, 2018). This works by defining a lower and upper bound of an “equivalence” range for the mean difference between x and y. This is a user-defined range for a mean difference that one might not consider particularly meaningful; conceptually similar to the Bayesian “region of practical equivalence” (ROPE). Specifically, this uses two one-sided t-tests against the lower and upper bounds separately to find out whether lower < mean diff < upper. n_perm only controls the permutation for the original two-sided test.
 Parameters
x (listlike) – array or list of observations from first group
y (listlike) – array or list of observations from second group
lower (float) – lower bound of equivalence region
upper (float) – upper bound of equivalence region
equal_var (bool) – whether to assume equal variances for t-stat and effect size calculations
n_perm (int) – number of times to permute groups; set to 0 to turn off
n_boot (int) – number of bootstrap samples for confidence intervals
plot (bool) – return an equivalence plot depicting where the mean difference and 95% CIs fall relative to the equivalence range
return_dists (bool) – optionally return the permuted distributions
 Returns
a dictionary of TOST results
 Return type
dict

pymer4.stats.
vif
(df, has_intercept=True, exclude_intercept=True, tol=5.0, check_only=False)[source]¶ Compute variance inflation factors amongst columns of a dataframe to be used as a design matrix. Uses the same method as Matlab and R: the diagonal elements of the inverted correlation matrix. Prints a warning if any vifs are >= tol. If check_only is True it will only return a 1 if any vifs are higher than tol.
 Parameters
df (pandas.DataFrame) – dataframe of design matrix output from patsy
has_intercept (bool) – whether the first column of the dataframe is the intercept
exclude_intercept (bool) – exclude intercept from computation and assumed intercept is the first column; default True
tol (float) – tolerance check to print warning if any vifs exceed this value
check_only (bool) – restrict return to a dictionary of vifs that exceed tol only rather than all; default False
 Returns
dictionary with keys as column names and values as vifs
 Return type
dict
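For the two-predictor case the diagonal of the inverted 2x2 correlation matrix reduces to 1 / (1 - r^2), which makes the computation easy to sketch with the stdlib (the helper name is illustrative; vif itself handles any number of columns via matrix inversion):

```python
def vif_two_columns_sketch(a, b):
    """For two predictors, both VIFs equal 1 / (1 - r^2), i.e. the diagonal
    elements of the inverted 2x2 correlation matrix."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    r = cov / (var_a * var_b) ** 0.5  # Pearson correlation
    return 1.0 / (1.0 - r ** 2)
```

Perfectly uncorrelated predictors give a VIF of exactly 1; as |r| approaches 1, the VIF diverges, signalling collinearity.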
pymer4.utils
: Utility Functions¶
Miscellaneous helper functions

pymer4.utils.
R2con
(arr)[source]¶ Convert an R-flavored contrast matrix to interpretable contrasts as would be specified by a user. Reference: https://goo.gl/E4Mms2
 Parameters
arr (np.ndarray) – 2d contrast matrix output from R’s contrasts() function.
 Returns
2d array organized as contrasts X factor levels
 Return type
np.ndarray

pymer4.utils.
con2R
(arr, names=None)[source]¶ Convert human-readable contrasts into the form R requires. Works like the make.contrasts() function from the gmodels package, in that it will auto-solve for the remaining orthogonal k-1 contrasts if fewer than k-1 contrasts are specified.
 Parameters
arr (np.ndarray) – 1d or 2d numpy array with each row reflecting a unique contrast and each column a factor level
names (list/np.ndarray) – optional list of contrast names which will cast the return object as a dataframe
 Returns
A 2d numpy array or dataframe useable with the contrasts argument of glmer

pymer4.utils.
isPSD
(mat, tol=1e-08)[source]¶ Check if a matrix is positive-semidefinite by virtue of all its eigenvalues being >= 0. The Cholesky decomposition does not work for edge cases because np.linalg.cholesky fails on matrices with exactly 0-valued eigenvalues, whereas in Matlab this is not true, so that method is not appropriate here. Ref: https://goo.gl/qKWWzJ
 Parameters
mat (np.ndarray) – 2d numpy array
 Returns
whether the matrix is positive-semidefinite
 Return type
bool
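The eigenvalue criterion is easy to see in the symmetric 2x2 case, where the eigenvalues have a closed form; a stdlib sketch of the check under that restriction (isPSD itself handles arbitrary square matrices via numpy's eigendecomposition):

```python
def is_psd_2x2_sketch(mat, tol=1e-8):
    """Check PSD for a symmetric 2x2 matrix via its closed-form eigenvalues."""
    a, b = mat[0][0], mat[0][1]
    d = mat[1][1]
    # eigenvalues of [[a, b], [b, d]] are (a+d)/2 +/- sqrt(((a-d)/2)^2 + b^2)
    mean = (a + d) / 2.0
    root = (((a - d) / 2.0) ** 2 + b ** 2) ** 0.5
    # smallest eigenvalue must be >= 0 (within tolerance)
    return (mean - root) >= -tol
```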

pymer4.utils.
nearestPSD
(mat, nit=100)[source]¶ Higham (2000) algorithm to find the nearest positive semi-definite matrix that minimizes the Frobenius distance/norm. Statsmodels uses something very similar in corr_nearest(), but with spectral SGD to search for a local minimum. Reference: https://goo.gl/Eut7UU
 Parameters
mat (np.ndarray) – 2d numpy array
nit (int) – number of iterations to run algorithm; more iterations improves accuracy but increases computation time.
 Returns
closest positivesemidefinite 2d numpy array
 Return type
np.ndarray
pymer4.io
: Save/Load Functions¶
Functions for persisting models to disk

pymer4.io.
load_model
(filepath)[source]¶ Function for loading pymer4 models. A file path ending in .h5 or .hdf5 should be provided. For Lmer models an additional filepath.robj should be located in the same directory.
 Parameters
filepath (str) – full filepath string ending with .h5 or .hdf5

pymer4.io.
save_model
(model, filepath, compression='zlib', **kwargs)[source]¶ Function for saving pymer4 models. All models are saved in .h5 or .hdf5 files so filepath extensions should include this. For Lmer models an additional filepath.robj file will be created to retain all R objects.
 Parameters
model (pymer4.models) – an instance of a pymer4 model
filepath (str) – full filepath string ending with .h5 or .hdf5
compression (string) – what kind of compression to use; zlib is the default which should be universally accessible, but for example ‘blosc’ will be faster and produce smaller files. See more here: https://bit.ly/33x9JD7
kwargs – optional keyword arguments to deepdish.io.save