# Mixed Models / Hierarchical Models

Benedikt Ehinger
March 2017

### Overview

• Repeated measures
• Mixed Models

### Repeated measures

At first glance it looks like there is no difference between the groups.

But what if we have data from within the same subjects?

### Repeated measures

Between-subject variability can be removed by a repeated-measures design.

Never analyze repeated-measures data without taking the repeated measures into account (disaggregation leads to pseudoreplication)

### Why you have to account for repeated measures

Black = mean of the subject.

$$n_S=5$$ subjects

### Multiple trials per subject

with $$n_T =15$$ trials

### Sampling distributions

This sampling distribution reflects these exact 5 subjects, but it does not generalize!

### Sampling distributions

(pink for aggregate, purple for disaggregate).

There is a strong allure to choose the purple one: it is easier to get significant results!

$$n_{subject}(\text{aggregate}) = df + 1 = 5$$

$$n_{subject}(\text{pseudoreplication}) = df + 1 = 5 \cdot 5 = 25$$
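The degrees-of-freedom arithmetic can be checked with a small simulation (a Python sketch with made-up numbers, not the slide's actual data): pooling all trials as if they were independent observations inflates the degrees of freedom and makes the standard error look far smaller than it really is.

```python
import random
import statistics as stats

random.seed(0)
n_subjects, n_trials = 5, 5
# Between-subject variability dominates: each subject has their own true mean.
subject_means = [random.gauss(0.0, 2.0) for _ in range(n_subjects)]
trials = [[m + random.gauss(0.0, 0.5) for _ in range(n_trials)] for m in subject_means]

# Aggregate first: one mean per subject -> n = 5, df = 4
agg = [stats.mean(t) for t in trials]
se_agg = stats.stdev(agg) / len(agg) ** 0.5

# Pseudoreplication: pool all 25 trials as if independent -> df = 24
pooled = [x for t in trials for x in t]
se_pooled = stats.stdev(pooled) / len(pooled) ** 0.5

print(len(agg) - 1, len(pooled) - 1)  # df: 4 vs 24
print(se_agg, se_pooled)              # the pooled SE is misleadingly small
```

The pooled standard error shrinks by roughly $$\sqrt{n_T}$$ even though no new independent information was collected.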

## DON'T DO THIS, YOU KNOW (now) BETTER

### Always necessary to aggregate?

$\sigma_{within} >> \sigma_{between}$

Subjects are very similar

Not so necessary to aggregate first

A single parameter summarizes subject-means well

$\sigma_{within} << \sigma_{between}$

Subjects are very different

Totally necessary to aggregate first

$$n_{subject}$$ parameters summarize the subject-means well

### Should we trust all subjects equally?

Question: Estimate the mean of the population

### Should we trust all subjects equally?

We should take sample size and variance into account in our estimate of the population mean.

We should trust those subjects more where we are most certain about their central tendency.
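The precision-weighting idea can be sketched in a few lines of Python (hypothetical numbers, not the slide's data): weight each subject's mean by the inverse of its squared standard error, so uncertain subjects contribute less.

```python
# Each subject: (mean, standard error of that mean). Third subject is very noisy.
subjects = [(4.0, 0.5), (5.0, 0.5), (12.0, 4.0)]

# Unweighted mean treats all subjects equally.
unweighted = sum(m for m, _ in subjects) / len(subjects)

# Precision weighting: weight each subject by 1 / SE^2.
weights = [1.0 / se**2 for _, se in subjects]
weighted = sum(w * m for (m, _), w in zip(subjects, weights)) / sum(weights)

print(unweighted)  # 7.0 — pulled up by the noisy subject
print(weighted)    # about 4.56 — the noisy subject barely counts
```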

### Summary

• Why repeated measures
• More power to find effects that exist!
• Don't forget about repeated measures!
• Else your type I error will be very high
• Should we trust all subjects equally?
• No, we should penalize small within-subject sample sizes and high within-subject variance

### Overview

• An example to guide you through the steps
• The mixed model
• Inference
• REML vs ML
• Assumption checking
• Shrinkage
• Terminology
• Convergence Problems

### The example

EEG-Experiment: Contrast vs. P100 size

=> Very noisy subjects (babies?), between 4 and 29 trials



### one model for all subjects

$\hat{y} = \beta_0 + \beta_1 * contrast$

### one model for each subject

$\hat{y}_j = \beta_{0,j} + \beta_{1,j} * contrast$

### two-stage statistics

Note: Contrast-slope scaled by factor 10 for easier comparison
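The two-stage procedure can be sketched in Python on simulated data (the real example uses EEG data and R; subject count and the true slope of 0.1 follow the example, everything else is made up): stage 1 fits an ordinary regression per subject, stage 2 runs a one-sample t-test on the per-subject slopes.

```python
import random
import statistics as stats

random.seed(1)
true_slope = 0.1
xs = list(range(0, 100, 10))  # contrast levels

# Stage 1: fit an ordinary least-squares regression per subject, keep the slope.
slopes = []
for subj in range(20):
    b1 = true_slope + random.gauss(0.0, 0.02)  # subject-specific true slope
    ys = [b1 * x + random.gauss(0.0, 1.0) for x in xs]
    mx, my = stats.mean(xs), stats.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    slopes.append(slope)

# Stage 2: one-sample t-test of the per-subject slopes against 0.
m, sd = stats.mean(slopes), stats.stdev(slopes)
t = m / (sd / len(slopes) ** 0.5)
print(round(m, 3), round(t, 1))  # mean slope near 0.1, large t
```

Note that this sketch weights every subject equally, which is exactly the limitation the mixed model addresses.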

### Overview

• An example to guide you through the steps
• The mixed model
• Inference
• REML vs ML
• Assumption checking
• Shrinkage
• Terminology
• Convergence Problems

### One solution

Linear Mixed Model aka Hierarchical Model aka Multilevel Model

We estimate the population distribution directly. Subject-means with high uncertainty are less influential for the model fit

### The mixed model in R

$\hat{y}_j = \beta_0 + \beta_1 * contrast + u_{0,j} + u_{1,j} * contrast$

$\hat{y}_j = X_j\beta + Z_jb_j$

Fitting this in R using the package lme4

response ~ beta + (beta|group)

In our example:

response ~ 1 + contrast + (1 | subject) + (0 + contrast | subject)

### Mixed models in R

summary(lmer(p100~contrast+(1+contrast||subject),data=d))

Linear mixed model fit by REML ['lmerMod']
Formula: p100 ~ contrast + ((1 | subject) + (0 + contrast | subject))
Data: d

REML criterion at convergence: 2271.4

Scaled residuals:
Min       1Q   Median       3Q      Max
-2.71978 -0.62658 -0.01064  0.65739  2.91110

Random effects:
Groups    Name        Variance Std.Dev.
subject   (Intercept) 10.16741 3.1886
subject.1 contrast     0.01353 0.1163
Residual              54.45510 7.3794
Number of obs: 323, groups:  subject, 20

Fixed effects:
Estimate Std. Error t value
(Intercept)  4.03422    1.10536   3.650
contrast     0.09536    0.03045   3.131

Correlation of Fixed Effects:
(Intr)
contrast -0.324


gray: one model, color: individual model, black: mixed model

### Direct comparison of 2-stage vs mixed model

two-stage (a one-sample t-test on the per-subject contrast slopes; the reported intercept is their mean):

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06858    0.02474   2.772   0.0122 *


mixed-model:

            Estimate Std. Error t value
contrast     0.09526    0.02990   3.187


true: slope=0.1

### Overview

• The mixed model
• An example to guide you through the steps
• Inference
• REML vs ML
• Assumption checking
• Shrinkage
• Terminology
• Convergence Problems

### Model-Comparisons aka p-values

Unresolved problem!

One way (not the best*): model comparison for the fixed effects.

Be sure to keep the random-effects structure identical.

anova(mres0,mres1)


"refitting model(s) with ML (instead of REML)"

Data: d
Models:
mres0: p100 ~ 1 + (1 + contrast | subject)
mres1: p100 ~ 1 + contrast + (1 + contrast | subject)
Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
mres0  5 2286.5 2305.3 -1138.2   2276.5
mres1  6 2280.1 2302.7 -1134.0   2268.1 8.4095      1   0.003733 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

• currently bootstrapping or MCMC is recommended

### Intuition:

When estimating the variance parameter with maximum likelihood (ML), the estimate of $$\sigma$$ is biased downwards, because $$\mu$$ is unknown and has to be estimated from the same data.

### Solution:

Fit all fixed effects $$\mu$$ first, then estimate the variance parameters on the residuals.

This is called restricted maximum likelihood (REML). If $$n$$ is large, there is no difference between ML and REML. If $$n$$ is small, REML gives better predictive results.

If you do a model comparison where the fixed effects differ (the usual case), you have to use ML.
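The ML bias can be checked numerically (a Python sketch with simulated data; the numbers are made up): dividing the sum of squares by $$n$$ underestimates the variance by the factor $$(n-1)/n$$, while the REML-style correction divides by $$n-1$$.

```python
import random
import statistics as stats

random.seed(2)
true_var = 4.0
n = 5  # small n: the bias matters

ml_vars, reml_vars = [], []
for _ in range(20000):
    sample = [random.gauss(0.0, true_var ** 0.5) for _ in range(n)]
    m = stats.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    ml_vars.append(ss / n)          # ML: divide by n -> biased low
    reml_vars.append(ss / (n - 1))  # REML-style correction: divide by n - 1

print(round(stats.mean(ml_vars), 2))    # near 4 * (n-1)/n = 3.2
print(round(stats.mean(reml_vars), 2))  # near 4.0
```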

### Excursus: Bootstrapping

• resample from the data with replacement (or parametrically from the $$H_0$$ model) 1000 times
• calculate the statistic of interest each time
b1 <- bootMer(mres1, FUN = function(x) as.numeric(logLik(x)), nsim = nboot,seed = 1)
b2 <- bootMer(mres0, FUN = function(x) as.numeric(logLik(x)), nsim = nboot,seed = 1)
bootLL =  2 * b1$t - 2 * b2$t # likelihood-ratio statistic per bootstrap sample
quantile(bootLL, probs = c(0.025, 0.975))

     2.5%     97.5%
-7.358778  1.979873
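The resampling recipe above, sketched in Python for the simplest case (a percentile interval for a mean, on made-up data; `bootMer` does the parametric analogue for the mixed model):

```python
import random
import statistics as stats

random.seed(3)
data = [random.gauss(10.0, 2.0) for _ in range(30)]  # made-up sample

# Resample from the data with replacement, recompute the statistic each time.
boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]
    boot_means.append(stats.mean(resample))

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap statistics.
boot_means.sort()
lo, hi = boot_means[24], boot_means[974]
print(lo, hi)  # a 95% bootstrap CI for the mean
```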


### pbkrtest

Better & easier: a parametric-bootstrap likelihood-ratio test using the R package pbkrtest

summary(pbkrtest::PBmodcomp(mres1,mres0,nsim=nboot))

Parametric bootstrap test; time: 112.23 sec; samples: 200 extremes: 0;
large : p100 ~ 1 + contrast + (1 + contrast | subject)
small : p100 ~ 1 + (1 + contrast | subject)
stat     df    ddf  p.value
PBtest   8.3917               0.004975 **
Gamma    8.3917               0.005508 **
Bartlett 7.4900 1.0000        0.006204 **
F        8.3917 1.0000 18.614 0.009379 **
LRT      8.3917 1.0000        0.003769 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The main argument against bootstrapping is the high computational time (use at least 1000 samples, not 200!)

### Overview

• The mixed model
• An example to guide you through the steps
• Inference
• REML vs ML
• Assumption checking
• QQ-plots
• Level 1
• Level 2
• Shrinkage
• Terminology
• Convergence Problems

### Checking Assumptions: Normality using QQ-plots

Source: Sean Kross

### Checking Assumptions Level I

No strong non-linearity is visible in the residuals.

The QQ-plot indicates normality of the residuals.

### Checking assumptions Level II

In addition, test whether the normality assumption holds for the random effects:

QQ-Plots indicate normality of both random effects.
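The idea behind a QQ-plot can be sketched numerically (Python, simulated data): sort the sample and compare it against the theoretical normal quantiles; normality shows up as a near-perfect straight line, i.e. a correlation close to 1.

```python
import random
import statistics as stats

random.seed(4)
sample = sorted(random.gauss(0.0, 1.0) for _ in range(200))

# Theoretical standard-normal quantiles at the usual plotting positions.
nd = stats.NormalDist()
theo = [nd.inv_cdf((i + 0.5) / len(sample)) for i in range(len(sample))]

# A QQ-plot draws theo (x) against sample (y); for normal data the points
# fall on a straight line, so their correlation is close to 1.
mt, ms = stats.mean(theo), stats.mean(sample)
num = sum((t - mt) * (s - ms) for t, s in zip(theo, sample))
den = (sum((t - mt) ** 2 for t in theo) * sum((s - ms) ** 2 for s in sample)) ** 0.5
r = num / den
print(round(r, 4))
```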

### Overview

• The mixed model
• An example to guide you through the steps
• Inference
• REML vs ML
• Assumption checking
• Shrinkage
• Terminology
• Convergence Problems

### Shrinkage

Individual estimates with high variance, or far from the population mean, are shrunk towards the population estimate

gray: one model, color: individual model, black: mixed model
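A sketch of the shrinkage weight for a single subject mean in a random-intercept-only model (Python; the variance numbers are taken roughly from the lmer output above, $$\tau^2 \approx 10.2$$, $$\sigma^2 \approx 54.5$$, population mean $$\approx 4$$, while the subject means themselves are hypothetical): a subject's mean is pulled towards the population mean with weight $$\tau^2 / (\tau^2 + \sigma^2 / n_j)$$, so few-trial subjects are pulled hardest.

```python
# Variance components roughly from the lmer output above; subject means are made up.
tau2, sigma2, pop_mean = 10.2, 54.5, 4.0

# (observed subject mean, number of trials) — trial counts within the example's 4..29 range
subjects = [(6.0, 29), (15.0, 4)]

shrunk_means = []
for mean, n in subjects:
    # Reliability of this subject's mean: between-variance vs. noise of the mean
    w = tau2 / (tau2 + sigma2 / n)
    shrunk_means.append(w * mean + (1 - w) * pop_mean)

print([round(m, 2) for m in shrunk_means])  # [5.69, 8.71]
```

The 4-trial subject, despite an extreme observed mean, ends up much closer to the population mean than the 29-trial subject does.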

### Shrinkage over multiple parameters

Simulated data: subjects with faster reaction times (intercept) show larger effects of difficulty (slope)

In reverse: If I know somebody is generally slow, he should also have a larger slope!