class: center, middle, inverse, title-slide

# Model Estimation, Testing, and Reporting
## PSYC 575
### Mark Lai
### University of Southern California
### 2020/09/01 (updated: 2021-09-25)

---

`$$\newcommand{\bv}[1]{\boldsymbol{\mathbf{#1}}}$$`

# Week Learning Objectives

- Describe conceptually what the **likelihood function** and maximum likelihood estimation are
- Describe the differences between **maximum likelihood** and **restricted maximum likelihood**
- Conduct statistical tests for fixed effects
- Test fixed effects using the F-test with the **small-sample correction** when the number of clusters is small
- Use the **likelihood ratio test** to test random slopes

---

# Estimation

<img src="img/model_sample.png" width="621" height="40%" />

Regression: OLS

MLM: Maximum likelihood, Bayesian

---
class: middle

# Why should I learn about estimation methods?

--

- ## Understand software options

--

- ## Know when to use better methods

--

- ## Needed for reporting

---
class: inverse, middle, center

# Maximum Likelihood Estimation

---

### The most commonly used methods in MLM are
### maximum likelihood (ML) and restricted maximum likelihood (REML)

```
*># Linear mixed model fit by REML ['lmerMod']
># Formula: Reaction ~ Days + (Days | Subject)
>#    Data: sleepstudy
># REML criterion at convergence: 1743.628
># Random effects:
>#  Groups   Name        Std.Dev. Corr
>#  Subject  (Intercept) 24.741
>#           Days         5.922   0.07
>#  Residual             25.592
># Number of obs: 180, groups:  Subject, 18
># Fixed Effects:
># (Intercept)         Days
>#      251.41        10.47
```

--

## But what is "Likelihood"?

---

# Likelihood

.pull-left[

### Let's say we want to estimate the population mean math achievement score `\((\mu)\)`

We need to make some assumptions:

- Known *SD*: `\(\sigma = 8\)`

- The scores are normally distributed in the population

]

.pull-right[

<img src="05_model_estimation_and_testing_files/figure-html/lik-pop-1.png" width="100%" />

]

---

# Learning the Parameter From the Sample

Assume that we have scores from 5 representative students

| Student| Score|
|-------:|-----:|
|       1|    23|
|       2|    16|
|       3|     5|
|       4|    14|
|       5|     7|
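The likelihood computations on the next few slides can be reproduced with a few lines of R (a minimal sketch; `dnorm()` gives the normal density, with the *SD* fixed at the known value of 8):

```r
scores <- c(23, 16, 5, 14, 7)
# Joint likelihood of the five scores for a candidate value of mu
lik <- function(mu) prod(dnorm(scores, mean = mu, sd = 8))
lik(10)  # 4.20634e-08, as on the next slide
lik(13)  # 5.978414e-08
```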
---

# Likelihood

If we **assume** that `\(\mu = 10\)`, how likely is it that we would observe these five scores?

.pull-left[

<img src="05_model_estimation_and_testing_files/figure-html/lik-pop-10-1.png" width="90%" />

]

--

.pull-right[

| Student| Score| `\(P(Y_i = y_i \mid \mu = 10)\)`|
|-------:|-----:|----------------------------:|
|       1|    23| 0.0133173|
|       2|    16| 0.0376422|
|       3|     5| 0.0410201|
|       4|    14| 0.0440082|
|       5|     7| 0.0464819|

Multiplying them all together:

`$$P(Y_1 = 23, Y_2 = 16, Y_3 = 5, Y_4 = 14, Y_5 = 7 | \mu = 10)$$`

= product of the probabilities =

```r
prod(dnorm(c(23, 16, 5, 14, 7), mean = 10, sd = 8))
```

```
># [1] 4.20634e-08
```

]

---

# If `\(\mu = 13\)`

.pull-left[

<img src="05_model_estimation_and_testing_files/figure-html/lik-pop-13-1.png" width="100%" />

]

--

.pull-right[

| Student| Score| `\(P(Y_i = y_i \mid \mu = 13)\)`|
|-------:|-----:|----------------------------:|
|       1|    23| 0.0228311|
|       2|    16| 0.0464819|
|       3|     5| 0.0302463|
|       4|    14| 0.0494797|
|       5|     7| 0.0376422|

Multiplying them all together:

`$$P(Y_1 = 23, Y_2 = 16, Y_3 = 5, Y_4 = 14, Y_5 = 7 | \mu = 13)$$`

= product of the probabilities =

```r
prod(dnorm(c(23, 16, 5, 14, 7), mean = 13, sd = 8))
```

```
># [1] 5.978414e-08
```

]

---

Compute the likelihood for a range of `\(\mu\)` values

.pull-left[

# Likelihood Function

<img src="05_model_estimation_and_testing_files/figure-html/lik-func-1.png" width="90%" />

]

--

.pull-right[

# Log-Likelihood (LL) Function

<img src="05_model_estimation_and_testing_files/figure-html/llik-func-1.png" width="90%" />

]

---

.pull-left[

# Maximum Likelihood

`\(\hat \mu = 13\)` maximizes the (log-)likelihood function, so it is the **maximum likelihood estimate (MLE)** of `\(\mu\)`

]

--

.pull-right[

## Estimating `\(\sigma\)`

<img src="05_model_estimation_and_testing_files/figure-html/llik-func-sigma-1.png" width="90%" />

]

---

# Curvature and Standard Errors

The more sharply peaked the log-likelihood function, the more precise the estimate: with a larger sample, the curvature at the maximum is greater and the standard error is smaller

.pull-left[

`\(N = 5\)`

<img src="05_model_estimation_and_testing_files/figure-html/mle-ase1-1.png" width="90%" />

]

.pull-right[

`\(N = 20\)`

<img src="05_model_estimation_and_testing_files/figure-html/mle-ase2-1.png" width="90%" />

]

---
class: inverse, center, middle

# Estimation Methods for MLM

---

# For MLM

Find the `\(\gamma\)`s, `\(\tau\)`s, and `\(\sigma\)` that maximize the likelihood function

`$$\ell(\bv \gamma, \bv \tau, \sigma; \bv y) = - \frac{1}{2} \left\{\log | \bv V(\bv \tau, \sigma)| + (\bv y - \bv X \bv \gamma)^\top \bv V^{-1}(\bv \tau, \sigma) (\bv y - \bv X \bv \gamma) \right\} + K$$`

Here's the log-likelihood function for the coefficient of `meanses` (see the code in the provided Rmd):

.pull-left[

<img src="05_model_estimation_and_testing_files/figure-html/loglik-meanses-1.png" width="396" />

]

.pull-right[

<img src="05_model_estimation_and_testing_files/figure-html/deviance-gamma01-1.png" width="396" />

]

---

# Numerical Algorithms

.pull-left[

```r
m_lv2 <- lmer(mathach ~ meanses + (1 | id), data = hsball,
              verbose = 1)
```

```
># iteration: 1
>#   f(x) = 47022.519159
># iteration: 2
>#   f(x) = 47151.291766
># iteration: 3
>#   f(x) = 47039.480137
># iteration: 4
>#   f(x) = 46974.909593
># iteration: 5
>#   f(x) = 46990.872588
># iteration: 6
>#   f(x) = 46966.453125
># iteration: 7
>#   f(x) = 46961.719993
># iteration: 8
>#   f(x) = 46965.890703
># iteration: 9
>#   f(x) = 46961.367013
># iteration: 10
>#   f(x) = 46961.288830
># iteration: 11
>#   f(x) = 46961.298898
># iteration: 12
>#   f(x) = 46961.284848
># iteration: 13
>#   f(x) = 46961.285238
># iteration: 14
>#   f(x) = 46961.284845
># iteration: 15
>#   f(x) = 46961.284848
># iteration: 16
>#   f(x) = 46961.284845
```

]

.pull-right[

<img src="05_model_estimation_and_testing_files/figure-html/deviance-gamma01-2-1.png" width="396" />

]
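To see what the optimizer is doing, here is a toy sketch (not `lme4`'s actual algorithm) that applies a general-purpose optimizer to the earlier five-score example, minimizing the deviance, i.e., `\(-2 \mathrm{LL}\)`:

```r
scores <- c(23, 16, 5, 14, 7)
# Deviance (-2 log-likelihood) as a function of mu, with sigma fixed at 8
deviance_mu <- function(mu) {
  -2 * sum(dnorm(scores, mean = mu, sd = 8, log = TRUE))
}
# BFGS iteratively updates mu until the deviance stops decreasing
optim(par = 0, fn = deviance_mu, method = "BFGS")$par  # converges near 13
```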
width="396" /> ] --- # ML vs. REML REML has corrected degrees of freedom for the variance component estimates (like dividing by `\(N - 1\)` instead of by `\(N\)` in estimating variance) - REML is generally preferred in smaller samples - The difference is small with large number of clusters Technically speaking, REML only estimates the variance components<sup>1</sup> .footnote[ [1] The fixed effects are integrated out and are not part of the likelihood function. They are solved in a second step, usually by the generalized least squares (GLS) method ] --- .pull-left[ ### 160 Schools <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> REML </th> <th style="text-align:center;"> ML </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 12.649 </td> <td style="text-align:center;"> 12.650 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.149) </td> <td style="text-align:center;"> (0.148) </td> </tr> <tr> <td style="text-align:left;"> meanses </td> <td style="text-align:center;"> 5.864 </td> <td style="text-align:center;"> 5.863 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.361) </td> <td style="text-align:center;"> (0.359) </td> </tr> <tr> <td style="text-align:left;"> sd__(Intercept) </td> <td style="text-align:center;"> 1.624 </td> <td style="text-align:center;"> 1.610 </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1px"> sd__Observation </td> <td style="text-align:center;box-shadow: 0px 1px"> 6.258 </td> <td style="text-align:center;box-shadow: 0px 1px"> 6.258 </td> </tr> <tr> <td style="text-align:left;"> AIC </td> <td style="text-align:center;"> 46969.3 </td> <td style="text-align:center;"> 46967.1 </td> </tr> <tr> <td style="text-align:left;"> BIC </td> <td style="text-align:center;"> 46996.8 </td> <td style="text-align:center;"> 46994.6 </td> </tr> <tr> <td style="text-align:left;"> Log.Lik. 
---

.pull-left[

### 160 Schools

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:center;"> REML </th>
   <th style="text-align:center;"> ML </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 12.649 </td>
   <td style="text-align:center;"> 12.650 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.149) </td>
   <td style="text-align:center;"> (0.148) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> meanses </td>
   <td style="text-align:center;"> 5.864 </td>
   <td style="text-align:center;"> 5.863 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.361) </td>
   <td style="text-align:center;"> (0.359) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sd__(Intercept) </td>
   <td style="text-align:center;"> 1.624 </td>
   <td style="text-align:center;"> 1.610 </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1px"> sd__Observation </td>
   <td style="text-align:center;box-shadow: 0px 1px"> 6.258 </td>
   <td style="text-align:center;box-shadow: 0px 1px"> 6.258 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AIC </td>
   <td style="text-align:center;"> 46969.3 </td>
   <td style="text-align:center;"> 46967.1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> BIC </td>
   <td style="text-align:center;"> 46996.8 </td>
   <td style="text-align:center;"> 46994.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Log.Lik. </td>
   <td style="text-align:center;"> −23480.642 </td>
   <td style="text-align:center;"> −23479.554 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> REMLcrit </td>
   <td style="text-align:center;"> 46961.285 </td>
   <td style="text-align:center;"> </td>
  </tr>
</tbody>
</table>

]

--

.pull-right[

### 16 Schools

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:center;"> REML </th>
   <th style="text-align:center;"> ML </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 12.809 </td>
   <td style="text-align:center;"> 12.808 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.504) </td>
   <td style="text-align:center;"> (0.471) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> meanses </td>
   <td style="text-align:center;"> 6.577 </td>
   <td style="text-align:center;"> 6.568 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (1.281) </td>
   <td style="text-align:center;"> (1.197) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sd__(Intercept) </td>
   <td style="text-align:center;"> 1.726 </td>
   <td style="text-align:center;"> 1.581 </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1px"> sd__Observation </td>
   <td style="text-align:center;box-shadow: 0px 1px"> 5.944 </td>
   <td style="text-align:center;box-shadow: 0px 1px"> 5.944 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AIC </td>
   <td style="text-align:center;"> 4419.6 </td>
   <td style="text-align:center;"> 4422.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> BIC </td>
   <td style="text-align:center;"> 4437.7 </td>
   <td style="text-align:center;"> 4440.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Log.Lik. </td>
   <td style="text-align:center;"> −2205.796 </td>
   <td style="text-align:center;"> −2207.099 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> REMLcrit </td>
   <td style="text-align:center;"> 4411.591 </td>
   <td style="text-align:center;"> </td>
  </tr>
</tbody>
</table>

]

---

# Other Estimation Methods

### Generalized estimating equations (GEE)

- Robust to some misspecification and non-normality
- May be inefficient in small samples (i.e., lower power)
- See Snijders & Bosker 12.2; the `geepack` R package

### Markov Chain Monte Carlo (MCMC)/Bayesian

- Researchers set prior distributions for the parameters
    * Different from "empirical Bayes," where the prior comes from the data
- Does not depend on normality of the sampling distributions
    * More stable in small samples with the use of priors
- Can handle complex models
- See Snijders & Bosker 12.1; the `MCMCglmm` and the `brms` R packages

Example calls for both approaches are sketched on the next slide
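---

# Example: GEE and Bayesian Estimation

For reference, here is roughly what these alternatives look like in R (a sketch with default settings, assuming `hsball` is loaded; `geeglm()` expects rows to be sorted by cluster):

```r
library(geepack)
# GEE with clusters defined by `id` and an exchangeable working correlation
m_gee <- geeglm(mathach ~ meanses, id = id, data = hsball,
                corstr = "exchangeable")

library(brms)
# Bayesian MLM via MCMC; priors can be set with the `prior` argument
m_bayes <- brm(mathach ~ meanses + (1 | id), data = hsball)
```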
---
class: inverse, middle, center

# Testing

---

### Fixed effects `\((\gamma)\)`

- Usually the likelihood-based CI/likelihood-ratio (LRT; `\(\chi^2\)`) test is sufficient
    * Requires ML (as fixed effects are not part of the likelihood function in REML)
- Small sample (10--50 clusters): Kenward-Roger approximation of degrees of freedom
- Non-normality: Residual bootstrap<sup>1</sup>

### Random effects `\((\tau)\)`

- LRT (with `\(p\)` values divided by 2)

.footnote[

[1]: See [van der Leeden et al. (2008)](https://link-springer-com.libproxy1.usc.edu/chapter/10.1007/978-0-387-73186-5_11) and [Lai (2021)](https://doi.org/10.1080/00273171.2020.1746902)

]

---
class: inverse, middle, center

# Testing Fixed Effects

---

# Likelihood Ratio (Deviance) Test

`$$H_0: \gamma = 0$$`

--

Likelihood ratio: `\(\dfrac{L(\gamma = 0)}{L(\gamma = \hat \gamma)}\)`

Deviance: `\(-2 \times \log\left(\frac{L(\gamma = 0)}{L(\gamma = \hat \gamma)}\right)\)`
= `\(-2 \mathrm{LL}(\gamma = 0) - [-2 \mathrm{LL}(\gamma = \hat \gamma)]\)`
= `\(\mathrm{Deviance} \mid_{\gamma = 0} - \mathrm{Deviance} \mid_{\gamma = \hat \gamma}\)`

ML (instead of REML) should be used

---

# Example

```
...
># Linear mixed model fit by maximum likelihood  ['lmerMod']
># Formula: mathach ~ (1 | id)
>#      AIC      BIC   logLik deviance df.resid
># 47121.81 47142.45 -23557.91 47115.81     7182
...
```

```
...
># Linear mixed model fit by maximum likelihood  ['lmerMod']
># Formula: mathach ~ meanses + (1 | id)
>#      AIC      BIC   logLik deviance df.resid
># 46967.11 46994.63 -23479.55 46959.11     7181
...
```

```r
pchisq(47115.81 - 46959.11, df = 1, lower.tail = FALSE)
```

```
># [1] 5.952567e-36
```

In `lme4`, you can also use

```r
anova(m_lv2, ran_int)  # Automatically refits with ML
```

---

# Problem of LRT in Small Samples

.pull-left[

The LRT relies on the assumption that the deviance difference under the null follows a `\(\chi^2\)` distribution, which is unlikely to hold in small samples

- Inflated Type I error rates

E.g., with 16 schools,

- LRT critical value with `\(\alpha = .05\)`: 3.84
- Simulation-based critical value: 4.67

]

--

.pull-right[

<img src="05_model_estimation_and_testing_files/figure-html/est-vs-boot-samp-dist-1.png" width="80%" />

]

---

# `\(F\)` Test With Small-Sample Correction

It is based on the Wald test (not the LRT):

- `\(t = \hat{\gamma} / \hat{\mathrm{se}}(\hat{\gamma})\)`
- Or equivalently, `\(F = t^2\)` (for a one-parameter test)

The small-sample correction does two things:

- Adjust `\(\hat{\mathrm{se}}(\hat{\gamma})\)`, as it tends to be underestimated in small samples
- Determine the critical value based on an `\(F\)` distribution, with an approximate **denominator degrees of freedom (ddf)**

---

# Kenward-Roger (1997) Correction

Generally performs well with fewer than 50 clusters

.pull-left[

```r
# Wald
anova(m_contextual, ddf = "lme4")
```

```
># Analysis of Variance Table
>#         npar  Sum Sq Mean Sq F value
># meanses    1  860.08  860.08  26.400
># ses        1 1874.34 1874.34  57.533
```

]

.pull-right[

```r
# K-R
anova(m_contextual, ddf = "Kenward-Roger")
```

```
># Type III Analysis of Variance Table with Kenward-Roger's method
>#          Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)
># meanses  324.39  324.39     1  15.51  9.9573  0.006317 **
># ses     1874.34 1874.34     1 669.03 57.5331 1.116e-13 ***
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

]

For `meanses`, the critical value (and the `\(p\)` value) is determined based on an `\(F(1, 15.51)\)` distribution, which has a critical value of

```r
qf(.95, df1 = 1, df2 = 15.51)
```

```
># [1] 4.517161
```
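The same Kenward-Roger correction is also available from the `pbkrtest` package (a sketch; `m_null`, dropping `meanses`, is a hypothetical comparison model):

```r
library(pbkrtest)
# Hypothetical null model without `meanses`
m_null <- update(m_contextual, . ~ . - meanses)
# F test of `meanses` with Kenward-Roger-adjusted (denominator) df
KRmodcomp(m_contextual, m_null)
```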
---
class: inverse, middle, center

# Testing Random Effects

---

# LRT for Random Slopes

.pull-left[

### Should you include random slopes?

Theoretically yes, unless you're certain that the slopes are the same for every group

However, frequentist methods usually run into convergence problems with more than two random slopes

- Test the random slopes one by one, and identify which ones are needed
- Bayesian methods are better equipped for complex models

]

--

.pull-right[

### "One-tailed" LRT

The LRT `\((\chi^2)\)` is generally a two-tailed test. But for random slopes, `\(H_0: \tau^2_1 = 0\)` is a one-tailed hypothesis, because a variance cannot be negative

A quick solution is to divide the resulting `\(p\)` by 2<sup>1</sup>

.footnote[

[1]: Originally proposed by Snijders & Bosker; tested in simulation by LaHuis & Ferguson (2009, https://doi.org/10.1177/1094428107308984)

]

]

---

# Example: LRT for `\(\tau^2_1\)`

.pull-left[

```
...
># Formula: mathach ~ meanses + ses_cmc + (ses_cmc | id)
>#    Data: hsball
># REML criterion at convergence: 46557.65
...
```

```
...
># Formula: mathach ~ meanses + ses_cmc + (1 | id)
>#    Data: hsball
># REML criterion at convergence: 46568.58
...
```

]

--

.pull-right[

.center[

### G Matrix

`$$\begin{bmatrix} \tau^2_0 & \\ \tau_{01} & \tau^2_1 \\ \end{bmatrix}$$`

`$$\begin{bmatrix} \tau^2_0 & \\ {\color{red}0} & {\color{red}0} \\ \end{bmatrix}$$`

]

]

--

The difference in the REML criteria is 46568.58 - 46557.65 = 10.93, with `\(df = 2\)` because both `\(\tau_{01}\)` and `\(\tau^2_1\)` are constrained to zero

```r
pchisq(10.92681, df = 2, lower.tail = FALSE)
```

```
># [1] 0.004239097
```

Dividing by 2 gives the one-tailed `\(p \approx .0021\)`

---
class: inverse, middle, center

# Multilevel Bootstrap

---

A simulation-based approach to approximate the sampling distribution of fixed and random effects

- Useful for obtaining CIs
- Especially for statistics that are functions of fixed/random effects (e.g., `\(R^2\)`)

Parametric, residual, and cases bootstrap

--

![](img/types_bootstrap.png)

---

In my own work,<sup>1</sup> the residual bootstrap was found to perform best, especially when data are not normally distributed and when the number of clusters is small

See the R code for this week

.footnote[

[1]: Lai (2021, https://doi.org/10.1080/00273171.2020.1746902)

]
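---

# Example: Residual Bootstrap CIs

One way to run a residual bootstrap in R is with the `lmeresampler` package (a minimal sketch of an assumed workflow, using the random-slope model from earlier; `B` is the number of bootstrap replications):

```r
library(lme4)
library(lmeresampler)
m <- lmer(mathach ~ meanses + ses_cmc + (ses_cmc | id), data = hsball)
# Resample level-1 and level-2 residuals, refit, and collect fixed effects
boo <- bootstrap(m, .f = fixef, type = "residual", B = 1999)
confint(boo, type = "perc")  # percentile bootstrap CIs
```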