class: center, middle, inverse, title-slide

# Model Estimation, Testing, and Reporting
## PSYC 575
### Mark Lai
### University of Southern California
### 2020/09/01 (updated: 2021-09-25)

---

`$$\newcommand{\bv}[1]{\boldsymbol{\mathbf{#1}}}$$`

# Week Learning Objectives

- Describe conceptually what the **likelihood function** and maximum likelihood estimation are
- Describe the differences between **maximum likelihood** and **restricted maximum likelihood**
- Conduct statistical tests for fixed effects
- Test fixed effects using the F-test with the **small-sample correction** when the number of clusters is small
- Use the **likelihood ratio test** to test random slopes

---

# Estimation

<img src="img/model_sample.png" width="621" height="40%" />

Regression: OLS

MLM: Maximum likelihood, Bayesian

---
class: middle

# Why should I learn about estimation methods?

--

- ## Understand software options

--

- ## Know when to use better methods

--

- ## Needed for reporting

---
class: inverse, middle, center

# Maximum Likelihood Estimation

---

### The most commonly used methods in MLM are
### maximum likelihood (ML) and restricted maximum likelihood (REML)

```
*># Linear mixed model fit by REML ['lmerMod']
># Formula: Reaction ~ Days + (Days | Subject)
>#    Data: sleepstudy
># REML criterion at convergence: 1743.628
># Random effects:
>#  Groups   Name        Std.Dev. Corr
>#  Subject  (Intercept) 24.741
>#           Days         5.922   0.07
>#  Residual             25.592
># Number of obs: 180, groups:  Subject, 18
># Fixed Effects:
># (Intercept)         Days
>#      251.41        10.47
```

--

## But what is "Likelihood"?

---

# Likelihood

.pull-left[

### Let's say we want to estimate the population mean math achievement score `\((\mu)\)`

We need to make some assumptions:

- Known *SD*: `\(\sigma = 8\)`

- The scores are normally distributed in the population

]

.pull-right[

<img src="05_model_estimation_and_testing_files/figure-html/lik-pop-1.png" width="100%" />

]

---

# Learning the Parameter From the Sample

Assume that we have scores from 5 representative students

| Student| Score|
|-------:|-----:|
|       1|    23|
|       2|    16|
|       3|     5|
|       4|    14|
|       5|     7|
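The likelihood computations on the next few slides can be reproduced with a few lines of R (a minimal sketch; `dnorm()` gives the normal density, with the *SD* fixed at the known value of 8):

```r
scores <- c(23, 16, 5, 14, 7)
# Joint likelihood of the five scores for a candidate value of mu
lik <- function(mu) prod(dnorm(scores, mean = mu, sd = 8))
lik(10)  # 4.20634e-08, as on the next slide
lik(13)  # 5.978414e-08
```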
---

# Likelihood

If we **assume** that `\(\mu = 10\)`, how likely is it that we would observe these five scores?

.pull-left[

<img src="05_model_estimation_and_testing_files/figure-html/lik-pop-10-1.png" width="90%" />

]

--

.pull-right[

| Student| Score| `\(P(Y_i = y_i \mid \mu = 10)\)`|
|-------:|-----:|----------------------------:|
|       1|    23| 0.0133173|
|       2|    16| 0.0376422|
|       3|     5| 0.0410201|
|       4|    14| 0.0440082|
|       5|     7| 0.0464819|

Multiplying them all together:

`$$P(Y_1 = 23, Y_2 = 16, Y_3 = 5, Y_4 = 14, Y_5 = 7 | \mu = 10)$$`

= product of the probabilities =

```r
prod(dnorm(c(23, 16, 5, 14, 7), mean = 10, sd = 8))
```

```
># [1] 4.20634e-08
```

]

---

# If `\(\mu = 13\)`

.pull-left[

<img src="05_model_estimation_and_testing_files/figure-html/lik-pop-13-1.png" width="100%" />

]

--

.pull-right[

| Student| Score| `\(P(Y_i = y_i \mid \mu = 13)\)`|
|-------:|-----:|----------------------------:|
|       1|    23| 0.0228311|
|       2|    16| 0.0464819|
|       3|     5| 0.0302463|
|       4|    14| 0.0494797|
|       5|     7| 0.0376422|

Multiplying them all together:

`$$P(Y_1 = 23, Y_2 = 16, Y_3 = 5, Y_4 = 14, Y_5 = 7 | \mu = 13)$$`

= product of the probabilities =

```r
prod(dnorm(c(23, 16, 5, 14, 7), mean = 13, sd = 8))
```

```
># [1] 5.978414e-08
```

]

---

Compute the likelihood for a range of `\(\mu\)` values

.pull-left[

# Likelihood Function

<img src="05_model_estimation_and_testing_files/figure-html/lik-func-1.png" width="90%" />

]

--

.pull-right[

# Log-Likelihood (LL) Function

<img src="05_model_estimation_and_testing_files/figure-html/llik-func-1.png" width="90%" />

]

---

.pull-left[

# Maximum Likelihood

`\(\hat \mu = 13\)` maximizes the (log-)likelihood function, so it is the **maximum likelihood estimate (MLE)** of `\(\mu\)`

]

--

.pull-right[

## Estimating `\(\sigma\)`

<img src="05_model_estimation_and_testing_files/figure-html/llik-func-sigma-1.png" width="90%" />

]

---

# Curvature and Standard Errors

The more sharply peaked the log-likelihood function, the more precise the estimate: with a larger sample, the curvature at the maximum is greater and the standard error is smaller

.pull-left[

`\(N = 5\)`

<img src="05_model_estimation_and_testing_files/figure-html/mle-ase1-1.png" width="90%" />

]

.pull-right[

`\(N = 20\)`

<img src="05_model_estimation_and_testing_files/figure-html/mle-ase2-1.png" width="90%" />

]

---
class: inverse, center, middle

# Estimation Methods for MLM

---

# For MLM

Find the `\(\gamma\)`s, `\(\tau\)`s, and `\(\sigma\)` that maximize the likelihood function

`$$\ell(\bv \gamma, \bv \tau, \sigma; \bv y) = - \frac{1}{2} \left\{\log | \bv V(\bv \tau, \sigma)| + (\bv y - \bv X \bv \gamma)^\top \bv V^{-1}(\bv \tau, \sigma) (\bv y - \bv X \bv \gamma) \right\} + K$$`

Here's the log-likelihood function for the coefficient of `meanses` (see the code in the provided Rmd):

.pull-left[

<img src="05_model_estimation_and_testing_files/figure-html/loglik-meanses-1.png" width="396" />

]

.pull-right[

<img src="05_model_estimation_and_testing_files/figure-html/deviance-gamma01-1.png" width="396" />

]

---

# Numerical Algorithms

.pull-left[

```r
m_lv2 <- lmer(mathach ~ meanses + (1 | id), data = hsball,
              verbose = 1)
```

```
># iteration: 1
>#   f(x) = 47022.519159
># iteration: 2
>#   f(x) = 47151.291766
># iteration: 3
>#   f(x) = 47039.480137
># iteration: 4
>#   f(x) = 46974.909593
># iteration: 5
>#   f(x) = 46990.872588
># iteration: 6
>#   f(x) = 46966.453125
># iteration: 7
>#   f(x) = 46961.719993
># iteration: 8
>#   f(x) = 46965.890703
># iteration: 9
>#   f(x) = 46961.367013
># iteration: 10
>#   f(x) = 46961.288830
># iteration: 11
>#   f(x) = 46961.298898
># iteration: 12
>#   f(x) = 46961.284848
># iteration: 13
>#   f(x) = 46961.285238
># iteration: 14
>#   f(x) = 46961.284845
># iteration: 15
>#   f(x) = 46961.284848
># iteration: 16
>#   f(x) = 46961.284845
```

]

.pull-right[

<img src="05_model_estimation_and_testing_files/figure-html/deviance-gamma01-2-1.png" width="396" />

]
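To see what the optimizer is doing, here is a toy sketch (not `lme4`'s actual algorithm) that applies a general-purpose optimizer to the earlier five-score example, minimizing the deviance, i.e., `\(-2 \mathrm{LL}\)`:

```r
scores <- c(23, 16, 5, 14, 7)
# Deviance (-2 log-likelihood) as a function of mu, with sigma fixed at 8
deviance_mu <- function(mu) {
  -2 * sum(dnorm(scores, mean = mu, sd = 8, log = TRUE))
}
# BFGS iteratively updates mu until the deviance stops decreasing
optim(par = 0, fn = deviance_mu, method = "BFGS")$par  # converges near 13
```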
width="396" /> ] --- # ML vs. REML REML has corrected degrees of freedom for the variance component estimates (like dividing by `\(N - 1\)` instead of by `\(N\)` in estimating variance) - REML is generally preferred in smaller samples - The difference is small with large number of clusters Technically speaking, REML only estimates the variance components<sup>1</sup> .footnote[ [1] The fixed effects are integrated out and are not part of the likelihood function. They are solved in a second step, usually by the generalized least squares (GLS) method ] --- .pull-left[ ### 160 Schools <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> REML </th> <th style="text-align:center;"> ML </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 12.649 </td> <td style="text-align:center;"> 12.650 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.149) </td> <td style="text-align:center;"> (0.148) </td> </tr> <tr> <td style="text-align:left;"> meanses </td> <td style="text-align:center;"> 5.864 </td> <td style="text-align:center;"> 5.863 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.361) </td> <td style="text-align:center;"> (0.359) </td> </tr> <tr> <td style="text-align:left;"> sd__(Intercept) </td> <td style="text-align:center;"> 1.624 </td> <td style="text-align:center;"> 1.610 </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1px"> sd__Observation </td> <td style="text-align:center;box-shadow: 0px 1px"> 6.258 </td> <td style="text-align:center;box-shadow: 0px 1px"> 6.258 </td> </tr> <tr> <td style="text-align:left;"> AIC </td> <td style="text-align:center;"> 46969.3 </td> <td style="text-align:center;"> 46967.1 </td> </tr> <tr> <td style="text-align:left;"> BIC </td> <td style="text-align:center;"> 46996.8 </td> <td style="text-align:center;"> 46994.6 </td> </tr> <tr> <td style="text-align:left;"> Log.Lik. 
---

.pull-left[

### 160 Schools

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:center;"> REML </th>
   <th style="text-align:center;"> ML </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 12.649 </td>
   <td style="text-align:center;"> 12.650 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.149) </td>
   <td style="text-align:center;"> (0.148) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> meanses </td>
   <td style="text-align:center;"> 5.864 </td>
   <td style="text-align:center;"> 5.863 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.361) </td>
   <td style="text-align:center;"> (0.359) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sd__(Intercept) </td>
   <td style="text-align:center;"> 1.624 </td>
   <td style="text-align:center;"> 1.610 </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1px"> sd__Observation </td>
   <td style="text-align:center;box-shadow: 0px 1px"> 6.258 </td>
   <td style="text-align:center;box-shadow: 0px 1px"> 6.258 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AIC </td>
   <td style="text-align:center;"> 46969.3 </td>
   <td style="text-align:center;"> 46967.1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> BIC </td>
   <td style="text-align:center;"> 46996.8 </td>
   <td style="text-align:center;"> 46994.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Log.Lik. </td>
   <td style="text-align:center;"> −23480.642 </td>
   <td style="text-align:center;"> −23479.554 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> REMLcrit </td>
   <td style="text-align:center;"> 46961.285 </td>
   <td style="text-align:center;"> </td>
  </tr>
</tbody>
</table>

]

--

.pull-right[

### 16 Schools

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:center;"> REML </th>
   <th style="text-align:center;"> ML </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 12.809 </td>
   <td style="text-align:center;"> 12.808 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.504) </td>
   <td style="text-align:center;"> (0.471) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> meanses </td>
   <td style="text-align:center;"> 6.577 </td>
   <td style="text-align:center;"> 6.568 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (1.281) </td>
   <td style="text-align:center;"> (1.197) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sd__(Intercept) </td>
   <td style="text-align:center;"> 1.726 </td>
   <td style="text-align:center;"> 1.581 </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1px"> sd__Observation </td>
   <td style="text-align:center;box-shadow: 0px 1px"> 5.944 </td>
   <td style="text-align:center;box-shadow: 0px 1px"> 5.944 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AIC </td>
   <td style="text-align:center;"> 4419.6 </td>
   <td style="text-align:center;"> 4422.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> BIC </td>
   <td style="text-align:center;"> 4437.7 </td>
   <td style="text-align:center;"> 4440.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Log.Lik. </td>
   <td style="text-align:center;"> −2205.796 </td>
   <td style="text-align:center;"> −2207.099 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> REMLcrit </td>
   <td style="text-align:center;"> 4411.591 </td>
   <td style="text-align:center;"> </td>
  </tr>
</tbody>
</table>

]

---

# Other Estimation Methods

### Generalized estimating equations (GEE)

- Robust to some misspecification and non-normality
- May be inefficient in small samples (i.e., lower power)
- See Snijders & Bosker 12.2; the `geepack` R package

### Markov Chain Monte Carlo (MCMC)/Bayesian

- Researchers set prior distributions for the parameters
    * Different from "empirical Bayes," where the prior comes from the data
- Does not depend on normality of the sampling distributions
    * More stable in small samples with the use of priors
- Can handle complex models
- See Snijders & Bosker 12.1; the `MCMCglmm` and the `brms` R packages

Example calls for both approaches are sketched on the next slide
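---

# Example: GEE and Bayesian Estimation

For reference, here is roughly what these alternatives look like in R (a sketch with default settings, assuming `hsball` is loaded; `geeglm()` expects rows to be sorted by cluster):

```r
library(geepack)
# GEE with clusters defined by `id` and an exchangeable working correlation
m_gee <- geeglm(mathach ~ meanses, id = id, data = hsball,
                corstr = "exchangeable")

library(brms)
# Bayesian MLM via MCMC; priors can be set with the `prior` argument
m_bayes <- brm(mathach ~ meanses + (1 | id), data = hsball)
```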
---
class: inverse, middle, center

# Testing

---

### Fixed effects `\((\gamma)\)`

- Usually the likelihood-based CI/likelihood-ratio (LRT; `\(\chi^2\)`) test is sufficient
    * Requires ML (as fixed effects are not part of the likelihood function in REML)
- Small sample (10--50 clusters): Kenward-Roger approximation of degrees of freedom
- Non-normality: Residual bootstrap<sup>1</sup>

### Random effects `\((\tau)\)`

- LRT (with `\(p\)` values divided by 2)

.footnote[

[1]: See [van der Leeden et al. (2008)](https://link-springer-com.libproxy1.usc.edu/chapter/10.1007/978-0-387-73186-5_11) and [Lai (2021)](https://doi.org/10.1080/00273171.2020.1746902)

]

---
class: inverse, middle, center

# Testing Fixed Effects

---

# Likelihood Ratio (Deviance) Test

`$$H_0: \gamma = 0$$`

--

Likelihood ratio: `\(\dfrac{L(\gamma = 0)}{L(\gamma = \hat \gamma)}\)`

Deviance: `\(-2 \times \log\left(\frac{L(\gamma = 0)}{L(\gamma = \hat \gamma)}\right)\)`
= `\(-2 \mathrm{LL}(\gamma = 0) - [-2 \mathrm{LL}(\gamma = \hat \gamma)]\)`
= `\(\mathrm{Deviance} \mid_{\gamma = 0} - \mathrm{Deviance} \mid_{\gamma = \hat \gamma}\)`

ML (instead of REML) should be used

---

# Example

```
...
># Linear mixed model fit by maximum likelihood  ['lmerMod']
># Formula: mathach ~ (1 | id)
>#      AIC      BIC   logLik deviance df.resid
># 47121.81 47142.45 -23557.91 47115.81     7182
...
```

```
...
># Linear mixed model fit by maximum likelihood  ['lmerMod']
># Formula: mathach ~ meanses + (1 | id)
>#      AIC      BIC   logLik deviance df.resid
># 46967.11 46994.63 -23479.55 46959.11     7181
...
```

```r
pchisq(47115.81 - 46959.11, df = 1, lower.tail = FALSE)
```

```
># [1] 5.952567e-36
```

In `lme4`, you can also use

```r
anova(m_lv2, ran_int)  # Automatically refits with ML
```

---

# Problem of LRT in Small Samples

.pull-left[

The LRT relies on the assumption that the deviance difference under the null follows a `\(\chi^2\)` distribution, which is unlikely to hold in small samples

- Inflated Type I error rates

E.g., with 16 schools,

- LRT critical value with `\(\alpha = .05\)`: 3.84
- Simulation-based critical value: 4.67

]

--

.pull-right[

<img src="05_model_estimation_and_testing_files/figure-html/est-vs-boot-samp-dist-1.png" width="80%" />

]

---

# `\(F\)` Test With Small-Sample Correction

It is based on the Wald test (not the LRT):

- `\(t = \hat{\gamma} / \hat{\mathrm{se}}(\hat{\gamma})\)`
- Or equivalently, `\(F = t^2\)` (for a one-parameter test)

The small-sample correction does two things:

- Adjust `\(\hat{\mathrm{se}}(\hat{\gamma})\)`, as it tends to be underestimated in small samples
- Determine the critical value based on an `\(F\)` distribution, with an approximate **denominator degrees of freedom (ddf)**

---

# Kenward-Roger (1997) Correction

Generally performs well with fewer than 50 clusters

.pull-left[

```r
# Wald
anova(m_contextual, ddf = "lme4")
```

```
># Analysis of Variance Table
>#         npar  Sum Sq Mean Sq F value
># meanses    1  860.08  860.08  26.400
># ses        1 1874.34 1874.34  57.533
```

]

.pull-right[

```r
# K-R
anova(m_contextual, ddf = "Kenward-Roger")
```

```
># Type III Analysis of Variance Table with Kenward-Roger's method
>#          Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)
># meanses  324.39  324.39     1  15.51  9.9573  0.006317 **
># ses     1874.34 1874.34     1 669.03 57.5331 1.116e-13 ***
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

]

For `meanses`, the critical value (and the `\(p\)` value) is determined based on an `\(F(1, 15.51)\)` distribution, which has a critical value of

```r
qf(.95, df1 = 1, df2 = 15.51)
```

```
># [1] 4.517161
```
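The same Kenward-Roger correction is also available from the `pbkrtest` package (a sketch; `m_null`, dropping `meanses`, is a hypothetical comparison model):

```r
library(pbkrtest)
# Hypothetical null model without `meanses`
m_null <- update(m_contextual, . ~ . - meanses)
# F test of `meanses` with Kenward-Roger-adjusted (denominator) df
KRmodcomp(m_contextual, m_null)
```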
---
class: inverse, middle, center

# Testing Random Effects

---

# LRT for Random Slopes

.pull-left[

### Should you include random slopes?

Theoretically yes, unless you're certain that the slopes are the same for every group

However, frequentist methods usually run into convergence problems with more than two random slopes

- Test the random slopes one by one, and identify which ones are needed
- Bayesian methods are better equipped for complex models

]

--

.pull-right[

### "One-tailed" LRT

The LRT `\((\chi^2)\)` is generally a two-tailed test. But for random slopes, `\(H_0: \tau^2_1 = 0\)` is a one-tailed hypothesis, because a variance cannot be negative

A quick solution is to divide the resulting `\(p\)` by 2<sup>1</sup>

.footnote[

[1]: Originally proposed by Snijders & Bosker; tested in simulation by LaHuis & Ferguson (2009, https://doi.org/10.1177/1094428107308984)

]

]

---

# Example: LRT for `\(\tau^2_1\)`

.pull-left[

```
...
># Formula: mathach ~ meanses + ses_cmc + (ses_cmc | id)
>#    Data: hsball
># REML criterion at convergence: 46557.65
...
```

```
...
># Formula: mathach ~ meanses + ses_cmc + (1 | id)
>#    Data: hsball
># REML criterion at convergence: 46568.58
...
```

]

--

.pull-right[

.center[

### G Matrix

`$$\begin{bmatrix} \tau^2_0 & \\ \tau_{01} & \tau^2_1 \\ \end{bmatrix}$$`

`$$\begin{bmatrix} \tau^2_0 & \\ {\color{red}0} & {\color{red}0} \\ \end{bmatrix}$$`

]

]

--

The difference in the REML criteria is 46568.58 - 46557.65 = 10.93, with `\(df = 2\)` because both `\(\tau_{01}\)` and `\(\tau^2_1\)` are constrained to zero

```r
pchisq(10.92681, df = 2, lower.tail = FALSE)
```

```
># [1] 0.004239097
```

Dividing by 2 gives the one-tailed `\(p \approx .0021\)`

---
class: inverse, middle, center

# Multilevel Bootstrap

---

A simulation-based approach to approximate the sampling distribution of fixed and random effects

- Useful for obtaining CIs
- Especially for statistics that are functions of fixed/random effects (e.g., `\(R^2\)`)

Parametric, residual, and cases bootstrap

--

![](img/types_bootstrap.png)

---

In my own work,<sup>1</sup> the residual bootstrap was found to perform best, especially when data are not normally distributed and when the number of clusters is small

See the R code for this week

.footnote[

[1]: Lai (2021, https://doi.org/10.1080/00273171.2020.1746902)

]
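---

# Example: Residual Bootstrap CIs

One way to run a residual bootstrap in R is with the `lmeresampler` package (a minimal sketch of an assumed workflow, using the random-slope model from earlier; `B` is the number of bootstrap replications):

```r
library(lme4)
library(lmeresampler)
m <- lmer(mathach ~ meanses + ses_cmc + (ses_cmc | id), data = hsball)
# Resample level-1 and level-2 residuals, refit, and collect fixed effects
boo <- bootstrap(m, .f = fixef, type = "residual", B = 1999)
confint(boo, type = "perc")  # percentile bootstrap CIs
```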