nAGQ > 1 in lme4::glmer gives unexpected likelihood
Dimitris handled some of the practical issues, so I'm going to try to handle some of the underlying math issues based on my very rusty collection of measure theory .... hopefully I'll fix more "wrongs" than I create and not incur the wrath of the math-stats deities ....
On 26/04/2020 01:23, Rolf Turner wrote:
<SNIP>
OK. Let me display my ignorance: I was always under the impression
that likelihood is defined with respect to an *underlying measure*.
Change the measure and you get a different likelihood (with the log
likelihoods differing by a constant).
To beat it to death: Pr(X <= x) = \int_{-\infty}^x f(t) d\mu(t),
where f is the density (which can be interpreted as the likelihood)
of X *with respect to* the measure \mu().
E.g. if you have a random variable X with probability mass function
P(x;\theta) defined, say, on the positive integers, you could make use
of the atomic measure \mu_1() having point mass 1 at each positive
integer, or you could make use of the atomic measure \mu_2() having
point mass 1/2^n at each positive integer n.
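To make the point concrete, here is a small Python sketch (my own illustration, not from the thread) using a Poisson model on the non-negative integers. The density with respect to the second measure picks up a factor of 2^n, so the two log likelihoods differ by sum(n_i) * log(2) -- a constant in theta -- and the maximum likelihood estimate is unchanged:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(42)
data = rng.poisson(3.0, size=50)

def loglik_mu1(theta, x):
    # density w.r.t. the counting measure mu_1 (point mass 1 at each integer)
    return poisson.logpmf(x, theta).sum()

def loglik_mu2(theta, x):
    # density w.r.t. mu_2 (point mass 1/2^n at n): f_2(n) = P(n; theta) * 2^n
    return (poisson.logpmf(x, theta) + x * np.log(2)).sum()

thetas = np.linspace(1.0, 6.0, 201)
ll1 = [loglik_mu1(t, data) for t in thetas]
ll2 = [loglik_mu2(t, data) for t in thetas]
diff = np.array(ll2) - np.array(ll1)

# the difference is the same constant, sum(data) * log(2), for every theta ...
assert np.allclose(diff, data.sum() * np.log(2))
# ... so both versions are maximized at the same theta
print(thetas[np.argmax(ll1)], thetas[np.argmax(ll2)])
```

The constant depends on the data but not on theta, which is why changing the dominating measure never changes where the likelihood peaks.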
While the Kolmogorov axioms are expressed in terms of general measure theory, we almost always skip the Lebesgue integral and just use the Riemann integral, with the implied standard (Lebesgue) measure on the real line. While we certainly use other norms in stats (and the Lp norms are a great way to think about things like regularization, or even the deep relationship between mean, median and mode), we almost always think of probabilities in terms of the usual measure on the real line. On top of that, the "measures" you defined aren't the usual sorts of things you see in measure theory: you seem to have blurred the line between (or perhaps even convolved) your PDF and your measure. (I'm looking at my copy of Kolmogorov and Fomin's Introductory Real Analysis and really not wanting to take the time to make all of my statements rigorous.)

This brings me to what I suspect is the actual problem: the entire integral, including the associated measure, is what defines the probability model, and it is for a given probability model that the likelihood is unique. Indeed, that's the whole point of Fisher's work on likelihood: you create something that is a function of the data for a given probability model, including its parameters. Since we can't change the data, we change the parameters of the probability model to maximize the likelihood. (This is being slightly infelicitous about the fine print on P(x|θ) vs. L(θ|x) and the difference between the probability and the scoring function.)

The definition of (G)LMM that we use defines our probability model and thus a unique likelihood. The real fun comes in figuring out an efficient way to evaluate the profiled log likelihood, which is where a lot of the innovation in lme4 and MixedModels occurs -- using sparse matrix methods on a penalized least squares problem instead of generalized least squares.
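For concreteness (my notation, following the standard LMM formulation rather than anything specific in this thread), the probability model and the likelihood it pins down look like this:

```latex
% Linear mixed model: conditional distribution of the response given the
% random effects, plus the marginal distribution of the random effects:
%   (y | b) ~ N(X\beta + Zb, \sigma^2 I), \qquad b ~ N(0, \Sigma_\theta)
%
% The likelihood integrates the random effects out, so it is a function of
% the parameters alone for this fixed probability model:
L(\beta, \theta, \sigma \mid y)
  = \int_{\mathbb{R}^q} f_{y \mid b}(y \mid b;\, \beta, \sigma)\,
    f_b(b;\, \theta)\, \mathrm{d}b
```

Once the model (and hence this integral) is fixed, the likelihood is unique; the cleverness in lme4 and MixedModels is in evaluating and profiling it efficiently, not in choosing it.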
By the way, I think the problem that you're getting at with your measure example is actually something not dissimilar to importance sampling. One more thing to comment on below the fold ....
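To spell out that connection: importance sampling is exactly a change of dominating density, where draws from a proposal q are reweighted by the density ratio p/q. A minimal self-normalized sketch in Python (the target, proposal, and numbers are my own, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p = N(0, 1); proposal q = N(1, 1).  We estimate E_p[X] using
# draws from q, reweighted by the density ratio p(x)/q(x) -- the same
# change-of-measure idea as switching the dominating measure above.
x = rng.normal(loc=1.0, scale=1.0, size=100_000)
log_w = (-0.5 * x**2) - (-0.5 * (x - 1.0) ** 2)  # log p(x) - log q(x)
w = np.exp(log_w)

estimate = np.sum(w * x) / np.sum(w)  # self-normalized importance sampling
print(estimate)  # close to 0, the mean under the target p
```

The weights carry all the information about the mismatch between the two distributions, just as the density absorbs the change when you swap the underlying measure.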
<SNIP> Moreover, is it not the case that the likelihood that one gets from fitting a linear model with no random effects, using lm(), is not (directly) comparable with the likelihood obtained from fitting a linear mixed model using lmer()? (Whence one cannot test for the "significance" of a random effect by comparing the two fits via a likelihood ratio test, as one might perhaps naïvely hope to do.)
Yes, you can compare the likelihoods between LM and LMM fits, BUT the likelihood-ratio test will be problematic because the LM fit corresponds to an LMM fit with values at the edge of the parameter space (one or more variance components equal to zero). You could, however, bootstrap this test instead of assuming that the null distribution follows a chi-squared distribution with the usual number of degrees of freedom, and then everything would be fine. I even have some example code for this in Julia that I could potentially clean up and share. AIC/BIC don't suffer from the same issues as the LRT (because they don't assume a null distribution), but they have their own difficulties, and I don't want to spend too much time discussing model selection here, as enough ink and arXiv space have been spilled over it.

I hope that helps and that I have fixed more misunderstandings than I created.

Phillip
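As an aside on why the naive LRT misbehaves at the boundary: for a single variance component tested at zero, the classical result (Self and Liang; Stram and Lee) is that the asymptotic null distribution is a 50:50 mixture of a point mass at zero and a chi-squared with 1 df, not a plain chi-squared with 1 df. A small Python sketch of the resulting p-value correction (the LRT value here is made up purely for illustration):

```python
from scipy.stats import chi2

lrt = 2.71  # hypothetical observed likelihood-ratio statistic

# Naive p-value: treat the null distribution as chi-squared(1)
p_naive = chi2.sf(lrt, df=1)

# Boundary-corrected p-value under the 0.5*chi2(0) + 0.5*chi2(1) mixture;
# the chi2(0) component is a point mass at 0, contributing nothing for lrt > 0
p_mixture = 0.5 * chi2.sf(lrt, df=1)

print(p_naive, p_mixture)  # the naive p-value is exactly twice the corrected one
```

The parametric bootstrap mentioned above sidesteps the need to work out the right mixture at all, which is why it generalizes better to multiple variance components.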
Please enlighten me as to the ways in which my thinking is confused. cheers, Rolf