"nesting for free" in lme4. Do I have this right?
I agree that the terminology could go either way. FWIW the GLMM FAQ uses Thierry's convention, i.e. "If the lower-level random effect has the same labels within each larger group (e.g. blocks 1, 2, 3, 4 within sites A, B, and C) then the explicit nesting (1|a/b) is required. It seems to be considered best practice to code the nested level uniquely (e.g. A1, A2, ?, B1, B2, ?) so that confusion between nested and crossed effects is less likely" (similarly in my chapter 14 of Fox et al's Ecological Statistics book). On Fri, Apr 6, 2018 at 10:47 AM, Jake Westfall
<jake.a.westfall at gmail.com> wrote:
Hi Thierry, I see your point. I usually think of implicit vs. explicit nesting in terms of how the factor levels are labeled in the dataset: the levels are explicitly nested if you can see the nesting just from looking at the data frame, and they're implicitly nested if you can't tell that they're nested just from looking at the data frame--because they appear to be crossed--instead you have to have outside, background knowledge about the dataset. If you instead view the issue from the perspective of the syntax, then yes, it would make sense to call (1|region/state) explicit and (1|region) + (1|state) implicit. Jake On Fri, Apr 6, 2018 at 3:08 AM, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
Dear Jake, IMHO you switched the implicit and explicit nesting. (1|region/state) is explicit because you tell the model that those should be nested. Implicit nesting occurs when you add the random effect as crossed (1|region) + (1|state) but since each state has a unique ID and each state occurs in only one region, the result is identical to the nested random effects. Hence the implicit nesting. Best regards, ir. Thierry Onkelinx Statisticus / Statistician Vlaamse Overheid / Government of Flanders INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND FOREST Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance thierry.onkelinx at inbo.be Havenlaan 88 bus 73, 1000 Brussel www.inbo.be //////////////////////////////////////////////////////////// /////////////////////////////// To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey //////////////////////////////////////////////////////////// /////////////////////////////// 2018-04-05 23:07 GMT+02:00 Jake Westfall <jake.a.westfall at gmail.com>:
Hi Paul, The "(1|state) + (1|region)" and "(1|region/state)" models are equivalent if states are explicitly nested (as opposed to implicitly nested) in regions. Explicitly nested means that the states in each region are
labeled
with unique IDs that don't repeat across regions. This equivalence is because (1|region/state) is just internally expanded to (1|region) + (1|region:state), but the "region:state" interaction term has the same number of unique levels as a simple "state" term if all states have
unique
IDs. The model with just "(1|state)" is not equivalent to either of the aforementioned models. It has fewer parameters and it lumps some of the region variance in with the state variance. As for identifiability, because you have many observations within each state, a model with random effects for both state and region is
definitely
identifiable, there shouldn't be any technical or conceptual problem. And it's probably a better model since it appears there is at least
non-trivial
region variance in your data and the model converges fine. Jake On Thu, Apr 5, 2018 at 3:51 PM, Paul Johnson <pauljohn32 at gmail.com>
wrote:
I don't (yet) have permission to give you this data, but I've asked
for permission and hope I can show in future if you are interested.
One of the students showed me a model and I said "oh, no, that won't
work" but glmer calculated estimates. He ran a model for US public
opinion with "(1|state) + (1|region)" and I thought it should be "(1 |
region/state)", but it turns out results are the same. This is what I
mean "nesting for free" in lme4.
This author's field of study uses lme4 to calculate shrinkage
estimates for states, gender/ethnic subgroups. Things that I would
treat as individual level fixed effects, like gender, age, and
education, are inserted as random effects. Then they use the
conditional modes for constructing other quantities they want.
After the student ran his model, I suggested instead: "(1 | state)".
This also converges, but the state-by-state effect estimates differ
from the nested model. I thought that was interesting. The main
purpose here is to retrieve the conditional modes, NOT the variance
estimates.
Which model is better/correct?
Nested
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: y ~ (1 | regionf/state)
Control: glmerControl(optimizer = "Nelder_Mead")
AIC BIC logLik deviance df.resid
18888.9 18911.5 -9441.4 18882.9 13875
Scaled residuals:
Min 1Q Median 3Q Max
-1.0856 -0.8567 -0.7726 1.1097 1.4309
Random effects:
Groups Name Variance Std.Dev.
state:regionf (Intercept) 0.05461 0.2337
regionf (Intercept) 0.01226 0.1107
Number of obs: 13878, groups: state:regionf, 50; regionf, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.26732 0.06965 -3.838 0.000124 ***
---
Here are region and first few state conditional modes from the nested
model:
$state
(Intercept)
Alabama -0.188083365
Alaska -0.103191227
Arizona 0.049081765
Arkansas -0.108260203
California 0.436002015
Colorado 0.035086075
Connecticut 0.046271020
Delaware 0.080796357
Florida 0.062439432
...
$regionf
(Intercept)
Northeast 0.139654257
Midwest -0.002894665
South -0.106236023
West -0.027772517
For reference, the first few random effects from (1 | state) are:
ranef(individual.modeld)
$state
(Intercept)
Alabama -0.260553538
Alaska -0.116579160
Arizona 0.046677297
Arkansas -0.172353191
California 0.435397028
Colorado 0.032704790
Connecticut 0.162169289
Delaware 0.057368626
Florida -0.013638881
....
The "total state effect" for, say, Alabama is the sum of the South
effect and the Alabama effect,
and that's different from the (1 | state) model.
Alabama nested: -0.106236023 -0.188083365 = -0.2943194
Alabama from (1 | state): -0.26
Those are slightly different. Is one better?
Seems to me the difference between the two is simply the shrinkage
effect of the PLS estimator, there's no substantively important
meaning in it. The shrinkage is applied separately to (1 | region)
and (1 | state). The correlation between region and state effects is
assumed 0. It is manufacturing 2 random effect layers where there
should be just one.
If state is treated as a fixed effect, of course, the region variable
is unidentifiable. Should the random effect model really be allowed to
manufacture a difference? That makes me think (1 | state) is the
"right model" and "(1 | region/state)" is different only as an
artifact.
Likelihood ratio test seems to say there is no big difference, which
is a relief. But what if there were?
individual.modeld: y ~ (1 | state)
individual.modelc: y ~ (1 | regionf/state)
Df AIC BIC logLik deviance Chisq Chi Df
Pr(>Chisq)
individual.modeld 2 18889 18904 -9442.6 18885 individual.modelc 3 18889 18912 -9441.4 18883 2.2477 1
0.1338
More and more, I have the feeling that the random effects model fabricates an ability to estimate quantities that are not really indentifiable. Estimating state-level-random effects on top of randomly assigned regional scores is only possible because of the technicalities related to shrinkage. Maybe. I'm not sure, that's why I'm asking you what you think. pj -- Paul E. Johnson http://pj.freefaculty.org Director, Center for Research Methods and Data Analysis http://crmda.ku.edu To write to me directly, please address me at pauljohn at ku.edu.
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models