Cluster-robust SEs & random effects -- seeking some clarification
Thanks, James.

The McNeish & Kelley (2019) paper is one I was not aware of, despite having read several other Kelley-authored articles. Indeed, that paper provides a point of departure for a question about my work on the Bangladesh RCT mask-intervention study mentioned earlier.

In short: they used cluster-affiliated dummy variables (read: the pairID variable) in a fixed-effects model. For their linear run with baseline controls, their Stata code was:

    reghdfe posXsymp treatment proper_mask_base prop_resp_ill_base_2, absorb(pairID) vce(cluster union)

In translating this to a random-effects model using lmer, does it make sense to include the pairID variable in the model *if* I treat the cluster variable as its own random effect, as in:

    # lme4 package
    lme4_1_B = lmer(posXsymp ~ treatment + proper_mask_base + prop_resp_ill_base_2 + pairID + (1 | union), data = bdata.raw1)

I have mentioned previously that the lmer code above is a random-intercepts-only model. This is by design, as there are mean-level differences between the clusters to begin with on several background variables, and these are captured by the random effects. I am also making a conceptual case that, for the mask study to have appropriate generalizability, one must assume or treat clusters as *randomly* selected from a larger population of clusters. Otherwise, any marginal effect of the mask intervention (while perhaps more accurately estimated in a fixed-effects model) is not going to generalize to any population of human interactions.

My focal question nonetheless concerns how to treat the pairID variable in my translation of their fixed-effects model to a random-effects model in lmer. If I include the pairID variable as above, what does it reflect, given that cluster is treated as a random effect?
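To make the question concrete, here is a minimal simulation sketch, not the Bangladesh data. It assumes pairID is stored as a factor (if it were numeric, lmer would fit a single linear slope for it rather than one dummy per pair). The design (100 pairs, two unions per pair, 30 people per union), the variable names sim, y, and u, and all variance values are hypothetical choices made only for illustration. Under those assumptions it fits the random-intercept model with and without pairID and compares the estimated union-level variance and the treatment estimate.

```r
library(lme4)

set.seed(1)

# Hypothetical balanced design (NOT the study data):
# 100 pairs, 2 unions per pair (one treated), 30 individuals per union.
n_pairs     <- 100
n_per_union <- 30

unions <- data.frame(
  pairID    = factor(rep(seq_len(n_pairs), each = 2)),
  union     = factor(seq_len(2 * n_pairs)),
  treatment = rep(c(0, 1), times = n_pairs)
)

# Union-level intercept = shared pair shift + union-specific shift.
# The SDs (1.0 and 0.5) are arbitrary illustrative values.
pair_shift <- rnorm(n_pairs, sd = 1.0)
unions$u   <- pair_shift[as.integer(unions$pairID)] +
  rnorm(2 * n_pairs, sd = 0.5)

# Expand to individual-level rows and generate the outcome.
sim   <- unions[rep(seq_len(nrow(unions)), each = n_per_union), ]
sim$y <- 0.3 * sim$treatment + sim$u + rnorm(nrow(sim), sd = 1)

# Random intercept for union, without and with pair dummies.
m_no_pair <- lmer(y ~ treatment + (1 | union), data = sim)
m_pair    <- lmer(y ~ treatment + pairID + (1 | union), data = sim)

# Helper: estimated variance of the union random intercept.
var_union <- function(m) {
  vc <- as.data.frame(VarCorr(m))
  vc$vcov[vc$grp == "union"]
}

# Helper: standard error of the treatment coefficient.
se_trt <- function(m) coef(summary(m))["treatment", "Std. Error"]

comparison <- data.frame(
  model     = c("without pairID", "with pairID"),
  var_union = c(var_union(m_no_pair), var_union(m_pair)),
  trt_est   = c(fixef(m_no_pair)[["treatment"]], fixef(m_pair)[["treatment"]]),
  trt_se    = c(se_trt(m_no_pair), se_trt(m_pair))
)
print(comparison)
```

In this balanced sketch the two fits give the same treatment point estimate. What changes is that the pairID dummies absorb between-pair differences, so the (1 | union) variance reflects only union-to-union differences within a pair, and the treatment standard error shrinks accordingly. With unbalanced data or individual-level covariates, the point estimates need not coincide.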
I have a separate model where I eliminate the pairID variable:

    # lme4 package
    lme4_1 = lmer(posXsymp ~ treatment + proper_mask_base + prop_resp_ill_base_2 + (1 | union), data = bdata.raw1)

*What is the substantive difference between these two models?*

My sense is that this gets at the separation of between/within effects, and that the pairID variable in their original Stata fixed-effects model (a cluster-affiliated variable, in the language of McNeish & Kelley) is analogous to the cluster variable itself, BUT in their model: a) the assumption is that clusters are interchangeable (not drawn from a random population); and b) one cannot estimate within-cluster/between-cluster effects using their parameterization, since that requires random effects (in my case, intercepts) for the clusters.

I realize this is a bit of a mouthful, but I was inspired to post after reading McNeish & Kelley and needed to get this out for my own thinking.

-JD

On Mon, Aug 15, 2022 at 10:00 PM James Pustejovsky <jepusto at gmail.com> wrote:
When you note, 'if you trust the specification of your random effects structure', can you elaborate on this? I imagine that, in the extreme, no random effects structure will ever truly be perfect, so I guess it comes down to some combination of theory, practicality, and model tractability?
Sure. Clearly, any model is a stylized and approximate representation of the true process. By "trust the specification" I just mean that you--and usually, also readers or potential critics--think that the random effects structure of the model is an adequate representation of the features of the data-generating process. In more colloquial terms, did you (the analyst) do a good job of developing the model?

I think it's pretty helpful to think about this stuff in terms of convincing an audience. In practice, and given the current reporting conventions in social science disciplines, it's often pretty hard for readers/reviewers/critics to gauge whether an analyst has done a good job. In such contexts, cluster-robust SEs give some additional assurance (or insurance, the analogy in my previous message) that the inferences can be trusted even if the analyst didn't engage in a thorough, diligent model-building process.

James