lme4, cloglog vs. binomial link
On Mon, 4 Jun 2012, Tibor Kiss wrote:
In the following mixed models, *target_noun_lemma* is the representation of the noun in the construction, its categorical value being one of the 712 different nouns in the sample. The sample contains 6,841 different instances of the construction: 810 instances of determiner omission and 6,031 instances of determiner realization.
The distribution of *target_noun_lemma* is highly skewed (which is standard for language samples): the top five nouns occur 1,225, 466, 443, 414, and 304 times.

The model seems to be worse in terms of AIC (2360 compared to 2302).

1. Is it correct to assume that given a cloglog link, the less frequent response should be considered the success?
2. Is it correct to conclude that the changes in the model have led to less influence of the random factor?
3. How shall I react to the increase in AIC?
I found all this quite dizzying. I would first look for an optimal link function in a fixed-effects GLM for a dataset of your top 5 nouns. I don't think you can read much into the scale of the random effects estimates under different link functions, because each link puts the linear predictor on a different scale.

The other way of doing these things is changing the distribution of the random effects: for a single random effect like this there are nonparametric/mixture models (you could interpret this as clustering your nouns into families).

Interpretation of the AICs depends on the internals of the loglik for the different links. They should be comparable, in which case logit good, cloglog bad.
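On the first question, it may help to see why the coding of the response matters at all under cloglog. The following is only a minimal sketch in plain Python (not lme4), using the 810 of 6,841 omission rate from the sample; the helper names `logit`, `cloglog`, `inv_cloglog` and the variable `p_omit` are illustrative, not anything from the original models. It shows that the logit link is symmetric under swapping success and failure, while the cloglog link is not, so recoding the response under cloglog gives a genuinely different model rather than a sign flip.

```python
import math

def logit(p):
    """Logit link: log odds of p."""
    return math.log(p / (1.0 - p))

def cloglog(p):
    """Complementary log-log link: log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    """Inverse of the cloglog link."""
    return 1.0 - math.exp(-math.exp(eta))

# Share of determiner omission in the sample (810 of 6,841 instances).
p_omit = 810 / 6841

# Print both links for each possible coding of "success".
for label, p in [("omission", p_omit), ("realization", 1.0 - p_omit)]:
    print(f"{label:12s} p={p:.4f}  logit={logit(p):+.4f}  cloglog={cloglog(p):+.4f}")

# Logit: swapping success and failure only flips the sign of the linear predictor.
logit_symmetric = math.isclose(logit(p_omit), -logit(1.0 - p_omit))

# Cloglog: the same swap does not reduce to a sign flip.
cloglog_symmetric = math.isclose(cloglog(p_omit), -cloglog(1.0 - p_omit))

print("logit symmetric under recoding:  ", logit_symmetric)
print("cloglog symmetric under recoding:", cloglog_symmetric)
```

Because of this asymmetry, fitting the cloglog model with each coding in turn and comparing the fits is a reasonable check, rather than assuming in advance which response should count as the success.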
I am most curious to learn about possible modifications of the model so that an observed random effect can be minimized.
You can sometimes get rid of a random effect completely by transformation. The examples I know of are for continuous Y and crossed factors (additive and dominant genetic variances), where one factor can be removed.

Cheers,
David Duffy.