
lme4, cloglog vs. binomial link

3 messages · Tibor Kiss, David Duffy, Peter Dalgaard

Hi everybody,

I am a theoretical and computational linguist and use GLMs and GLMMs (glm and lmer) to identify linguistic features relevant for the presence/absence of other linguistic features. I am somewhat uncertain about the interpretation of my results. In the present application, I want to know which linguistic features determine the presence/absence of a determiner (the, a) in constructions similar to "by bus" vs. "We are taking the bus" (in fact, I am working on German data, but that does not matter here). The linguistic features come from annotated language corpora, and almost all of them are categorical, ranging from information pertaining to the words in question (category, morphology, meaning) to the syntactic environment.

I am not so much interested in developing a model that predicts whether a determiner is realized or omitted, but in a subset of features that seem most pertinent to why the determiner is omitted. In most cases, high prediction and variable selection seem to be two sides of the same coin. If, however, the fixed effects are features of the construction (which can be translated into features in a grammar rule) while the lexical items in the construction (the noun in particular, since I develop models where the preposition is kept constant) are taken to be the random effect, then the model can reach high prediction but is itself not so interesting, because it may basically say that whatever the features are, certain lexical items can override everything. (Formally, everything is fine with that result, but grammarians do not like analyses of the form: learn the following nouns by rote, as they can be used in the construction.)

In the following mixed models, *target_noun_lemma* is the representation of the noun in the construction, its categorical value being one of the 712 different nouns in the sample. The sample contains 6,841 different instances of the construction: 810 instances of determiner omission and 6,031 instances of determiner realization.

The distribution of *target_noun_lemma* is highly skewed (which is standard for language samples): the top five nouns occur 1,225, 466, 443, 414, and 304 times, respectively, and the long tail consists of singular occurrences.
Using a mixed model, the lexical influence of the nouns can be determined; it will be reflected in a high standard deviation of the random effect *target_noun_lemma*. While a GLMM taking lexical influence into account would show a higher degree of prediction, it would also indicate that the features employed (i.e., the fixed effects of the model) tell only a less relevant part of the story.

In the following models, the fixed effects describe the presence/absence of an adjective in the construction, whether the noun in the construction is extended by further syntactic material (OP), whether the noun is derived from a verb, and which interpretation the preposition will have.
Generalized linear mixed model fit by the Laplace approximation 
Formula: determiner ~ adja_in_hit + I(cor_mp_int_dep_rel_type == "OP") + TN_LEX_nominalisierung + modal + machtverhaeltnis + zuordnung + bezugspunkt + restriktiv + (1 | target_noun_lemma) 
   Data: unter.data 
  AIC  BIC logLik deviance
 2302 2370  -1141     2282
Random effects:
 Groups            Name        Variance Std.Dev.
 target_noun_lemma (Intercept) 19.241   4.3865  
Number of obs: 6838, groups: target_noun_lemma, 712

Fixed effects:
                                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)                              6.8978     0.4944  13.952  < 2e-16 ***
adja_in_hit1                            -1.4169     0.1795  -7.893 2.95e-15 ***
I(cor_mp_int_dep_rel_type == "OP")TRUE  -0.6685     0.5845  -1.144 0.252735    
TN_LEX_nominalisierung                  -4.7256     0.6551  -7.214 5.43e-13 ***
modal                                   -1.4955     0.1963  -7.620 2.54e-14 ***
machtverhaeltnis                         1.2352     0.3345   3.693 0.000222 ***
zuordnung                                1.5816     0.2646   5.977 2.27e-09 ***
bezugspunkt                             -2.6957     1.6866  -1.598 0.109979    
restriktiv                              -1.6586     0.2742  -6.049 1.46e-09 ***

The intercept has a value of 6.8978, so a random-effect standard deviation of 4.3865 shows that whatever the fixed effects predict can be dwarfed by the random effect. 
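To see the scale of this effect, the intercept and the random-effect standard deviation can be pushed through the inverse logit link. A minimal sketch in Python (estimates copied from the output above; all fixed effects other than the intercept held at zero, which is a simplification):

```python
import math

def inv_logit(eta):
    # inverse of the logit link: log-odds -> probability
    return 1.0 / (1.0 + math.exp(-eta))

intercept = 6.8978   # fixed-effect intercept from the logit model above
re_sd = 4.3865       # SD of the target_noun_lemma random intercept

# An "average" noun (random effect = 0): realization is near-certain.
p_typical = inv_logit(intercept)

# A noun two SDs below the mean: the prediction is largely reversed.
p_low = inv_logit(intercept - 2.0 * re_sd)

print(round(p_typical, 3), round(p_low, 3))  # roughly 0.999 and 0.13
```

A two-SD shift in the random intercept moves the predicted probability of realization from near certainty to well below one half, which is the "dwarfing" described above.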
 
With regard to the skewed distribution of determiner omission/realization, I understand from Zuur et al. (2009:251) that it would make sense to apply a complementary log-log link instead of the logit link used in the model above. Hence I defined the following model, which differs in the link, but also in the response variable, because I assume (hopefully correctly) that the complementary log-log link requires the less frequent response to become the success. I have therefore changed determiner into inv_resp (inverted response), where *yes* stays coded as "yes" and *no* is recoded as "z-no", so that it sorts last and determiner omission becomes the modeled success.
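For reference, the cloglog link, its inverse, and the factor-relabeling trick can be sketched as follows (Python for illustration; the observed omission rate 810/6,841 is only used as an example value):

```python
import math

def cloglog(p):
    # complementary log-log link: eta = log(-log(1 - p))
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    # inverse link: p = 1 - exp(-exp(eta))
    return 1.0 - math.exp(-math.exp(eta))

# Round trip at (roughly) the observed omission rate, 810/6841:
p = 810.0 / 6841.0
assert abs(inv_cloglog(cloglog(p)) - p) < 1e-9

# In R's binomial family, a factor response treats the FIRST level as
# failure, so the second (last-sorting) level is the modeled success.
# Relabeling "no" as "z-no" makes it sort after "yes", turning
# determiner omission into the success.
levels_sorted = sorted(["yes", "z-no"])
print(levels_sorted[-1])  # -> z-no
```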

unter.new.glmm9 <- glmer(inv_resp ~ adja_in_hit + I(cor_mp_int_dep_rel_type == "OP") + TN_LEX_nominalisierung + modal + machtverhaeltnis + zuordnung + bezugspunkt + restriktiv + (1 | target_noun_lemma), data = unter.data, family = binomial(link = "cloglog"))

Data: unter.data 
AIC  BIC logLik deviance
2360 2429  -1170     2340
Random effects:
Groups            Name        Variance Std.Dev.
target_noun_lemma (Intercept) 6.9517   2.6366  
Number of obs: 6838, groups: target_noun_lemma, 712

Fixed effects:
                                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)                             -5.4100     0.2935 -18.434  < 2e-16 ***
adja_in_hit1                             0.9492     0.1424   6.667 2.61e-11 ***
I(cor_mp_int_dep_rel_type == "OP")TRUE   0.3181     0.2545   1.250 0.211386    
TN_LEX_nominalisierung                   3.5858     0.3913   9.164  < 2e-16 ***
modal                                    1.2048     0.1505   8.005 1.20e-15 ***
machtverhaeltnis                        -0.9212     0.2470  -3.729 0.000192 ***
zuordnung                               -1.2208     0.2096  -5.825 5.71e-09 ***
bezugspunkt                              2.4556     1.1068   2.219 0.026503 *  
restriktiv                               1.3635     0.1994   6.838 8.05e-12 ***


The model seems to be worse in terms of AIC (2360 compared to 2302), but the standard deviation of the random effect *target_noun_lemma* dropped from 4.39 to 2.64.
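For scale, the AIC difference can be turned into an evidence ratio using the standard Akaike-weight arithmetic, assuming the two likelihoods are computed comparably (same binary data, merely relabeled). A quick sketch in Python:

```python
import math

# AICs reported for the logit and cloglog models above
aic_logit, aic_cloglog = 2302.0, 2360.0

delta = aic_cloglog - aic_logit          # 58 AIC units in favor of logit
evidence_ratio = math.exp(delta / 2.0)   # relative likelihood of the logit model
print(delta, evidence_ratio)
```

On this reading, a 58-unit gap is not a borderline difference: the logit model is overwhelmingly favored.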

Comparing the BLUPs of both models, the BLUPs of the first (logit) model range from -10.65 to 4.06, while the BLUPs of the second (cloglog) model occupy a smaller range, from -3.24 to 6.90. Given that the response variable is inverted, this means that BLUPs triggering determiner realization remain roughly in the same region (4.06 vs. 3.24), while BLUPs triggering determiner omission lost some influence at the extreme (10.65 vs. 6.90) -- if I interpret the results correctly. 
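Since the two sets of BLUPs live on different link scales, their raw ranges are not directly comparable. One way to compare them is to map each model's most omission-prone noun through its own inverse link. A rough Python sketch, using the estimates quoted above with all other fixed effects held at their reference levels (an oversimplification):

```python
import math

inv_logit = lambda eta: 1.0 / (1.0 + math.exp(-eta))       # inverse logit link
inv_cloglog = lambda eta: 1.0 - math.exp(-math.exp(eta))   # inverse cloglog link

# Model 1 (logit, success = determiner realized):
# intercept 6.8978 plus the lowest BLUP, -10.65, gives the most
# omission-prone noun; convert to a probability of omission.
p_omit_logit = 1.0 - inv_logit(6.8978 - 10.65)

# Model 2 (cloglog, success = determiner omitted):
# intercept -5.4100 plus the highest BLUP, 6.90.
p_omit_cloglog = inv_cloglog(-5.4100 + 6.90)

# Both models put the extreme noun's omission probability near 1,
# despite the very different BLUP ranges on the link scales.
print(round(p_omit_logit, 3), round(p_omit_cloglog, 3))
```

Under these simplifying assumptions, both models assign the extreme noun an omission probability close to 1, which suggests the apparent "loss of influence" is largely an artifact of the change of scale.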

My questions are as follows:

1. Is it correct to assume that given a cloglog link, the less frequent response should be considered the success?
2. Is it correct to conclude that the changes in the model have led to less influence of the random factor?
3. How shall I react to the increase in AIC?

A final question, which may not have an answer at all: I am most curious to learn about possible modifications of the model so that an observed random effect can be minimized (while its presence cannot be denied). 

Thanks very much!

With kind regards

Tibor

-----------------------------------------------------------
Prof. Dr. Tibor Kiss
Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
http://www.linguistics.rub.de/~kiss
On Mon, 4 Jun 2012, Tibor Kiss wrote:

            
I found all this quite dizzying.  I would first look for an optimal link 
function in a fixed effect GLM for a dataset of your top 5 nouns. I don't 
think you can read much into the scale of the random effects estimates 
using different link functions.  The other way of doing these things is 
changing the distribution of the random effects - for a single random 
effect like this there are nonparametric/mixture models (you could 
interpret this as clustering your nouns into families).

Interpretation of the AICs depends on the internals of the loglik 
for the different links.  They should be comparable, in which case 
logit good, cloglog bad.
You can sometimes get rid of a random effect completely by transformation. 
The examples I know of are for continuous Y and crossed factors (additive 
and dominant genetic variances), where one factor can be removed.

Cheers, David Duffy.
1 day later
On Jun 4, 2012, at 13:07, Tibor Kiss wrote:

            
1. No: cloglog is asymmetric, so it will make a difference which outcome is considered the success, but there is no mathematical reason to choose between them. In survival data, the cloglog link comes out of the proportional hazards model when the response is death within a fixed time period (exact date of death not recorded). In that case, death is the "success" (!); hopefully it is the least likely outcome, but it might not be. If cloglog is just used as a generic link function, then no such logic applies.

2. No. The scales are different. At the very least, you need to somehow compare it to the fixed effects on the same scale.

3. (Or, equivalently, the increase in deviance.) The cloglog model seems to give the worse fit to the data.

As for the final question: first, is that desirable, and why? The only logic that I can think of is that you want to get the fixed-effect part of the model right, so that error there is not mistakenly taken up as part of the random variation.
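The asymmetry described above is easy to verify numerically: the logit link is symmetric under swapping success and failure, while cloglog is not. A minimal sketch in Python:

```python
import math

logit = lambda p: math.log(p / (1.0 - p))
cloglog = lambda p: math.log(-math.log(1.0 - p))

p = 0.2

# Logit is symmetric: swapping success and failure just flips the sign,
# so the model is unchanged apart from the signs of the coefficients.
assert abs(logit(p) + logit(1.0 - p)) < 1e-9

# Cloglog is not: cloglog(p) != -cloglog(1 - p), so the choice of
# "success" genuinely changes the fitted model.
print(cloglog(p), -cloglog(1.0 - p))
```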