Skip to content

Bernoulli glmm question.

5 messages · Rolf Turner, Marko Bachl, Tibor Kiss +1 more

#
I am trying to help a graduate student in linguistics analyse her data. 
  Very much a case of the blind leading the blind, but I gotta try!

Summary of the structure of the data:

A number of (Mandarin speaking) students are assessed on their 
pronunciation of a suite of "test items" --- English language words.
(E.g. umbrella, helicopter, knife.)  They are assessed phoneme by 
phoneme in each word.  The response, at least in the context of my 
question, is whether they got the pronunciation right (y = 1) or wrong 
(y = 0).

The phonemes are classified into 7 types:

* Initial consonant
* Medial consonant
* Final consonant
* Initial consonant cluster
* Medial consonant cluster
* Final consonant cluster
* vowel

The students are classified by sex ("gender" to those wimps who are too 
embarrassed to say the word "sex").

I thought to fit a Bernoulli model with "type" (of phoneme) and sex (of 
the student) as predictors, with "student" being a random effect.

The syntax that I tried was:

fit <- lmer(y ~ sex + type + (1 | student), family = binomial, data = X)

where "X" is a data frame containing the relevant variables.

Main effects only, no interactions, so as to keep things simple --- at 
least initially.

First impressions from the fit:  Girls do significantly better than 
boys, and vowels are significantly easier than final consonant clusters 
(which form the baseline) and initial and medial clusters are 
significantly harder for the kids to pronounce than are the final 
clusters.  Single consonants (initial, medial, and final) do not differ 
significantly from the baseline in their difficulty level.

The bit about vowels being easier conforms to the graduate student's 
expectations and is kind of obvious from a rough inspection of the data.

There are 50 "test items" (words).  In the data set that I am initially 
looking at there are 54 students.  There are a total of 10314 observations.

(I am just looking at the oldest group of students to start with.  There 
are 6 other groups and eventually I will put all 7 groups together and 
investigate an age (or "level") effect as well.)

Would anyone be kind enough to comment on my efforts so far? Please try 
not to be too rude! :-)  Am I on the right track? Am I overlooking any 
glaring traps for young players? Have I got the syntax of my call to 
lmer() correct?

One thing that I am nervous about:

If I fit the "trivial model"

fit0 <- lmer(y ~ 1 + (1 | student), family = binomial, data = X)

the resulting coefficients are just the estimates (BLUPs?) of the 
"random intercepts, is it not so?  If I calculate the variance of these 
coefficients:

	var(coef(fit0)$student[,1])

I get 0.0226.  I thought that this value would be "pretty similar to" 
(though not exactly the same as) the estimated random effect variance. 
But the latter is 0.0502 --- which seems to me to be quite different.

A 95% confidence interval for sigma^2 on the basis of my "var(coef ...)" 
calculation, assuming that (n-1)*s^2/sigma^2 ~ chi-squared_{n-1},
is [0.0160, 0.0345] (to 4 decimal places) so the estimated random effect 
variance from fit0 is "significantly different" from my naive estimate.

My thinking must be out to lunch here.  Can someone put me back on the 
rails.  (My humblest apologies for the mixed metaphors. :-) )

Thanks for any words of wisdom.

cheers,

Rolf Turner
#
Dear Rolf,
just a short technical reply to your question at the end - if a part
of your question is how to technically extract the random effect
variances from a lmer() fit.
The syntax to get the estimates of the random intercept variances from
a lmer() fit is

sapply(VarCorr(fit0), diag)

which should give you exactly the estimates from summary(fit0). There
has been some discussion on the list on why var(coef()) does not work,
but I cannot recall the answer or find the relevant thread.

Overall, your analytical approach seems to make sense to me - but I am
by no means an expert. I was wondering whether a second random effect
of "word" should (has to be?) included, if all students are tested
with the same words. Such a model would account for the assumption
that there are other characteristics of the words beyond the list of
your "phonemes" which have an effect on correct pronunciation. The
model would be

fit1  <- lmer(y ~ 1 + (1 | student) + (1 | word), family = binomial, data = X)

where "word" and "student" are crossed random effects, which would
perfectly balanced if every student was tested with every word. But
again: I am not exactly familiar with this kind of research, so I am
just guessing...

Best regards
Marko
#
Hi,

I cannot say much about your question concerning the variance, but I would probably include "word" as a random factor as well. It's not easy to understand from your email, but I assume that each word is 'cut' into phonemes, so that your 10314 observations are actually 10314 slices of the 50 words fed to the 54 students. So "y" will be 0 or 1 for a phoneme. It might be the case that the whole word influences the pronunciation, and the words have been chosen at random, I assume, so I would include them as a random factor. 

Also, I would use glmer directly, but that might be cosmetics.

With kind regards

Tibor

----------------------------------------------
Prof. Dr. Tibor Kiss
Sprachwissenschaftliches Institut
Ruhr-Universit?t Bochum
www.linguistics.rub.de/~kiss


Am 13.03.2014 um 02:55 schrieb Rolf Turner <r.turner at auckland.ac.nz>:
#
Hi,

for the case given by Rolf Turner, I would add word as a random factor. 
"word" appears as a grouping factor for the observation  "type"

  I am " more nervous ( :)  )" about  the relevance of  more complex 
random effect designs,
accounting for correlations between type and student , and/or  type and 
(word and student)
(To tell the truth, I have similar  questions about  a model I'm 
currently  working on),  as:

fit2<- glmer(y ~ sex + type + (type | student) +(1|word), family = binomial, data = X)
where there are 7  (possibly correlated), random effects for each student

or
fit3<- glmer(y ~ sex + type + (type | student) +(type|word), family = binomial, data = X)
where there are 7  (possibly correlated), random effects for each student,
and 7  (possibly correlated), random effects for for each word

D. Bates suggested , at least as a starting point:

fit3a<- glmer(y ~ sex + type + (1|student ) +(1+type:student) +(1|word)+ (1+ type:word), family = binomial, data = X)

which is a simpler model to estimate (without any correlations), still accounting for the interactions between type and (student, word)


Which would be the "right " model ?
Is the data-driven (forward) selection of random terms still accepted  (or not, according to D. Barr ) ?
Recent discussions on this list showed at least some controverse about it.

I would be glad to be "corrected" , enlighted or advised on these points.
Thanks, in advanced, for your attention

Best regards

Robert Espesser
CNRS UMR  7309 - Universit? Aix-Marseille
5 Avenue Pasteur
13100 AIX-EN-PROVENCE

Tel: +33 (0)413 55 36 26




Le 13/03/2014 08:40, Tibor Kiss a ?crit :
#
Thanks to Marko Bachl, Tibor Kiss, Robert Espesser, and off-list Kevin 
Thorpe and Robert Kushler who all provided useful feedback and advice.

The overall import of the replies was that I should include ***word*** 
as a random effect, which I shall do.  In retrospect it's kind of 
obvious that this effect should be included.  Duhhhh!

Robert Espesser also suggested that I might wish to include some 
interaction terms in the model. He's almost surely correct, but I'm a 
bit reluctant to do so as that would take me (and definitely) the 
graduate student well out of our comfort zones.  I'll play around with 
this idea and try the various models that Prof. Espesser suggested, and 
see how much difference they make to the inferences to be drawn about 
the fixed effects.

Thanks again.

cheers,

Rolf Turner