Skip to content

GLS, GEE or LMM ??

3 messages · ONKELINX, Thierry, Jens Oldeland

#
Dear All, 

I have run into a number of questions, and thus I hope you could help me 
out. I am modelling the effect of oyster density and nutrients on the 
bodyweight of mussels (population average).
Data was sampled at three different stations over 8 years, with values 
measured in springtime once per year.

I was following Zuur et al 2009 Mixed Effects Models (wonderful book!), 
but got lost at some points since different models lead to totally 
different results.

a) the first question is about "centring data". Zuur suggest to center 
parameters (p.334) if they are highly correlated with the intercept. 
When I apply a lme (family=gaussian, random ~ 1 | bank,  correlation = 
corAR1(form = ~ daycount)) I have to center nearly all the values. When 
I apply a GEE then there is no correlation at all (r=0.14).
Actually, centring the data leads to the same output at the end (for the 
lme)

b) Choosing GEE, the effect of one parameter (salinity) is highly 
significant, while using the LMM approach it is not, which would be 
better for our interpretation...
But why? Is it because GEE should not be used on normally distributed 
data? I know that GEE uses sandwich estimator and LMM uses ML. Which one 
would be more "trustworthy" or conservative?

c) one last qeustion: negative AICs, which one is better. AIC: -10 or -5 
? I have read contrasting statements. Is there any proof?? Does it hold 
for BIC as well?

thank you in advance!
Jens
#
Dear Jens,

A random effect with only three levels is not a good idea. You are
estimating a variance on only three numbers. Have a look at the plot
below. It gives the confidence interval of the ratio between the
estimated variance and the true variance. Note that with three levels,
the estimated variance can be from 40 times smaller up to 3.7 times
larger than the true variance. If you have 30 (thirty) levels, this
range is reduced: from 1.8 times smaller up to 1.5 times larger.

n <- seq(2, 100)
low <- qchisq(p = 0.025, df = n - 1) / (n - 1)
high <- qchisq(p = 0.975, df = n - 1) / (n - 1)
plot(n, high, type = "l", ylim = c(0, 5))
lines(n, low)
abline(h = 1, lty = 2)

Therefore I recommend that you add the site variable to the fixed
effects and drop the random effects.

A) Centering continuous data will mostly only affect the estimates of
the intercept. The intercept is the expected value of your respons when
all variables are zero (or at their reference level). So if you have a
timeseries ranging from 2000 to 2010, then the intercept is the value in
the year 0. When you center year to 2000 (year = 2000 --> cyear = 0),
then the intercept will be the expected value in the year 2000. The
first is non sense given your time series, the latter has a practical
interpretation. Note that both model will be mathematically identical
but just use a different parametrisation.

B) Given that you have only three levels, neither a LMM nor GEE will be
a good model. So comparing them is not a good idea.

C) Lower AIC is always better. So -10 is better than -5. AIC = 2 k - 2
log(L) with k = number of parameters, L = likelihood. Models with a high
likelihood will have a lower AIC (if the number of parameters are
equal).

HTH,

Thierry


------------------------------------------------------------------------
----
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek
team Biometrie & Kwaliteitszorg
Gaverstraat 4
9500 Geraardsbergen
Belgium

Research Institute for Nature and Forest
team Biometrics & Quality Assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium

tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be
www.inbo.be

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey
Druk dit bericht a.u.b. niet onnodig af.
Please do not print this message unnecessarily.

Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver weer 
en binden het INBO onder geen enkel beding, zolang dit bericht niet bevestigd is
door een geldig ondertekend document. The views expressed in  this message 
and any annex are purely those of the writer and may not be regarded as stating 
an official position of INBO, as long as the message is not confirmed by a duly 
signed document.
#
Dear Thierry,

thank you very much for your help! However, I think I have not explained 
my approach very good.
I am using this formula

M1.1.lme <- lme(aWert ~ Salinity +  pH + chl.a + NO3 + oyster_qm + 
meanspring,  random = ~ 1 | bank,  na.action=na.omit, method="ML",  
data=mussels,  correlation = corAR1(form = ~ datumszahl))

hence six variables for the fixed effect, bank (station) as the location 
effect and "datumszahl" for the time effect. Datumszahl is a numeric 
that replaces a certain date. For example 35932  would be 17. May 98. 
Hence I am not using year 2000 but day..35000? oops :-)

Do you still think that six variables are not enough to calculate a LMM 
or GEE?
But than...what is the purpose of such models when they do not work with 
a small set of variables?

thinking,
Jens



ONKELINX, Thierry schrieb: