is multicollinearity of fixed effects resolved by random effects

Tue, May 20, 2008 12:28 AM

Disclaimer: The following recommendation of a sequence of steps is not
the only one, perhaps not even the best one. It has worked for me in
the past.

First, I assume
- that centering was done with scale = FALSE (i.e., without rescaling),
- that you checked that linearity is defensible for the relation
between X13C and your three predictors (i.e., that you do not need
quadratic terms for some of them),
- that you used method = "ML" for comparison of models with identical
random but different fixed effect parts and that you read up on some of
the complications associated with such comparisons discussed on this
list. (You can use method = "REML" for comparison of models differing
only in the random effects; again, there are some qualifications.)
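
As a small illustration of the first assumption, centering without rescaling can be done with base R's scale(). The temperature values below are made up for the example; they are not the data discussed in this thread.

```r
# Hypothetical mean annual temperatures (made-up values, not the thread's data)
MAT <- c(-2.5, 1.0, 4.3, 9.8, 15.2, 24.1)

# Center only: subtract the mean, keep the original units
cMAT <- as.numeric(scale(MAT, center = TRUE, scale = FALSE))

# Center and rescale to unit SD (what scale() does by default)
zMAT <- as.numeric(scale(MAT))

mean(cMAT)          # effectively zero
sd(cMAT) - sd(MAT)  # zero: centering leaves the spread unchanged
sd(zMAT)            # 1: the default also changes the units
```

With scale = FALSE the coefficients stay in the original units (e.g. per degree), which keeps them easier to interpret.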

Second, I recommend that now you focus on the fixed-effects part. It
still is a bit random (pun intended).

If taking out the three-factor interaction does no harm, you may want
to remove the non-significant two-factor interactions. After those are
gone, you may want to check whether there is a non-significant main
effect that is not part of an interaction (cMAP?). Then you may want
to take this one out, too. Such a hierarchical dropping of effects
may lead you to your current favorite:
X13C ~ cMAT * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
which expands to:
X13C ~ cMAT + cLAT + cMAT:cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
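
The expansion of the * operator can be checked in base R with terms(). This sketch uses only the fixed-effects part of the formula (the random-effects terms are left out so that lme4 is not needed); no data are required.

```r
# Fixed-effects part of the model formula only (random-effects terms omitted
# so that base R's terms() can be used without lme4)
f <- X13C ~ cMAT * cLAT + TYPE

# term.labels lists the main effects and interactions implied by the * operator
fixed_terms <- attr(terms(f), "term.labels")
print(fixed_terms)
```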
Note there are often good arguments for keeping theoretically
interesting effects in the model, even if they are not significant!

Third, you should plot your effects to get a good idea about the
source of the interaction.
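
One possible way to look at the source of an interaction is to compare the slope of one predictor at a low and a high value of the other. The sketch below uses simulated data and lm() as a stand-in for lmer(), both of which are assumptions made only to keep the example self-contained in base R; the variable names are borrowed from the thread.

```r
set.seed(1)
n <- 200
cMAT <- rnorm(n, sd = 5)    # simulated, centered temperature
cLAT <- rnorm(n, sd = 15)   # simulated, centered latitude
# Simulated response with a built-in cMAT:cLAT interaction
y <- -25 + 0.10 * cMAT - 0.02 * cLAT + 0.01 * cMAT * cLAT + rnorm(n)

fit <- lm(y ~ cMAT * cLAT)  # lm() stands in for lmer() in this sketch
b <- coef(fit)

# Slope of cMAT at a given latitude: b[cMAT] + b[cMAT:cLAT] * cLAT
slope_at <- function(lat) unname(b["cMAT"] + b["cMAT:cLAT"] * lat)
lat_lo <- -15
lat_hi <- 15
c(low = slope_at(lat_lo), high = slope_at(lat_hi))

# Predicted values on a grid, ready for plotting one line per latitude
grid <- expand.grid(cMAT = seq(-10, 10, by = 5), cLAT = c(lat_lo, lat_hi))
grid$pred <- predict(fit, newdata = grid)
```

Plotting grid$pred against grid$cMAT, with one line per value of cLAT, shows whether the lines diverge, cross, or merely differ in steepness.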

Finally, you specify various random-effects parts and try to
understand what they mean (see previous posts to this list), for
example:
(1|SITE)
(1|GENUS)
(1|GENUS) + (1|SITE)
(1|GENUS) + (1|GENUS:SITE)   equivalent to  (1|GENUS/SITE)
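
To see what the GENUS:SITE grouping factor in the last specification looks like, the following base R sketch builds it explicitly for a toy data set. The genus and site labels are made up for illustration.

```r
# Toy data: genera observed at several sites (made-up labels)
d <- data.frame(
  GENUS = c("Amanita", "Amanita", "Amanita", "Russula", "Russula", "Lactarius"),
  SITE  = c("A", "A", "B", "A", "C", "C")
)

# (1|GENUS/SITE) is shorthand for (1|GENUS) + (1|GENUS:SITE); the second
# grouping factor has one level per observed genus-by-site combination
d$GENUS_SITE <- interaction(d$GENUS, d$SITE, drop = TRUE)

n_genus      <- nlevels(factor(d$GENUS))
n_genus_site <- nlevels(d$GENUS_SITE)
c(genus = n_genus, genus_by_site = n_genus_site)
```

If most genus-by-site combinations contain a single observation, the GENUS:SITE variance is hard to separate from the residual variance, which is worth checking before interpreting that term.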

Then allow the significant fixed effects to vary for the random effects.

Reinhold Kliegl

On Tue, May 20, 2008 at 1:55 AM, Jordan Mayor <clavulina at gmail.com> wrote:

Thanks for all of your comments.  I have done three new things to my fungal
data since last posting my comments:

1) I realized that in my previous email I had centered my response
variables, not the predictors (sorry - I had previously centered the isotope
data for discriminant analyses), which I have now fixed (e.g. cMAT, cMAP,
cLAT),
2) I have included TYPE as a fixed effect instead of creating separate models
for mycorrhizal and saprotrophic fungi, to simplify, and
3) I have decided to compare model sets using GENUS instead of SPECIES
as the random effect in the following lmer models; both sets converged on
the same model according to AIC scores.

Justifications for the above actions:
1) Fixed previous error.  The centered predictors have a much lower
correlation with one another now, suggesting I may have finally solved the
severe multicollinearity problem?  I am not sure whether this is a correct
interpretation, however.

center=lm(cMAT~cLAT,data=dGen)
summary(center)

Call:
lm(formula = cMAT ~ cLAT, data = dGen)
Residuals:
    Min      1Q  Median      3Q     Max
-17.817  -3.206   1.229   2.229  10.313
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.920757   0.181696  -5.068 4.88e-07 ***
cLAT        -0.213731   0.009673 -22.097  < 2e-16 ***
Residual standard error: 5.466 on 911 degrees of freedom
Multiple R-squared: 0.3489,     Adjusted R-squared: 0.3482
F-statistic: 488.3 on 1 and 911 DF,  p-value: < 2.2e-16
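
It may help to check what centering can and cannot change. In the simulated sketch below (variable names borrowed from the thread, values invented), the correlation between two predictors is the same before and after centering; what centering usually reduces is the correlation between a predictor and a product (interaction) term built from it.

```r
set.seed(42)
n <- 500
LAT <- runif(n, 5, 70)                       # simulated latitudes
MAT <- 25 - 0.2 * LAT + rnorm(n, sd = 5)     # temperature falls with latitude

cLAT <- LAT - mean(LAT)
cMAT <- MAT - mean(MAT)

# Correlation between the two predictors: unchanged by centering
r_raw      <- cor(MAT, LAT)
r_centered <- cor(cMAT, cLAT)

# Correlation between a main effect and the product term: usually much
# smaller in absolute value after centering
r_prod_raw      <- cor(LAT, MAT * LAT)
r_prod_centered <- cor(cLAT, cMAT * cLAT)

round(c(r_raw, r_centered, r_prod_raw, r_prod_centered), 3)
```

So a drop in the correlation between two main-effect predictors after a change in centering would more likely reflect a change in which variables were used than an effect of centering itself.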

2) TYPE provides good a priori knowledge about the physiology of fungi (i.e.
trophic role in the ecosystem) and this is reflected in my dependent
variables, whose variance I am trying to model: the carbon and nitrogen
isotopes in fungi.

3) Fungal genera are much easier to taxonomically identify and thus are less
prone to collector misidentification, regional biases, or spelling errors!
The family level may be too coarse for ecological comparisons because many
genera within families can form both of the trophic roles I am trying to
model.  I don't see any reason to use GENERA as a fixed effect however,
because individual life histories (host plant/tissue, fungal age, site
fertility, site stress) at each SITE could modify isotope values in very
unknown (random) ways.  I will definitely look into using the "ape" package
in future research, thanks for pointing this out Simon, and yes there is now
a phylogeny for fungi to which this could be applied - see Blackwell et al.
2006, Mycologia 98:829-837, and Hibbett et al. 2007, Mycological Research
111:509-547, if anyone is interested.

Using TYPE as fixed and GENERA as random provided somewhat better dispersion
across sites as evidenced in the following table previously requested by
Douglas Bates:

table(table(dGen$GENUS))

  0   1   2   3   4   5   6   7   8   9  10  11  13  15  16  17  19  25  28  35  54  86  97
  1  63  21  15   6   5   9   7   1   3   2   2   1   2   1   2   1   1   1   2   1   1   1
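
For readers unfamiliar with the table(table(...)) idiom: the inner table() counts specimens per genus, and the outer one counts how many genera have each specimen count. A column headed 0 appears when the factor has a level with no rows, for example an unused level left over after subsetting. The genus labels below are made up.

```r
# Made-up genus labels: one per specimen
genus <- c("Amanita", "Amanita", "Amanita", "Russula", "Russula",
           "Lactarius", "Cortinarius", "Boletus", "Boletus")

per_genus <- table(genus)            # specimens per genus
freq_of_counts <- table(per_genus)   # number of genera with 1, 2, 3, ... specimens

per_genus
freq_of_counts
```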

xtabs(~ SITE, dGen)

SITE
Aheden                                33
Aheden 2                              53
Ashiu (temp deciduous broadleaf)      40
Betsele                                5
Breuil, France                        47
Chiba                                  9
d                                     87
Flakaliden                            21
Glacier Bay, Alaska                    8
Guyana                                49
h                                     92
heath tundra, subarctic Sweden        14
Kagoshima                              1
Kulbacksliden                          8
Kyoto                                 54
Lamar Haines                          25
Lambir (lowland tropical)             31
Miyajima                               2
Norikura                              12
Norrliden                              1
Okinawa                                1
Ontake (subalpine coniferous)         17
Oodai                                  2
pine forests in CA                    43
Shirahama                              3
Snowbowl                              22
Spruce plantation                     37
Stadsskogen                          123
Svartberget                            5
Tanigawa                               8
tussock tundra near Toolik Lake, AK    8
Vilan                                  7
Woods Creek                           45

The new models are as follows:

Models with SPECIES included:
d.mod0: X13C ~ TYPE + (1 | SITE) + (1 | SPECIES)
d.mod5: X13C ~ cMAT + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod6: X13C ~ cMAP + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod7: X13C ~ cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod8: X13C ~ cMAT * cLAT + (1 | SITE) + (1 | SPECIES)
d.mod2: X13C ~ cMAT * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod3: X13C ~ cMAT * cMAP + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod4: X13C ~ cMAP * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod1: X13C ~ cMAT * cMAP * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
          Df     AIC     BIC   logLik     Chisq Chi Df Pr(>Chisq)
d.mod0.p   6  2594.8  2622.8  -1291.4
d.mod5.p   7  2596.4  2629.1  -1291.2    0.4031      1  0.5255178
d.mod6.p   7  2596.7  2629.5  -1291.3    0.0000      0  < 2.2e-16 ***
d.mod7.p   7  2596.7  2629.5  -1291.4    0.0000      0  < 2.2e-16 ***
d.mod8.p   7  2926.1  2958.9  -1456.1    0.0000      0  < 2.2e-16 ***
d.mod2.p   9  2563.2  2605.4  -1272.6  366.8690      2  < 2.2e-16 ***   <- Best fit
d.mod3.p   9  2585.3  2627.4  -1283.6    0.0000      0  < 2.2e-16 ***
d.mod4.p   9  2584.6  2626.7  -1283.3    0.6785      0  < 2.2e-16 ***
d.mod1.p  13  2569.9  2630.8  -1272.0   22.6361      4  0.0001497 ***

Those same models with GENERA instead of SPECIES:
             Df     AIC     BIC   logLik     Chisq Chi Df Pr(>Chisq)
dGen.mod0.p   5  2428.6  2451.8  -1209.3
dGen.mod5.p   6  2430.0  2457.7  -1209.0    0.6296      1   0.427500
dGen.mod6.p   6  2430.6  2458.4  -1209.3    0.0000      0  < 2.2e-16 ***
dGen.mod7.p   6  2430.6  2458.4  -1209.3    0.0000      0  < 2.2e-16 ***
dGen.mod8.p   7  2515.3  2547.6  -1250.6    0.0000      1   1.000000
dGen.mod2.p   8  2401.6  2438.6  -1192.8  115.6470      1  < 2.2e-16 ***   <- Best fit again
dGen.mod3.p   8  2418.6  2455.5  -1201.3    0.0000      0  < 2.2e-16 ***
dGen.mod4.p   8  2418.4  2455.3  -1201.2    0.2163      0  < 2.2e-16 ***
dGen.mod1.p  12  2408.8  2464.3  -1192.4   17.5047      4   0.001542 **
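
Several rows in the anova() tables have a Chi Df of 0 because adjacent models have the same number of parameters and are not nested, so the Chisq and p-value columns in those rows carry no information. For such models the AIC column can still be compared directly. A minimal sketch, with the AIC values copied from the GENUS table (the short model names are labels chosen here):

```r
# AIC values copied from the GENUS table above
aic <- c(mod0 = 2428.6, mod5 = 2430.0, mod6 = 2430.6, mod7 = 2430.6,
         mod8 = 2515.3, mod2 = 2401.6, mod3 = 2418.6, mod4 = 2418.4,
         mod1 = 2408.8)

# Difference from the smallest AIC; 0 marks the best-supported model
delta <- sort(aic - min(aic))
round(delta, 1)
best <- names(delta)[1]
best
```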

I am still uncertain whether I have modeled my random effects properly, but
it seems that Model 2 is robust regardless of whether I model the random
effects as (1|SITE/GENUS) or (1|SITE) + (1|GENUS).  BUT when comparing the
two best-fit models to each other, the one with more df is significantly
different, suggesting I should choose the simpler of the two without strong
evidence to support nesting.

Models:
dGen.mod2.1:   X13C ~ cMAT * cLAT + TYPE + (1 | SITE:GENUS)
dGen.mod2:      X13C ~ cMAT * cLAT + TYPE + (1 | SITE/GENUS)
               Df     AIC     BIC   logLik   Chisq Chi Df Pr(>Chisq)
dGen.mod2.1.p   7  2421.2  2453.6  -1203.6
dGen.mod2.p     8  2417.9  2454.8  -1200.9  5.3363      1    0.02089 *
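
The p-value and AIC in this comparison can be reproduced by hand from the reported numbers, which may help when reading such tables. Read conventionally, a small p-value in a likelihood-ratio test is evidence for the additional variance component in the larger model, not against it.

```r
# Values copied from the anova() table above
chisq <- 5.3363                 # reported likelihood-ratio statistic
df_diff <- 8 - 7                # difference in number of parameters

p_value <- pchisq(chisq, df = df_diff, lower.tail = FALSE)
round(p_value, 5)

# AIC = -2 * logLik + 2 * (number of parameters); logLik is rounded in the
# table, so the result only approximately matches the reported AIC of 2417.9
aic_mod2 <- -2 * (-1200.9) + 2 * 8
aic_mod2
```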

Sorry if I have overloaded you all with output - I just thought some would
be interested and I wanted to follow up.  I would be very interested in
knowing if anyone has any comments on my interpretations or could suggest
further model configurations.  Thank you very much.

Jordan Mayor


On Sun, May 18, 2008 at 8:33 PM, Simon Blomberg <s.blomberg1 at uq.edu.au>
wrote:

It is easy to incorporate phylogenetic correlations among species
("taxonomic factors"), using the ape package and lme. This is far better
than combining species into families or genera, as taxonomic hierarchies
are subjective, artificial, and rarely represent the true phylogeny. I
strongly disagree that taxonomic factors necessarily function as fixed
effects. The phylogeny represents what we think we know about the
covariance among species in all traits, measured or unmeasured. I can't
see how unmeasured traits or poorly-defined "taxonomic factors" can
possibly be included as fixed effects. If particular measured traits are
thought to be important in determining the mean response, they should be
included as fixed effects, but phylogenies are used to model the
covariance, not the mean. Unusual species should stand out by
examination of the normalized residuals.

My approach would be to ditch species as a factor altogether, and
incorporate phylogenetic effects through the correlation argument in lme
(or gls if there are no other random effects). This assumes there is an
available phylogeny for the fungi, which may not be true.

There is a very large literature on incorporating phylogeny into
analyses (to which I am afraid I am a small contributor).

Simon.

On Mon, 2008-05-19 at 10:02 +1000, John Maindonald wrote:

I think grouping them into families is a very good idea.
This makes use of prior insight, reduces the number of
parameters to an extent that it becomes more reasonable
to think about fixed effects, and you can look for individual
species that stray from the path laid out for them by their
families.  For the fixed effects analyses that I suggested,
you might do these by families.

Conceptually, there are taxonomic factors that surely
function, no doubt in interaction with location variables,
as fixed effects.  I consider that one ought to start by thinking
of them as fixed effects, unless it can be demonstrated that
data are indistinguishable from random variation.

Maybe however those effects operate more at the level of
genera or families than at the level of species.  Responding
to this point may be a useful aim for the study.

John.

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.


On 19 May 2008, at 4:04 AM, Jordan Mayor wrote:

Thank you JM, RK, and DB for your thoughtful responses to my
questions.  I am working on posting the file on the web and will
respond as soon as it is publicly available - (I could email it
direct to those concerned as well).  Until then here is the output
requested by DB and RK as well as more information regarding sites,
species, and the data:

ds <- d[d$TYPE == "s", ]  # just sap fungi, as used in the example below

dm <- d[d$TYPE == "m", ]  # just myc fungi (treated as distinct populations in my analyses)

table(table(ds$SPECIES))

  0   1   2   3   5   6
393 161  45   7   1   1
# As seen, not all species are present at all sites - this is
partly why I thought to nest SPECIES within SITE.  The sites range
widely from boreal tundra, to temperate California, to tropical
forests of Borneo and Guyana.  Most fungi are not so cosmopolitan as
to even potentially be present everywhere - let alone be collected
during brief collecting forays.  One site (pine forest in CA) didn't
even list fungal names.  All fungi were listed as mycorrhizal (m),
saprotrophic (s), or of unknown ecological role (unk).  Most of the
genera (or families) are present at all sites, but that biological
coarseness concerns me.  If, as pointed out by RK, the ideal
situation would be every species present at every site, then perhaps
creating a new column containing families or genera will move the
dataset in that direction.

xtabs(~SITE,ds)

SITE
Aheden                                 4
Aheden 2                               0
Ashiu (temp deciduous broadleaf)      21
Betsele                                0
Breuil, France                        14
Chiba                                  6
d                                     23
Flakaliden                             0
Glacier Bay, Alaska                    4
Guyana                                20
h                                     38
heath tundra, subarctic Sweden         4
Kagoshima                              1
Kulbacksliden                          0
Kyoto                                 26
Lamar Haines                          13
Lambir (lowland tropical)             14
Miyajima                               1
Norikura                               3
Norrliden                              0
Okinawa                                0
Ontake (subalpine coniferous)          8
Oodai                                  2
pine forests in CA                    25
Shirahama                              2
Snowbowl                              13
Spruce plantation                     17
Stadsskogen                           13
Svartberget                            0
Tanigawa                               6
tussock tundra near Toolik Lake, AK    5
Vilan                                  0
Woods Creek                           25

# the zeros here are because some sites have no X13C data - only X15N

I did try running models with centered predictors, as suggested by
RK - see below.  I found that the AIC scores were much larger
however when the predictors were centered using my method.  I
centered by taking the mean of mycorrhizal and saprotrophic fungal
groups at each site - then I subtracted each (m) or (s) fungus from
those same group means within each site.  Because the (m) and (s)
fungi have unique sources of carbon and nitrogen, and these are
reflected in their isotope values (X13C, X15N), I deemed this
centering level to be appropriate in order to preserve the magnitude
of difference between the groups.

mix.model1  # Raw isotope values used

Linear mixed model fit by maximum likelihood
Formula: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
   Data: ds
  AIC   BIC    logLik     deviance  REMLdev
 1016 1041 -500.9     1002         1031
Random effects:
 Groups   Name        Variance Std.Dev.
 SPECIES  (Intercept) 0.62149  0.78835
 SITE     (Intercept) 0.34885  0.59063
 Residual             1.28911  1.13539
Number of obs: 283, groups: SPECIES, 215; SITE, 24

Fixed effects:
              Estimate Std. Error t value
(Intercept) -2.207e+01  1.121e+00 -19.695
MAT         -6.418e-02  3.833e-02  -1.675
MAP          3.137e-05  2.100e-04   0.149
LAT         -8.690e-03  1.778e-02  -0.489

Correlation of Fixed Effects:
          (Intr)       MAT     MAP
MAT  -0.707
MAP  -0.401  -0.247
LAT   -0.959  0.692   0.272
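
Note that the "Correlation of Fixed Effects" block reports correlations among the coefficient estimates, not among the predictors themselves. A strong intercept-slope correlation such as the -0.959 for LAT is what one would expect when a predictor's values lie far from zero. The sketch below reproduces the pattern with simulated data and lm() (both are assumptions made for illustration) and shows that centering the predictor removes it.

```r
set.seed(7)
n <- 300
LAT <- runif(n, 30, 70)                 # simulated latitudes, far from zero
y <- -22 - 0.01 * LAT + rnorm(n)

fit_raw <- lm(y ~ LAT)
cLAT <- LAT - mean(LAT)
fit_cen <- lm(y ~ cLAT)

# Correlation of the coefficient estimates, from their covariance matrix
cor_raw <- cov2cor(vcov(fit_raw))["(Intercept)", "LAT"]
cor_cen <- cov2cor(vcov(fit_cen))["(Intercept)", "cLAT"]

round(c(raw = cor_raw, centered = cor_cen), 3)
```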

mix.model1a # Group centered within each site

Linear mixed model fit by maximum likelihood
Formula: STND_13c ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
   Data: ds
  AIC   BIC    logLik     deviance  REMLdev
 2623 2649  -1305     2609         2620
Random effects:
 Groups   Name        Variance Std.Dev.
 SPECIES  (Intercept) 301.9720 17.3773
 SITE     (Intercept)   5.1294  2.2648
 Residual             327.3857 18.0938
Number of obs: 283, groups: SPECIES, 215; SITE, 24

Fixed effects:
             Estimate Std. Error t value
(Intercept) 3.463e+01  1.166e+01  2.9713
MAT         3.899e-01  4.330e-01  0.9006
MAP         2.543e-05  1.811e-03  0.0140
LAT         3.235e-01  1.867e-01  1.7324

Correlation of Fixed Effects:
         (Intr)       MAT     MAP
MAT -0.779
MAP -0.230  -0.300
LAT  -0.955   0.732   0.116

anova(mix.model1,mix.model1a)

Data: ds
Models:
mix.model1: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
mix.model1a: STND_13c ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
              Df      AIC      BIC    logLik  Chisq Chi Df Pr(>Chisq)
mix.model1.p   7  1015.85  1041.37   -500.93
mix.model1a.p  7  2623.49  2649.00  -1304.74      0      0  < 2.2e-16 ***

# As seen above, the model using my centered values performed
poorly.  My interpretation is that the centering removed the very
variability associated with climate I am trying to predict!
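
The interpretation above can be illustrated with a small simulation: centering a response within each site sets every site mean to zero, so a predictor that is constant within sites (such as a site's climate) has nothing left to explain. All names and values below are invented for the sketch, and lm() stands in for lmer(). Separately, likelihood-based criteria such as AIC are only comparable between models fitted to the same response, so the anova() of mix.model1 against mix.model1a is best not read as a test.

```r
set.seed(3)
site <- rep(c("boreal", "temperate", "tropical"), each = 20)
# One temperature value per site, repeated for every specimen at that site
site_MAT <- unname(c(boreal = -1, temperate = 12, tropical = 26)[site])
# Simulated response whose site means depend on temperature
X13C <- -22 - 0.1 * site_MAT + rnorm(length(site))

# Centering the response within each site removes the site means
X13C_c <- X13C - ave(X13C, site)

slope_raw <- unname(coef(lm(X13C ~ site_MAT))["site_MAT"])
slope_cen <- unname(coef(lm(X13C_c ~ site_MAT))["site_MAT"])
round(c(raw = slope_raw, centered = slope_cen), 3)
```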

#In addition, I compared the other models mentioned by RK using the
raw isotope values:

anova(mix.model1,mix.model2,mix.model3)

Data: ds
Models:
mix.model1: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
mix.model2: X13C ~ (MAT + MAP + LAT)^2 + (1 | SPECIES) + (1 | SITE)
mix.model3: X13C ~ MAT * MAP * LAT + (1 | SPECIES) + (1 | SITE)
             Df      AIC      BIC   logLik   Chisq Chi Df Pr(>Chisq)
mix.model1.p  7  1015.85  1041.37  -500.93
mix.model2.p 10   978.21  1014.66  -479.10  43.645      3  1.796e-09 ***
mix.model3.p 11   980.18  1020.28  -479.09   0.023      1     0.8794

# I have refrained from trying the models mentioned by DB in order
to protect his rights to not commit the "capital mistake" ;)
however, the three predictors (MAT, MAP, LAT) do indeed seem to
overplot - I will try to select only one perhaps in my final models.

# Again - thank you all for your help on this.

--
Jordan Mayor, Ph.D. Candidate
Ecosystem Dynamics Research Lab
Department of Botany, University of Florida
Gainesville, FL 32611

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
