is multicollinearity of fixed effects resolved by random effects
Disclaimer: The following recommendation of a sequence of steps is not the only one, perhaps not even the best one. It has worked for me in the past. First, I assume - that centering was done with SCALE=FALSE, - that you checked that linearity is defensible for the relation between X13C and your three predictors (i.e., that you do not need quadratic terms for some of them), - that you used method = "ML" for comparison of models with identical random but different fixed effect parts and you read up on the some of the complications associated with such comparisons discussed on this list. (You can use method="REML" for comparison of models differing only in the random effects, again there are some qualifications.) Second, I recommend that now you focus on the fixed-effects part. It still is a bit random (pun intended).
d.mod1: X13C ~ cMAT * cMAP * cLAT + TYPE + (1 | SITE) + (1 | SPECIES) d.mod1a: X13C ~ (cMAT+ cMAP+ cLAT)^2 + TYPE + (1 | SITE) + (1 | SPECIES)
If taking out the three.factor interaction does no harm, you may want to remove the non-significant two-factor interactions. After those are gone, you may want to check whether there is a non-significant main effect that is not part of an interaction (cMAP?). Then you may want to take this one out, too. Such a hiearchical dropping out of effects may lead you to your current favorite: X13C ~ cMAT * cLAT + TYPE + (1 | SITE) + (1 | SPECIES) which expands to: X13C ~ cMAT+ cLAT + cMAT:cLAT+ TYPE + (1 | SITE) + (1 | SPECIES) Note there are often good arguments for keeping theoretically interesting effects in the model, even if they are not significant! Third, you should plot your effects to get a good idea about the source of the interaction. Finally, you specify various random-effects parts and try to understand what they mean (see previous posts to this list), for example: (1|SITE) (1|GENUS) (1|GENUS) + (1|SITE) (1|GENUS) + (1|GENUS:SITE) equivalent to (1|GENUS/SITE) Then allow the significant fixed effects to vary for the random effects. Reinhold Kliegl
On Tue, May 20, 2008 at 1:55 AM, Jordan Mayor <clavulina at gmail.com> wrote:
#Thanks for all of your comments. I have done three new things to my fungal data since last posting my comments: 1) I realized that in my previous email I had centered my explanatory variables, not predictors (sorry - I had previously centered the isotope data for discriminant analyses), which I have now fixed (e.g. cMAT, cMAP, cLAT), 2) I have include TYPE as a fixed effect instead of creating separate models for both mycorrhizal and saprotrophic fungi to simplify, and 3) I have decided to compare model sets using the GENUS instead of SPECIES as random effects in the following lmer models and both sets converged on the same model according to AIC scores. Justifications for the above actions: 1) Fixed previous error. The centered predictors have a much lower correlation with one another now suggesting I may have finally solved the severe multicollinearity effect???? Not sure if this is a correct interpretation however.
center=lm(cMAT~cLAT,data=dGen) summary(center)
Call:
lm(formula = cMAT ~ cLAT, data = dGen)
Residuals:
Min 1Q Median 3Q Max
-17.817 -3.206 1.229 2.229 10.313
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.920757 0.181696 -5.068 4.88e-07 ***
cLAT -0.213731 0.009673 -22.097 < 2e-16 ***
Residual standard error: 5.466 on 911 degrees of freedom
Multiple R-squared: 0.3489, Adjusted R-squared: 0.3482
F-statistic: 488.3 on 1 and 911 DF, p-value: < 2.2e-16
2) TYPE provides good a priori knowledge about the physiology of fungi (i.e.
trophic role in the ecosystem) and this is reflected in my dependent
variables whose variance I am t -> the carbon and nitrogen isotopes in fungi
3) Fungal genera are much easier to taxonomically identify and thus are less
prone to collector misidentification, regional biases, or spelling errors!
The family level may be too coarse for ecological comparisons because many
genera within families can form both of the trophic roles I am trying to
model. I don't see any reason to use GENERA as a fixed effect however,
because individual life histories (host plant/tissue, fungal age, site
fertility, site stress) at each SITE could modify isotope values in very
unknown (random) ways. I will defiantly look into using the "ape" package
in future research, thanks for pointing this out Simon, and yes there is now
a phylogeny for fungi to which this could be applied - see (Blackwell et al.
2006 Mycologia 98:829-837, Hibbett et al. 2007Mycological Research
111:509-547) if anyone is interested.
Using TYPE as fixed and GENERA as random provided somewhat better dispersion
across sites as evidenced in the following table previously requested by
Douglas Bates:
table(table(dGen$GENUS))
0 1 2 3 4 5 6 7 8 9 10 11 13 15 16 17 19 25 28 35 54 86 97 1 63 21 15 6 5 9 7 1 3 2 2 1 2 1 2 1 1 1 2 1 1 1
xtabs(~ SITE, dGen)
SITE
Aheden Aheden 2
33 53
Ashiu (temp deciduous broadleaf) Betsele
40 5
Breuil, France Chiba
47 9
d Flakaliden
87 21
Glacier Bay, Alaska Guyana
8 49
h heath
tundra, subarctic Sweden
92 14
Kagoshima Kulbacksliden
1 8
Kyoto Lamar
Haines
54 25
Lambir (lowland tropical) Miyajima
31 2
Norikura Norrliden
12 1
Okinawa Ontake (subalpine
coniferous)
1 17
Oodai pine forests
in CA
2 43
Shirahama Snowbowl
3 22
Spruce plantation Stadsskogen
37 123
Svartberget Tanigawa
5 8
tussock tundra near Toolik Lake, AK Vilan
8 7
Woods Creek
45
The new models are as follows:
Models with SPECIES included:
d.mod0: X13C ~ TYPE + (1 | SITE) + (1 | SPECIES)
d.mod5: X13C ~ cMAT + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod6: X13C ~ cMAP + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod7: X13C ~ cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod8: X13C ~ cMAT * cLAT + (1 | SITE) + (1 | SPECIES)
d.mod2: X13C ~ cMAT * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod3: X13C ~ cMAT * cMAP + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod4: X13C ~ cMAP * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
d.mod1: X13C ~ cMAT * cMAP * cLAT + TYPE + (1 | SITE) + (1 | SPECIES)
Df AIC BIC logLik Chisq Chi Df
Pr(>Chisq)
d.mod0.p 6 2594.8 2622.8 -1291.4
d.mod5.p 7 2596.4 2629.1 -1291.2 0.4031 1 0.5255178
d.mod6.p 7 2596.7 2629.5 -1291.3 0.0000 0 < 2.2e-16 ***
d.mod7.p 7 2596.7 2629.5 -1291.4 0.0000 0 < 2.2e-16 ***
d.mod8.p 7 2926.1 2958.9 -1456.1 0.0000 0 < 2.2e-16 ***
d.mod2.p 9 2563.2 2605.4 -1272.6 366.8690 2 < 2.2e-16 ***
Best fit
d.mod3.p 9 2585.3 2627.4 -1283.6 0.0000 0 < 2.2e-16 ***
d.mod4.p 9 2584.6 2626.7 -1283.3 0.6785 0 < 2.2e-16 ***
d.mod1.p 13 2569.9 2630.8 -1272.0 22.6361 4 0.0001497 ***
Those same models with GENERA instead of SPECIES:
Df AIC BIC logLik Chisq Chi
Df Pr(>Chisq)
dGen.mod0.p 5 2428.6 2451.8 -1209.3
dGen.mod5.p 6 2430.0 2457.7 -1209.0 0.6296 1 0.427500
dGen.mod6.p 6 2430.6 2458.4 -1209.3 0.0000 0 < 2.2e-16 ***
dGen.mod7.p 6 2430.6 2458.4 -1209.3 0.0000 0 < 2.2e-16 ***
dGen.mod8.p 7 2515.3 2547.6 -1250.6 0.0000 1 1.000000
dGen.mod2.p 8 2401.6 2438.6 -1192.8 115.6470 1 < 2.2e-16 ***
Best fit again
dGen.mod3.p 8 2418.6 2455.5 -1201.3 0.0000 0 < 2.2e-16 ***
dGen.mod4.p 8 2418.4 2455.3 -1201.2 0.2163 0 < 2.2e-16 ***
dGen.mod1.p 12 2408.8 2464.3 -1192.4 17.5047 4 0.001542 **
I am still uncertain if I have modeled my random effects properly however
but is seems that Model 2 is robust regardless if I model Random effects as
(1|SITE/GENUS) or (1|SITE) + (1|GENUS). BUT when comparing the two best fit
models to each other, the one with more df is significantly different
suggesting I should choose the simpler of the two without strong evidence to
support nesting.
Models:
dGen.mod2.1: X13C ~ cMAT * cLAT + TYPE + (1 | SITE:GENUS)
dGen.mod2: X13C ~ cMAT * cLAT + TYPE + (1 | SITE/GENUS)
Df AIC BIC logLik
Chisq Chi Df Pr(>Chisq)
dGen.mod2.1.p 7 2421.2 2453.6 -1203.6
dGen.mod2.p 8 2417.9 2454.8 -1200.9 5.3363
1 0.02089 *
Sorry if I have overloaded you all with output - I just thought some would
be interested and I wanted to follow up. I would be very interested in
knowing if anyone has any comments on my interpretations or could suggest
further model configurations. Thank you very much.
Jordan Mayor
On Sun, May 18, 2008 at 8:33 PM, Simon Blomberg <s.blomberg1 at uq.edu.au>
wrote:
It is easy to incorporate phylogenetic correlations among species
("taxonomic factors"), using the ape package and lme. This is far better
than combining species into families or genera, as taxonomic heirarchies
are subjective, artificial, and rarely represent the true phylogeny. I
strongly disagree that taxonomic factors necessarily function as fixed
effects. The phylogeny represents what we think we know about the
covariance among species in all traits, measured or unmeasured. I can't
see how unmeasured traits or poorly-defined "taxonomic factors" can
possibly be included as fixed effects. If particular measured traits are
thought to be important in determining the mean response, they should be
included as fixed effects, but phylogenies are used to model the
covariance, not the mean. Unusual species should stand out by
examination of the normalized residuals.
My approach would be to ditch species as a factor altogether, and
incorporate phylogenetic effects through the correlation argument in lme
(or gls if there are no other random effects). This assumes there is an
available phylogeny for the fungi, which may not be true.
There is a very large literature on incorporating phylogeny into
analyses (to which I am afraid I am a small contributor).
Simon.
On Mon, 2008-05-19 at 10:02 +1000, John Maindonald wrote:
I think grouping them into families is a very good idea. This makes use of prior insight, reduces the number of parameters to an extent that it becomes more reasonable to think about fixed effects, and you can look for individual species that stray from the path laid out for them by their families. For the fixed effects analyses that I suggested, you might do these by families. Conceptually, there are taxonomic factors that surely function, no doubt in interaction with location variables, as fixed effects. I consider that one ought to start by thinking of them as fixed effects, unless it can be demonstrated that data are indistinguishable from random variation. Maybe however those effects operate more as the level of genera or families than at the level of species. Responding to this point may be a useful aim for the study. John. John Maindonald email: john.maindonald at anu.edu.au phone : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for Mathematics & Its Applications, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. On 19 May 2008, at 4:04 AM, Jordan Mayor wrote:
Thank you JM, RK, and DB for your thoughtful responses to my questions. I am working on posting the file on the web and will respond as soon as it is publicly available - (I could email it direct to those concerned as well). Until then here is the output requested by DB and RK as well as more information regarding sites, species, and the data:
ds=d[d$TYPE=="s",] #separates into just sap fungi as used in the
example below
dm=d[d$TYPE=="m",] #separates into just myc fungi (treated as
distinct populations in my analyses)
table(table(ds$SPECIES))
0 1 2 3 5 6
393 161 45 7 1 1
# As seen all species are not present at all the sites - this is
partly why I thought to nest SPECIES within SITE. The sites range
widely from boreal tundra, to temperate California, to tropical
forests of Borneo and Guyana. Most fungi are not so cosmopolitan as
to even potentially be present everywhere - let alone be collected
during brief collecting forays. One site (pine forest in CA) didn't
even list fungal names. All fungi were listed as mycorrhizal (m),
saprotrophic (s), or of unknown ecological roles (unk). Most of the
genera (or families) are present at all sites but that biologically
coarseness concerns me. If, as pointed out by RK the ideal
situation would be every species present at every site then perhaps
creating a new column containing families or genera will move the
dataset toward that direction.
xtabs(~SITE,ds)
SITE
Aheden
Aheden 2
4 0
Ashiu (temp deciduous broadleaf) Betsele
21 0
Breuil, France Chiba
14 6
d
Flakaliden
23 0
Glacier Bay, Alaska Guyana
4
20
h heath
tundra, subarctic Sweden
38 4
Kagoshima
Kulbacksliden
1 0
Kyoto Lamar
Haines
26
13
Lambir (lowland tropical) Miyajima
14 1
Norikura Norrliden
3 0
Okinawa Ontake (subalpine
coniferous)
0 8
Oodai pine forests in
CA
2
25
Shirahama nowbowl
2
13
Spruce plantation Stadsskogen
17
13
Svartberget Tanigawa
0 6
tussock tundra near Toolik Lake, AK Vilan
5 0
Woods Creek
25 # the zeros
here are because some sites have no X13C data - only X15N
I did try running models with centered predictors, as suggested by
RK - see below. I found that the AIC scores were much larger
however when the predictors were centered using my method. I
centered by taking the mean of mycorrhizal and saprotrophic fungal
groups at each site - then I subtracted each (m) or (s) fungus from
those same group means within each site. Because the (m) and (s)
fungi have unique sources of carbon, and nitrogen and these are
reflected in their isotope values (X13C, X15N), I deemed this
centering level to be appropriatein order to preserve the magnitude
of difference between the groups.
mix.model1 # Raw isotope values used
Linear mixed model fit by maximum likelihood
Formula: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
Data: ds
AIC BIC logLik deviance REMLdev
1016 1041 -500.9 1002 1031
Random effects:
Groups Name Variance Std.Dev.
SPECIES (Intercept) 0.62149 0.78835
SITE (Intercept) 0.34885 0.59063
Residual 1.28911 1.13539
Number of obs: 283, groups: SPECIES, 215; SITE, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.207e+01 1.121e+00 -19.695
MAT -6.418e-02 3.833e-02 -1.675
MAP 3.137e-05 2.100e-04 0.149
LAT -8.690e-03 1.778e-02 -0.489
Correlation of Fixed Effects:
(Intr) MAT MAP
MAT -0.707
MAP -0.401 -0.247
LAT -0.959 0.692 0.272
mix.model1a # Group centered within each site
Linear mixed model fit by maximum likelihood
Formula: STND_13c ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
Data: ds
AIC BIC logLik deviance REMLdev
2623 2649 -1305 2609 2620
Random effects:
Groups Name Variance Std.Dev.
SPECIES (Intercept) 301.9720 17.3773
SITE (Intercept) 5.1294 2.2648
Residual 327.3857 18.0938
Number of obs: 283, groups: SPECIES, 215; SITE, 24
Fixed effects:
Estimate Std. Error t value
(Intercept) 3.463e+01 1.166e+01 2.9713
MAT 3.899e-01 4.330e-01 0.9006
MAP 2.543e-05 1.811e-03 0.0140
LAT 3.235e-01 1.867e-01 1.7324
Correlation of Fixed Effects:
(Intr) MAT MAP
MAT -0.779
MAP -0.230 -0.300
LAT -0.955 0.732 0.116
anova(mix.model1,mix.model1a)
Data: ds
Models:
mix.model1: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
mix.model1a: STND_13c ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
Df AIC BIC logLik
Chisq Chi Df Pr(>Chisq)
mix.model1.p 7 1015.85 1041.37 -500.93
mix.model1a.p 7 2623.49 2649.00 -1304.74 0 0 <
2.2e-16 ***
# As seen above, the model using my centered values performed
poorly. My interpretation is that the centering removed the very
variability associated with climate I am trying to predict!
#In addition, I compared the other models mentioned by RK using the
raw isotope values:
anova(mix.model1,mix.model2,mix.model3)
Data: ds
Models:
mix.model1: X13C ~ MAT + MAP + LAT + (1 | SPECIES) + (1 | SITE)
mix.model2: X13C ~ (MAT + MAP + LAT)^2 + (1 | SPECIES) + (1 | SITE)
mix.model3: X13C ~ MAT * MAP * LAT + (1 | SPECIES) + (1 | SITE)
Df AIC BIC
logLik Chisq Chi Df Pr(>Chisq)
mix.model1.p 7 1015.85 1041.37 -500.93
mix.model2.p 10 978.21 1014.66 -479.10 43.645
3 1.796e-09 ***
mix.model3.p 11 980.18 1020.28 -479.09 0.023
1 0.8794
# I have refrained from trying the models mentioned by DB in order
to protect his rights to not commit the "capital mistake" ;)
however, the three predictors (MAT, MAP, LAT) do indeed seem to
overplot - I will try to select only one perhaps in my final models.
# Again - thank you all for your help on this.
--
Jordan Mayor, Ph.D. Candidate
Ecosystem Dynamics Research Lab
Department of Botany, University of Florida
Gainesville, FL 32611
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- Simon Blomberg, BSc (Hons), PhD, MAppStat. Lecturer and Consultant Statistician Faculty of Biological and Chemical Sciences The University of Queensland St. Lucia Queensland 4072 Australia Room 320 Goddard Building (8) T: +61 7 3365 2506 http://www.uq.edu.au/~uqsblomb email: S.Blomberg1_at_uq.edu.au Policies: 1. I will NOT analyse your data for you. 2. Your deadline is your problem. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. - John Tukey.