Dear list
This seems like something I really should know by now, but I'm getting so
confused, I'd really appreciate a little help!
I am trying to model the relationship between relative abundance (%) and
relative cover (%) data for plant species. I want to know to
what extent the 2 measures correlate, and to compare the extent of this
correlation at different sites. Obviously, both sets of data are
zero-inflated and highly skewed.
The "traditional" thing to do would be to log-transform both of them and
use lm(). However, a recent paper (O'Hara & Kotze, 2010) argues that a
much better approach is to use glm() and to specify Poisson or negative
binomial models, rather than using transformations. This does make a lot
of sense, I think!
I have tried using "quasipoisson" and "quasibinomial" families in glm(),
but I am left with a number of questions:
1) Should relative abundance and relative cover be treated as "count"
data, given that the values are not actually integers but rather
percentages?
2) Which parts of the output of glm(...family=quasipoisson(link=log)) do I
use to evaluate the fit? Just residual deviance and the p value?
3) How do I plot the data so as to graphically represent the model? If I
am using a log link should I use log axes for x and y?
Thanks so much for any help!
Karen
---
Karen Kotschy
Centre for Water in the Environment
University of the Witwatersrand, Johannesburg
Tel: +2711 717-6425
Interesting paper by O'Hara and Kotze, but it deals with count data
rather than cover (compositional) data. Cover data are actually
a considerably harder problem to handle in the generalized linear model
case (alas), *unless* the data come from a point count of some sort
(i.e., where you know the 'denominator', the total number of counts
that would correspond to 100% cover), in which case you can use a
binomial GLM: see e.g. [Seavy, N. E., S. Quader, John D. Alexander, and
C. John Ralph. 2002. Generalized linear models and point count data:
statistical considerations for the design and analysis of monitoring
studies. In Bird Conservation Implementation and Integration in the
Americas: Proceedings of the Third International Partners in Flight
Conference, ed. C. John Ralph and Terrell D. Rich, 2:744-753. Asilomar,
CA: U.S. Dept. of Agriculture, Forest Service, Pacific Southwest
Research Station, March 20.
http://www.fs.fed.us/psw/publications/documents/psw_gtr191/psw_gtr191_0744-0753_seavy.pdf.]
The natural (to a statistician) way to deal with this would be via
beta regression [Smithson, Michael, and Jay Verkuilen. 2006. A better
lemon squeezer? Maximum-likelihood regression with beta-distributed
dependent variables. Psychological Methods 11, no. 1 (March): 54-71.
doi:2006-03820-004.] Beta distributions are a natural description of
cover -- they are distributions defined on the open interval (0,1) with a
simple mathematical description, and can be fitted similarly to GLMs [see the
'betareg' package on CRAN]. I think I heard a talk at ESA a few years
ago that used beta regression (or maybe I just thought it should have
used beta regression). There's one big problem, though -- zeros do not
naturally fit into the statistical framework, so you have to do some
kind of ad hoc fix for this (this is discussed, briefly, in Smithson and
Verkuilen 2006). I looked for ecology papers that used beta regression
or cited SV2006, but didn't find very many (see below).
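A minimal sketch of the beta regression route, using simulated data rather than anything from the original question. The variable names (abundance, cover, cover_sq) and the simulation settings are assumptions made for illustration. The zero fix shown is the rescaling described by Smithson and Verkuilen (2006), y' = (y(n-1) + 0.5)/n, which is one ad hoc option among several. The model fit only runs if the CRAN package 'betareg' is installed.

```r
# Sketch only: simulated cover proportions, not real field data.
set.seed(1)
n <- 60
abundance <- runif(n)                       # hypothetical predictor in [0, 1]
mu <- plogis(-2 + 3 * abundance)            # assumed true mean cover
cover <- rbeta(n, mu * 10, (1 - mu) * 10)   # beta-distributed cover proportions
cover[sample(n, 8)] <- 0                    # introduce some exact zeros

# Smithson & Verkuilen (2006) rescaling: pulls 0 and 1 into the open interval
squeeze <- function(y) (y * (length(y) - 1) + 0.5) / length(y)
cover_sq <- squeeze(cover)
range(cover_sq)                             # now strictly between 0 and 1

d <- data.frame(abundance = abundance, cover_sq = cover_sq)

# Fit the beta regression only if 'betareg' is available
if (requireNamespace("betareg", quietly = TRUE)) {
  fit <- betareg::betareg(cover_sq ~ abundance, data = d)
  print(summary(fit))
}
```

With many zeros the rescaling is doing a lot of work, so it is worth checking how sensitive the coefficients are to it.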
If you have a point count statistic, I would analyze your data in
terms of 'number of points occupied out of total census points', with a
binomial or quasibinomial model. If they are assessed in some other way
where there is no natural denominator, I would either (sigh) use
transformations or look into beta regression.
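For the point-count case, a rough sketch of what the quasibinomial fit might look like, again on simulated data. The names (hits, total, abund) and the choice of 100 census points per plot are assumptions, not details from the original post. It also touches on the fit and plotting questions: the dispersion estimate and an F test are the usual things to look at for quasi families, and predict(type = "response") returns fitted values on the original scale, so no transformed axes are needed.

```r
# Sketch only: simulated point-intercept data.
set.seed(2)
n_plots <- 40
total <- rep(100, n_plots)                   # assumed: 100 census points per plot
abund <- runif(n_plots, 0, 50)               # hypothetical relative abundance (%)
hits  <- rbinom(n_plots, total, plogis(-3 + 0.08 * abund))
d <- data.frame(abund = abund, hits = hits, total = total)

# Response is (points occupied, points not occupied)
fit <- glm(cbind(hits, total - hits) ~ abund,
           family = quasibinomial, data = d)
print(summary(fit))                          # coefficients, dispersion, deviance

# Estimated dispersion; values well above 1 suggest overdispersion
phi <- summary(fit)$dispersion

# F test is the appropriate test for quasi-likelihood families
print(anova(fit, test = "F"))

# Plot raw proportions with the fitted curve on the response scale
new <- data.frame(abund = seq(0, 50, length.out = 100))
new$p <- predict(fit, newdata = new, type = "response")
plot(hits / total ~ abund, data = d,
     xlab = "Relative abundance (%)",
     ylab = "Cover (proportion of points occupied)")
lines(p ~ abund, data = new)
```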
I'd be interested to hear other opinions.
good luck,
Ben Bolker
Boughton, Elizabeth H., Pedro F. Quintana-Ascencio, and Patrick J.
Bohlen. 2010. Refuge effects of Juncus effusus in grazed, subtropical
wetland plant communities. Plant Ecology (9).
doi:10.1007/s11258-010-9836-4.
http://www.springerlink.com/content/u18v4526k10uw2p1/.
Irvine, Kathryn M., and Thomas J. Rodhouse. 2010. Power analysis for
trend in ordinal cover classes: implications for long-term vegetation
monitoring. Journal of Vegetation Science (8).
doi:10.1111/j.1654-1103.2010.01214.x.
http://onlinelibrary.wiley.com/doi/10.1111/j.1654-1103.2010.01214.x/full.
Royo, Alejandro A., Ramona Bates, and Elizabeth P. Lacey. 2008.
Demographic constraints in three populations of Lobelia boykinii: a rare
wetland endemic. The Journal of the Torrey Botanical Society 135, no. 2
(4): 189-199. doi:10.3159/07-RA-039.1.
http://www.bioone.org/doi/abs/10.3159/07-RA-039.1?cookieSet=1&prevSearch=.
Dear Karen,
I was recently confronted with a similar problem, see paper:
http://www.elaliberte.info/Laliberte_et_al_2010_RangEcolManag.pdf?attredirects=0
We ended up using major axis regression on transformed data, among other
things. Then we simply plotted the relative abundance vs relative cover
of different species and compared against the 1:1 line.
I do realize that this is simplistic, a bit ad hoc and not very pretty
(in part because normality is assumed with MA regression). That said, I
thought it did allow us to quickly see which sampling method
over/under-estimates different species, which was the main goal. But I'd
be interested in knowing what approach you end up using.
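A minimal sketch of a major axis fit compared against the 1:1 line, on simulated data. This is not the code from the paper: the log(x + 1) transformation, the variable names, and the simulation settings are all assumptions for illustration. The major axis is taken as the first principal axis of the covariance matrix of the two transformed variables, computed with base R only.

```r
# Sketch only: simulated species-level data.
set.seed(3)
n_sp  <- 30
abund <- rexp(n_sp, 1 / 10)                  # hypothetical relative abundance (%)
cover <- abund * exp(rnorm(n_sp, 0, 0.3))    # hypothetical relative cover (%)

x <- log1p(abund)                            # assumed transformation: log(x + 1)
y <- log1p(cover)

# Major axis = first eigenvector of the covariance matrix of (x, y)
S <- cov(cbind(x, y))
v <- eigen(S)$vectors[, 1]
slope     <- v[2] / v[1]
intercept <- mean(y) - slope * mean(x)       # MA line passes through the centroid

plot(x, y, xlab = "log(relative abundance + 1)",
     ylab = "log(relative cover + 1)")
abline(0, 1, lty = 2)                        # 1:1 line
abline(intercept, slope)                     # major axis line
```

Species falling above the dashed line have higher (transformed) cover than abundance, and vice versa.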
If anything, you could cite that rather unexciting paper as a good
example of what the "bad approach" is -- it may end up being the only
time it ever gets cited! :)
Cheers
Etienne
Etienne Laliberté
================================
School of Plant Biology, M090
The University of Western Australia
35 Stirling Highway
Crawley, Perth
Western Australia 6009
Phone: +61 8 6488 2214
www.elaliberte.info
Dear Karen,
seconding the comments of Phil and Etienne: one key question is whether
you can assume no error in the values of your predictor (i.e. run a
Model I regression). If you can, Ben Bolker's comments point in the
right direction; if you cannot, my heart goes out to Etienne's
"simplistic" approach, and I would try to back up your results with a bit
of "robustness testing".
(E.g. perturb/jitter your values and see if it makes a difference to
your regression. This may not be "official" stats, but should show clear
differences when the pattern is not robust. For example, the many 0s in
your data may be caused by detection problems (rather than true
absences) and hence giving them a random low cover/abundance (e.g. 1/2
of the respective minimum value) should NOT change your results. If it
does, I would interpret this as the data not supporting a clear
correlation between abundance and cover.)
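One way the zero-perturbation check could be sketched, on simulated data. The helper fill_zeros, the choice to draw replacement values between a quarter and a half of the smallest non-zero value, and the use of a correlation on log-transformed values are all assumptions for illustration, not a prescribed procedure.

```r
# Sketch only: simulated data with zeros in both measures.
set.seed(4)
n <- 50
abund <- rexp(n, 1 / 8)                      # hypothetical relative abundance (%)
cover <- abund * exp(rnorm(n, 0, 0.4))       # hypothetical relative cover (%)
abund[1:8]  <- 0                             # zeros, possibly non-detections
cover[5:12] <- 0

# Hypothetical helper: replace zeros with random low values, at most
# half of the smallest non-zero value in the vector
fill_zeros <- function(v) {
  half_min <- min(v[v > 0]) / 2
  z <- v == 0
  v[z] <- runif(sum(z), half_min / 2, half_min)
  v
}

# Repeat the perturbation and record the correlation each time
r <- replicate(200, {
  x <- log(fill_zeros(abund))
  y <- log(fill_zeros(cover))
  cor(x, y)
})

# A narrow spread suggests the pattern does not hinge on the zeros
print(quantile(r, c(0.025, 0.5, 0.975)))
```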
HTH,
Carsten
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ
(Department Landschaftsökologie)
(Helmholtz Zentrum für Umweltforschung - UFZ)
Permoserstr. 15
04318 Leipzig
Germany
Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
Email: carsten.dormann at ufz.de
internet: http://www.ufz.de/index.php?de=4205