Modelling proportion data in lme4

Dear Adriana,
Dear all,

I'd be really grateful if someone could advise on the following issue I've come across.

I have proportion data (non-integer, bounded between 0 and 1) as my
Do you actually have some 0s? Most of the rest of my answer assumes you do.
response variable, in a model that requires nested random effects and
weights, which makes lme4 the ideal choice. Using lme4 with a binomial
You might want to take a look at:

http://stats.stackexchange.com/questions/81343/response-variable-percentage-and-too-many-zeros-zero-inflated-poisson

http://stats.stackexchange.com/questions/142038/two-part-models-in-r-continuous-outcome-with-too-many-zeros

http://stats.stackexchange.com/questions/142013/correct-glmer-distribution-family-and-link-for-a-continuous-zero-inflated-data-s/

and this R-help question (referred from the above questions, e.g. http://stats.stackexchange.com/a/81347):

https://stat.ethz.ch/pipermail/r-help/2005-January/065070.html

where using a Tweedie model is suggested.

The cplm CRAN package, by W. Zhang:
https://cran.r-project.org/web/packages/cplm/index.html

will fit mixed-effects Tweedies.

I'd suggesting checking the vignetted of the cplm package, as well as
Zhang's paper

http://link.springer.com/10.1007/s11222-012-9343-7

and Dunn and Smyth's 2005 paper, which contains examples that use the
Tweedie distribution, as well as several references in the literature where
these models have been used:

https://link.springer.com/article/10.1007/s11222-005-4070-y

Take all of this advice with a grain (or two) of salt, but in somewhat
similar cases, and when I had a structure of replicates that allowed me to
examine the relationship between mean and variance in the response, I have
used it to help me decide whether a Tweedie was, or not, a reasonable
choice compared to other options; for instance, with the Tweedie model we'd
expect to see a linear slope between log(variance) and log(mean), with the
slope, p, being the exponent in the relationship V(mu) = mu^p (see, e.g.,
Figure 3 in the paper by Dunn and Smyth).
error structure and logit link seems to produce reasonable (and realistic
looking) results, and the residual plots look good. However, it warns me
that the error structure expects integer data, and I don't know whether
this approach is doing what I think (and hope) that it is doing. I have
tried to validate the lme4 results in the following ways:

1.  Running the same method (binomial error structure and logit link with
the proportions as the response variable) with glmmADMB. This produces
very different results (they are completely unrealistic, e.g. predicted
proportion of 2.16e-34).

2.  Using beta regression with glmmADMB. This seems to work and produce
results that are on the same scale, but not that close to those of lme4.

3.  Running an lme4 model with normal errors (lmer), after
logit-transforming the response variable. This again gives quite
different results to the lme4 model with binomial error structure and
logit link (and the behaviour of the residuals is not ideal).

Since these all give different results, it's hard to tell whether the
lme4 method I've used is giving the 'right' answer. I would be really
grateful for any advice. Is lme4 correctly analysing the proportion data
when a binomial error structure and logit link are specified?

Additional note: the proportion data are compositional similarity
measurements (Jaccard assymetric abundance-based compositional
similarity), so technically there is a numerator and denominator
(numerator = abundance of species in Site 1 that are also present in Site
2; denominator = abundance of all species in Site 1). I've been exploring
different weights options, but they generally include the denominator.
A couple of comments here:

1. I am not sure those proportion data can always be modelled as binomial.
Is the numerator a quantity we can think of as arising from a number of
independent trials, where the denominator is that number of independent
trials?

2. You might consider modeling the numerator using the denominator not as
denominator but as a covariate. This has the advantage of allowing you to
examine different possible relationships such as

Numerator ~  Denominator + other stuff

but also

Numerator ~ poly(Denominator, 2) + other stuff

or

Numerator ~ bs(Denominator) + other stuff

and just generally things like

Numerator ~ some_function_of(Denominator, some_other_covariates)

such as

Numerator ~ poly(Denominator, 2) * some_covariate

etc.

When you do

Numerator/Denominator ~ other stuff

you are committing yourself to one particular form of that relationship
(which might not be easy to reason about).

Best,

R.
Many thanks in advance,

Adriana

_____

Adriana De Palma
PREDICTS Postdoctoral Research Assistant
Natural History Museum
South Kensington

Web: The Purvis Lab<http://www.bio.ic.ac.uk/research/apurvis/ajpurvis.htm> | PREDICTS<predicts.org.uk>

	[[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Aut?noma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz

Modelling proportion data in lme4

Thread (4 messages)