
Modelling proportion data in lme4

Dear Adriana,
On Thu, 30-03-2017, at 09:41, Adriana De Palma <A.De-Palma at nhm.ac.uk> wrote:
Do you actually have some 0s? Most of the rest of my answer assumes you do.
You might want to take a look at:


http://stats.stackexchange.com/questions/81343/response-variable-percentage-and-too-many-zeros-zero-inflated-poisson

http://stats.stackexchange.com/questions/142038/two-part-models-in-r-continuous-outcome-with-too-many-zeros

http://stats.stackexchange.com/questions/142013/correct-glmer-distribution-family-and-link-for-a-continuous-zero-inflated-data-s/

and this R-help question (referred from the above questions, e.g. http://stats.stackexchange.com/a/81347):

https://stat.ethz.ch/pipermail/r-help/2005-January/065070.html

where using a Tweedie model is suggested.


The cplm CRAN package, by W. Zhang:
https://cran.r-project.org/web/packages/cplm/index.html

will fit mixed-effects Tweedies.


I'd suggest checking the vignette of the cplm package, as well as
Zhang's paper

http://link.springer.com/10.1007/s11222-012-9343-7


and Dunn and Smyth's 2005 paper, which contains examples that use the
Tweedie distribution, as well as several references in the literature where
these models have been used:

https://link.springer.com/article/10.1007/s11222-005-4070-y



Take all of this advice with a grain (or two) of salt. In somewhat similar
cases, when I had a structure of replicates that allowed me to examine the
relationship between mean and variance in the response, I have used that
relationship to help me decide whether or not a Tweedie was a reasonable
choice compared to other options. For instance, with the Tweedie model we'd
expect a linear relationship between log(variance) and log(mean), with the
slope, p, being the exponent in the relationship V(mu) = mu^p (see, e.g.,
Figure 3 in the paper by Dunn and Smyth).
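
A minimal sketch of that mean-variance check, on simulated data only. The
helper sim_tweedie(), the values of p_true and phi, and the group sizes are
all made up for illustration; with real data you would skip the simulation
and compute the per-group means and variances from your own replicates.

```r
## Sketch: is log(variance) vs. log(mean) roughly linear, and what is the slope?
## All names and parameter values below are hypothetical.
set.seed(1)

## Simulate compound Poisson-gamma responses (Tweedie with 1 < p < 2),
## with mean mu and variance phi * mu^p; exact zeros occur whenever N == 0.
sim_tweedie <- function(n, mu, p, phi) {
  lambda <- mu^(2 - p) / (phi * (2 - p))   # Poisson rate
  alpha  <- (2 - p) / (p - 1)              # gamma shape of each summand
  gscale <- phi * (p - 1) * mu^(p - 1)     # gamma scale of each summand
  N <- rpois(n, lambda)
  y <- numeric(n)
  pos <- N > 0
  ## a sum of N gammas with common scale is gamma with shape N * alpha
  y[pos] <- rgamma(sum(pos), shape = N[pos] * alpha, scale = gscale)
  y
}

p_true <- 1.5
phi    <- 1
mus    <- exp(seq(0, 3, length.out = 30))  # one true mean per replicate group
dat <- do.call(rbind, lapply(seq_along(mus), function(g)
  data.frame(group = g, y = sim_tweedie(200, mus[g], p_true, phi))))

## Per-group sample mean and variance
grp_mean <- tapply(dat$y, dat$group, mean)
grp_var  <- tapply(dat$y, dat$group, var)

## Slope of log(variance) on log(mean) is a rough estimate of p
mv_fit <- lm(log(grp_var) ~ log(grp_mean))
p_hat  <- unname(coef(mv_fit)[2])
prop_zero <- mean(dat$y == 0)              # fraction of exact zeros

## To look at it: plot(log(grp_mean), log(grp_var)); abline(mv_fit)
```

A slope clearly between 1 and 2, together with some exact zeros, would be
compatible with a compound Poisson-gamma Tweedie; a slope near 1 or near 2
would point more towards Poisson-like or gamma-like behaviour. This is only
an informal diagnostic, not a test.
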
A couple of comments here:

1. I am not sure those proportion data can always be modelled as binomial.
Is the numerator a quantity we can think of as arising from a number of
independent trials, where the denominator is that number of independent
trials?


2. You might consider modelling the numerator directly, with the denominator
entering the model as a covariate rather than as a divisor. This has the
advantage of allowing you to examine different possible relationships such as

Numerator ~  Denominator + other stuff

but also

Numerator ~ poly(Denominator, 2) + other stuff

or

Numerator ~ bs(Denominator) + other stuff


and just generally things like


Numerator ~ some_function_of(Denominator, some_other_covariates)

such as

Numerator ~ poly(Denominator, 2) * some_covariate


etc.


When you do

Numerator/Denominator ~ other stuff

you are committing yourself to one particular form of that relationship
(which might not be easy to reason about).
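
As a rough illustration of the formulas above, here is a sketch on simulated
data. The data frame d, the covariate x, and the coefficients used to
generate Numerator are invented for the example, and plain lm() is used only
to keep it short; in a mixed model the same terms would go in the
fixed-effects part of the formula.

```r
## Sketch: denominator as a covariate vs. as a divisor (simulated data).
## Variable names and generating coefficients are hypothetical.
library(splines)  # for bs(); ships with R
set.seed(2)

n <- 200
d <- data.frame(Denominator = runif(n, 1, 10), x = rnorm(n))
## True relationship is quadratic in Denominator
d$Numerator <- 2 + 0.5 * d$Denominator + 0.3 * d$Denominator^2 +
  d$x + rnorm(n)

## Denominator as a covariate, under several functional forms
fit_lin  <- lm(Numerator ~ Denominator + x, data = d)
fit_quad <- lm(Numerator ~ poly(Denominator, 2) + x, data = d)
fit_bs   <- lm(Numerator ~ bs(Denominator) + x, data = d)
fit_int  <- lm(Numerator ~ poly(Denominator, 2) * x, data = d)

## Denominator as a divisor: one fixed form of the relationship
fit_ratio <- lm(I(Numerator / Denominator) ~ x, data = d)

## The covariate models share a response, so they can be compared directly;
## fit_ratio has a different response and is not comparable by AIC.
aic_tab <- AIC(fit_lin, fit_quad, fit_bs, fit_int)
print(aic_tab)
```

Here the linear-in-Denominator fit should do visibly worse than the
polynomial or spline fits, which is something the ratio formulation gives
you no way to detect.
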



Best,


R.
--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz