Message-ID: <Pine.A41.4.63a.0509080716490.177972@homer04.u.washington.edu>
Date: 2005-09-08T14:22:32Z
From: Thomas Lumley
Subject: Prediction with multiple zeros in the dependent variable
In-Reply-To: <s31f806d.080@grecc.umaryland.edu>
On Thu, 8 Sep 2005, John Sorkin wrote:
> I have a batch of data in which each line contains three values:
> calcium score, age, and sex. I would like to predict calcium scores as a
> function of age and sex, i.e. calcium=f(age,sex). Unfortunately the
> calcium scores have a very "ugly distribution". There are multiple
> zeros, and multiple values between 300 and 600. There are no values
> between zero and 300. Needless to say, the calcium scores are not
> normally distributed; however, the values between 300 and 600 have a
> distribution that is log normal.
[Coronary artery calcium by EBCT, I presume]
Our approach to modelling calcium scores is to do it in two parts. First
fit something like a logistic regression model where the outcome is zero
vs non-zero calcium. Then, for the non-zero values, use something like a
linear regression model for log calcium.
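A minimal sketch of the two-part fit, written in plain Python rather than R so that it is self-contained. Everything in it is an assumption made for illustration: the data are simulated, the "true" coefficients are invented, age is rescaled to decades from 60 (age10), and the helper names (solve, fit_logistic, fit_linear) are hypothetical. It is not the code used for any real calcium analysis.

```python
import math
import random

def solve(A, b):
    """Solve A x = b for a small dense system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def weighted_cross(X, w):
    """Return X' diag(w) X."""
    p = len(X[0])
    return [[sum(wi * row[j] * row[k] for row, wi in zip(X, w)) for k in range(p)]
            for j in range(p)]

def cross(X, v):
    """Return X' v."""
    return [sum(row[j] * vi for row, vi in zip(X, v)) for j in range(len(X[0]))]

def fit_linear(X, y):
    """Ordinary least squares via the normal equations."""
    return solve(weighted_cross(X, [1.0] * len(X)), cross(X, y))

def fit_logistic(X, y, iterations=25):
    """Logistic regression by Newton-Raphson, starting from zero coefficients."""
    beta = [0.0] * len(X[0])
    for _ in range(iterations):
        prob = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
                for row in X]
        weights = [p * (1.0 - p) for p in prob]
        gradient = cross(X, [yi - pi for yi, pi in zip(y, prob)])
        step = solve(weighted_cross(X, weights), gradient)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Simulated data (invented): many exact zeros, positive scores of a few hundred.
rng = random.Random(2005)
X, calcium = [], []
for _ in range(2000):
    age10 = (rng.uniform(45, 85) - 60) / 10      # decades from age 60
    male = float(rng.random() < 0.5)
    p_nonzero = 1.0 / (1.0 + math.exp(-(-0.5 + 0.8 * age10 + 0.7 * male)))
    if rng.random() < p_nonzero:
        value = math.exp(rng.gauss(6.0 + 0.1 * age10, 0.2))
    else:
        value = 0.0
    X.append([1.0, age10, male])                 # intercept, age10, male
    calcium.append(value)

# Part 1: logistic regression for zero vs non-zero calcium.
beta_any = fit_logistic(X, [float(c > 0) for c in calcium])

# Part 2: linear regression for log calcium, non-zero scores only.
positive = [(row, math.log(c)) for row, c in zip(X, calcium) if c > 0]
beta_log = fit_linear([row for row, _ in positive], [v for _, v in positive])

print("zero vs non-zero (intercept, age10, male):", beta_any)
print("log calcium      (intercept, age10, male):", beta_log)
```

The two parts are fitted separately and need not share predictors; the same design matrix is used here only to keep the sketch short.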
You could presumably use such a model for prediction or imputation too,
and you can work out means, medians etc from the two models.
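As a rough sketch of how means and quantiles come out of the two parts: suppose, for one person, the first model gives a probability p_nonzero of a non-zero score, and the second gives a mean mu and residual standard deviation sigma on the log scale. Assuming the log-scale residuals are normal (an assumption, though it matches the log-normal shape described in the question), the pieces combine as below. The function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist

def overall_mean(p_nonzero, mu, sigma):
    """Mean of the mixture: P(non-zero) times the log-normal mean."""
    return p_nonzero * math.exp(mu + sigma ** 2 / 2)

def overall_quantile(q, p_nonzero, mu, sigma):
    """q-th quantile (0 < q < 1) of the mixture of a point mass at zero
    and a log-normal distribution."""
    p_zero = 1.0 - p_nonzero
    if q <= p_zero:
        return 0.0                       # the quantile falls in the mass at zero
    level = (q - p_zero) / p_nonzero     # quantile level within the non-zero part
    return math.exp(mu + sigma * NormalDist().inv_cdf(level))

# Hypothetical predictions for one person.
p_nonzero, mu, sigma = 0.5, math.log(400), 0.2
print("mean:          ", overall_mean(p_nonzero, mu, sigma))
print("median:        ", overall_quantile(0.5, p_nonzero, mu, sigma))
print("upper quartile:", overall_quantile(0.75, p_nonzero, mu, sigma))
```

Note that the median is exactly zero whenever the predicted probability of a zero score is one half or more, so the mean and the median can tell quite different stories.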
One particular reason for using this two-part model is that we find
different predictors of zero/non-zero and of amount. This makes biological
sense -- a factor that makes arterial plaques calcify might well have no
impact until you have arterial plaques.
Or you could use smooth quantile regression with rq() in the quantreg
package.
-thomas