Prediction with multiple zeros in the dependent variable

Thomas Lumley · 2005-09-08T14:22:32Z

On Thu, 8 Sep 2005, John Sorkin wrote:
> I have a batch of data in which each line contains three values:
> calcium score, age, and sex. I would like to predict calcium scores as a
> function of age and sex, i.e. calcium=f(age,sex). Unfortunately the
> calcium scores have a very "ugly distribution". There are multiple
> zeros, and multiple values between 300 and 600. There are no values
> between zero and 300. Needless to say, the calcium scores are not
> normally distributed, however, the va

[Coronary artery calcium by EBCT, I presume]

Our approach to modelling calcium scores is to do it in two parts.  First
fit something like a logistic regression model where the outcome is zero
vs non-zero calcium.  Then, for the non-zero scores, use something like a
linear regression model for log calcium.
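
The two-part fit described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used in any actual analysis: the helper names (solve, fit_ols, fit_logistic, fit_two_part) and the toy rows are hypothetical, and the predictors are simply an intercept, age, and sex coded 0/1.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Least squares via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def fit_logistic(X, z, iters=25):
    """Logistic regression by Newton-Raphson, starting from all-zero coefficients."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        p = [1 / (1 + math.exp(-sum(b * x for b, x in zip(beta, r)))) for r in X]
        grad = [sum((zi - pi) * r[i] for r, zi, pi in zip(X, z, p)) for i in range(k)]
        H = [[sum(pi * (1 - pi) * r[i] * r[j] for r, pi in zip(X, p))
              for j in range(k)] for i in range(k)]
        beta = [b + s for b, s in zip(beta, solve(H, grad))]
    return beta

def fit_two_part(rows):
    """rows: (calcium, age, sex) tuples. Returns (zero_model, amount_model)."""
    X = [[1.0, age, sex] for _, age, sex in rows]
    # Part 1: zero vs non-zero calcium, using every row.
    z = [1 if ca > 0 else 0 for ca, _, _ in rows]
    zero_model = fit_logistic(X, z)
    # Part 2: linear model for log calcium, using only the non-zero rows.
    pos = [(x, math.log(ca)) for x, (ca, _, _) in zip(X, rows) if ca > 0]
    amount_model = fit_ols([x for x, _ in pos], [ly for _, ly in pos])
    return zero_model, amount_model

# Hypothetical toy data, invented purely to exercise the code.
rows = [(0, 45, 0), (0, 50, 1), (320, 52, 0), (0, 55, 0), (410, 58, 1),
        (0, 60, 1), (350, 62, 0), (480, 65, 1), (0, 67, 0), (560, 70, 1),
        (500, 72, 0), (590, 75, 1)]
zero_model, amount_model = fit_two_part(rows)
print("zero/non-zero coefficients:", zero_model)
print("log-calcium coefficients:  ", amount_model)
```

Keeping the two coefficient vectors separate is what allows each part to use its own set of predictors; here both parts use age and sex only for brevity.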

You could presumably use such a model for prediction or imputation too, 
and you can work out means, medians etc from the two models.
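
As a rough sketch of how means and quantiles might be worked out from the two parts: suppose the first model gives a probability p of non-zero calcium for some age and sex, and the second gives a predicted mean mu of log calcium with residual standard deviation sigma. The code below additionally assumes that log calcium is normally distributed among the non-zero scores, which is an assumption of this example and not something stated in the thread. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def two_part_mean(p_nonzero, mu_log, sigma_log):
    """E[Y] = P(Y > 0) * E[Y | Y > 0], with a lognormal positive part (assumed)."""
    return p_nonzero * math.exp(mu_log + 0.5 * sigma_log ** 2)

def two_part_quantile(q, p_nonzero, mu_log, sigma_log):
    """q-th quantile (0 < q < 1) of the mixture of a point mass at zero and a lognormal."""
    p_zero = 1.0 - p_nonzero
    if q <= p_zero:
        # The requested quantile falls inside the mass of zeros.
        return 0.0
    # Rescale q to a quantile level within the positive part.
    q_pos = (q - p_zero) / p_nonzero
    return math.exp(mu_log + sigma_log * NormalDist().inv_cdf(q_pos))

# Hypothetical inputs: 50% chance of non-zero calcium, log calcium ~ N(6.0, 0.3^2).
print(two_part_mean(0.5, 6.0, 0.3))
print(two_part_quantile(0.5, 0.5, 6.0, 0.3))   # median is 0: half the mass sits at zero
print(two_part_quantile(0.75, 0.5, 6.0, 0.3))  # median of the positive part, exp(6.0)
```

Note that the median is exactly zero whenever at least half of the predicted mass is at zero, so the mean and the median can behave quite differently under this model.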

One particular reason for using this two-part model is that we find 
different predictors of zero/non-zero and of amount. This makes biological 
sense -- a factor that makes arterial plaques calcify might well have no 
impact until you have arterial plaques.

Or you could use smooth quantile regression with rq() in the quantreg package.
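
For readers unfamiliar with the idea, quantile regression chooses its fit by minimising the "check" loss rather than squared error. The sketch below shows only the simplest case, an intercept-only model with no smoothing and no covariates, where the minimiser is a sample quantile. It is not the quantreg implementation; the function names and the sample scores are hypothetical.

```python
def check_loss(residual, tau):
    """Check loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def best_constant(y, tau):
    """Intercept-only quantile regression by brute force.

    A minimiser of the summed check loss can always be found among the
    observed values, so it is enough to search over those.
    """
    return min(sorted(set(y)), key=lambda c: sum(check_loss(v - c, tau) for v in y))

# Hypothetical scores with a clump of zeros and nothing between 0 and 300.
scores = [0, 0, 0, 320, 410, 560, 590]
print(best_constant(scores, 0.5))   # median
print(best_constant(scores, 0.25))  # lower quartile, inside the clump of zeros
```

Because the loss depends only on the ordering of residuals, the gap between zero and 300 does not need any special handling here.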

 	-thomas
