Prediction with multiple zeros in the dependent variable
On Thu, 8 Sep 2005, John Sorkin wrote:
I have a batch of data in which each line contains three values: calcium score, age, and sex. I would like to predict calcium scores as a function of age and sex, i.e. calcium = f(age, sex). Unfortunately the calcium scores have a very "ugly" distribution. There are many zeros and many values between 300 and 600, but no values between zero and 300. Needless to say, the calcium scores are not normally distributed; however, the values between 300 and 600 have a log-normal distribution.
[Coronary artery calcium by EBCT, I presume]

Our approach to modelling calcium scores is to do it in two parts. First, fit something like a logistic regression model where the outcome is zero vs non-zero calcium. Then, for the non-zero scores, use something like a linear regression model for log calcium. You could presumably use such a model for prediction or imputation too, and you can work out means, medians, etc. from the two models.

One particular reason for using this two-part model is that we find different predictors of zero/non-zero and of amount. This makes biological sense: a factor that makes arterial plaques calcify might well have no impact until you have arterial plaques.

Or you could use smooth quantile regression with rq() in the quantreg package.

-thomas
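To make the "work out means, medians etc from the two models" step concrete, here is a minimal sketch in Python (standard library only) of how the two fitted parts might be combined for one subject. It assumes the logistic part gives P(calcium > 0) and that log calcium among the non-zero scores is normal with constant residual SD. The coefficient vectors `a` and `b`, the value of `sigma`, and all function names are hypothetical, made up for illustration; they are not estimates from any data set.

```python
import math
from statistics import NormalDist

def prob_nonzero(a, age, male):
    """Part 1: logistic model for P(calcium > 0 | age, sex)."""
    eta = a[0] + a[1] * age + a[2] * male
    return 1.0 / (1.0 + math.exp(-eta))

def log_mean(b, age, male):
    """Part 2: linear model for E[log calcium | calcium > 0, age, sex]."""
    return b[0] + b[1] * age + b[2] * male

def predicted_mean(p, mu, sigma):
    """Overall mean: P(nonzero) times the log-normal mean exp(mu + sigma^2 / 2)."""
    return p * math.exp(mu + 0.5 * sigma ** 2)

def predicted_quantile(p, mu, sigma, q=0.5):
    """q-th quantile of the zero/log-normal mixture (q = 0.5 is the median)."""
    if q <= 1.0 - p:
        # The point mass at zero already covers probability q.
        return 0.0
    # Otherwise solve (1 - p) + p * Phi(z) = q for z, then back-transform.
    z = NormalDist().inv_cdf((q - (1.0 - p)) / p)
    return math.exp(mu + sigma * z)

# Hypothetical coefficients (intercept, age, male) and residual SD.
a = (-6.0, 0.10, 0.8)   # assumed logistic part
b = (5.5, 0.01, 0.1)    # assumed log-linear part
sigma = 0.2             # assumed residual SD on the log scale

p = prob_nonzero(a, age=60, male=1)
mu = log_mean(b, age=60, male=1)
print("P(nonzero):", p)
print("predicted mean:", predicted_mean(p, mu, sigma))
print("predicted median:", predicted_quantile(p, mu, sigma))
```

Note that the predicted median is exactly zero whenever P(nonzero) is at most 0.5, which is one reason to report the probability of a non-zero score alongside any single summary.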