Prediction with multiple zeros in the dependent variable

I have a batch of data in which each line contains three values:
calcium score, age, and sex. I would like to predict calcium scores
as a function of age and sex, i.e. calcium=f(age,sex). Unfortunately
the calcium scores have a very "ugly distribution". There are
multiple zeros, and multiple values between 300 and 600. There are
no values between zero and 300. Needless to say, the calcium scores
are not normally distributed; however, the values between 300 and 600
have a distribution that is log normal. As you might imagine, the
residuals from the regression are not normally distributed and thus
violate the basic assumption of regression analyses. Does anyone
have a suggestion for a method (or a transformation) that will allow
me to predict calcium from age and sex without violating the assumptions
of the model?
Thanks,
John
From your description (but only from your description) one might be
tempted to suggest (borrowing a term from Joe Schafer) a "semi-continuous"
model. This means that each observation either takes a discrete value,
or takes a value with a continuous distribution. In your case this
might be

Score = 0 with probability p which is a function of Age and Sex
Score = X with probability (1-p) where X has a log-normal distribution.

Whether using such a model, for data arising in the context you refer
to, is reasonable depends on whether "Calcium Score = 0" is a reasonable
description of a biological state of things. Even if not a reasonable
biological state, it may be a reasonable description of the outcome
of a measurement process (e.g. too small to measure), in which case
there may be a consequential issue -- what is the likely distribution
of calcium values which give rise to Score = 0? (Though your data may
be uninformative about this). However, if your aim is simply predicting
calcium scores, then this may be irrelevant.

With such a model, you should be able to make progress by using
a log-linear model for the probability p (which may be adequately
addressed by simply using a logistic regression for the event
"Score = 0" or equivalently "score != 0", though you may need to
be careful about how you represent Age as a covariate; Sex, being
binary, should not present problems). This then allows you to predict
the probability of zero score, and the complementary probability
of non-zero score.
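
A minimal sketch of this first step in R. The real data are not shown in
the thread, so the data below are simulated; the variable names (zero,
age, sex) and every coefficient used to generate them are assumptions
made only for illustration.

```r
# Simulated stand-in for the real data (all numbers here are made up)
set.seed(1)
n   <- 500
age <- runif(n, 40, 80)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
# Assumed truth: the chance of a zero score falls with age, and is lower in men
p_true <- plogis(4 - 0.07 * age - 0.5 * (sex == "M"))
zero   <- rbinom(n, 1, p_true)          # 1 means "Score = 0", 0 means "Score != 0"
dat    <- data.frame(zero, age, sex)

# Logistic regression for the event "Score = 0"
fit_zero <- glm(zero ~ age + sex, family = binomial, data = dat)

# Predicted probability of a zero score, and its complement, for new subjects
new   <- data.frame(age = c(50, 70),
                    sex = factor(c("F", "M"), levels = levels(dat$sex)))
p_hat <- predict(fit_zero, newdata = new, type = "response")
print(cbind(new, p_zero = p_hat, p_nonzero = 1 - p_hat))
```

If the effect of Age does not look linear on the logit scale, one option
is to replace age in the formula with a spline term such as
splines::ns(age, df = 3); whether that is needed depends on the data.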

Then you can consider the problem of estimating the relationship
between Score and (Age, Sex) conditional on Score != 0.

This, in turn, is no more (and no less!) complicated than estimating
the continuous distribution of non-zero scores from the subset of
the data which carries such scores.

If the distribution of non-zero scores were (as you suggest) a simple
log-normal distribution, then a regression of log(Score) on Age and
Sex might do well.
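
As a rough sketch of that regression, again on simulated data (the
column names and generating coefficients are assumptions, chosen so that
the non-zero scores fall roughly between 300 and 600):

```r
# Simulated stand-in for the real data (all numbers here are made up)
set.seed(2)
n    <- 500
age  <- runif(n, 40, 80)
sex  <- factor(sample(c("F", "M"), n, replace = TRUE))
zero <- rbinom(n, 1, 0.4) == 1
score <- ifelse(zero, 0,
                exp(5.8 + 0.005 * age + 0.05 * (sex == "M") + rnorm(n, sd = 0.1)))
dat <- data.frame(score, age, sex)

# Regress log(Score) on Age and Sex, using only the rows with Score != 0
fit_pos <- lm(log(score) ~ age + sex, data = dat, subset = score > 0)
print(summary(fit_pos)$coefficients)

# Residual SD on the log scale; needed later to turn log-scale fits into means
sigma_hat <- summary(fit_pos)$sigma
```

The residuals of fit_pos (on the log scale) are what should look
approximately normal if the simple log-normal description holds.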

However, from your description, it may not be a simple log-normal.

The absence of scores between 0 and 300, and the containment of
score values between 300 and 600, suggests a 3-parameter log-normal
in which, as well as the mean and SD for the normal distribution of
log(X) there is also a lower limit S0, so that it is

  log(S - S0)

which has the N(mean,SD^2) distribution. The distribution might be
more complicated than this.
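
One simple way to try this in R is to treat the lower limit S0 as known
and regress log(S - S0) on the covariates. In the sketch below S0 = 300
is an assumption suggested by the "no values between zero and 300"
remark, not an estimate, and the data are simulated. The helper
loglik_shift is hypothetical; it only compares candidate values of S0 on
the scale of the original scores.

```r
# Simulated non-zero scores from a shifted (3-parameter) log-normal
set.seed(3)
n   <- 400
age <- runif(n, 40, 80)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
S0  <- 300                                # ASSUMED lower limit
score <- S0 + exp(3.5 + 0.02 * age + 0.3 * (sex == "M") + rnorm(n, sd = 0.5))
dat <- data.frame(score, age, sex)

# Regression of log(S - S0) on Age and Sex, with S0 held fixed
fit_shift <- lm(log(score - S0) ~ age + sex, data = dat)
print(coef(fit_shift))

# Log-likelihood of the scores for a given S0 (includes the Jacobian term),
# so that different candidate lower limits can be compared
loglik_shift <- function(S0, data) {
  fit <- lm(log(score - S0) ~ age + sex, data = data)
  sig <- sqrt(mean(resid(fit)^2))         # ML estimate of the SD
  z   <- log(data$score - S0)
  sum(dnorm(z, mean = fitted(fit), sd = sig, log = TRUE) - z)
}
print(c(S0_300 = loglik_shift(300, dat), S0_0 = loglik_shift(0, dat)))
```

Estimating S0 itself by maximum likelihood is more delicate, since the
likelihood misbehaves as S0 approaches the smallest observed score, so
comparing a few plausible fixed values may be the safer starting point.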

So, in summary, provided a "semi-continuous" model is acceptable,
you can proceed by estimating its two aspects separately: The
discrete part by a logistic (or other suitable binary) regression,
using 'glm' in R; the continuous part by a suitable regression
(using e.g. 'lm' in R) perhaps after suitable transformation
(though this may need care). In each case, it is only the relevant
part of the data (the proportions with "Score = 0" and "Score != 0"
on the one hand, the values of Score where "Score != 0" on the other
hand, in each case using the corresponding (Age, Sex) as covariates)
which will be needed.

Once you have these estimated models, they can be used straightforwardly
for prediction: Given Age and Sex, the Score will be zero with
estimated probability p(Age,Sex) or, with probability (1 - p(Age,Sex)),
will have a distribution implied by your regression.

So the structure of the predicted values will be the same as the
structure of the observed values. All very straightforward, provided
this is a reasonable way to go.
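
The two pieces can be put together for prediction along these lines.
This is only a sketch on simulated data: the column names and generating
coefficients are assumptions, and the non-zero part is taken to be a
simple log-normal, which may not suit the real scores.

```r
# Simulated stand-in for the real data (all numbers here are made up)
set.seed(4)
n   <- 1000
age <- runif(n, 40, 80)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
is_zero <- rbinom(n, 1, plogis(4 - 0.07 * age - 0.5 * (sex == "M"))) == 1
score <- ifelse(is_zero, 0,
                exp(5.8 + 0.005 * age + 0.05 * (sex == "M") + rnorm(n, sd = 0.1)))
dat <- data.frame(score, zero = as.integer(score == 0), age, sex)

# Discrete part: probability that Score = 0
fit_zero <- glm(zero ~ age + sex, family = binomial, data = dat)
# Continuous part: log(Score) among the non-zero scores
fit_pos  <- lm(log(score) ~ age + sex, data = dat, subset = score > 0)

new <- data.frame(age = c(50, 70),
                  sex = factor(c("F", "M"), levels = levels(dat$sex)))
p0  <- predict(fit_zero, newdata = new, type = "response")
mu  <- predict(fit_pos,  newdata = new)   # mean of log(Score) given Score != 0
s   <- summary(fit_pos)$sigma             # residual SD on the log scale

pred <- data.frame(new,
                   p_zero            = p0,
                   median_if_nonzero = exp(mu),
                   mean_if_nonzero   = exp(mu + s^2 / 2),
                   overall_mean      = (1 - p0) * exp(mu + s^2 / 2))
print(pred)
```

The predicted distribution for a new subject is then a point mass p_zero
at zero plus, with probability 1 - p_zero, a log-normal with parameters
mu and s.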

However, there is a complication in that the above might well not
be a reasonable model (as hinted at above). As an example, consider
the following (purely hypothetical assumptions).

1. The true distribution of Calcium Score is (say) simple log-normal
   such that log(Score) is normal with mean linearly dependent on Age
   and Sex, in all subjects.

2. In attempting to measure true Score (i.e. in obtaining observed
   Calcium Score data), there is a probability that "Score = 0"
   will be obtained, and this probability depends on the true Score
   (e.g. the smaller the true Score, the higher the probability of
   obtaining "Score = 0").

The resulting non-zero score data will then no longer have the log-normal
distribution assumed in (1), since the frequency of occurrence of
smaller values will be attenuated by a factor equal to the probability
that such a value will result in "Score != 0".

(I'm inclined to suspect, from your statement about "300-600", that
this might indeed be the case.)

If this is what is going on, then a different kind of approach is
needed. Each "Score = 0" would in fact correspond to an unobserved
non-zero value of Score, and the estimation of the distribution of
true Score would be straightforward if you knew what these values
were. Conditional on knowing the overall distribution, the distribution
of unobserved values conditional on "Score = 0" could be obtained,
and from this distribution could be derived the information you would
need to estimate the distribution of true Score which you need for
estimating the conditional distribution ...

In other words, we are in effect in an "EM-Algorithm" situation.
This can certainly be solved in R (though I can't at this moment
provide any pointers to R-implementations of a solution for your
specific problem).
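
To give a flavour of what such an EM iteration looks like, here is a
sketch for the simplest special case only: every true score below a
fixed detection limit is recorded as zero, and every score above it is
observed. That hard cut-off, the limit of 300, the simulated data and
all variable names are assumptions; the more general situation in (2),
where the probability of a zero varies smoothly with the true score,
would need an additional model for that probability.

```r
# Simulated data: log(true Score) is normal with mean linear in Age and Sex
set.seed(5)
n     <- 2000
age   <- runif(n, 40, 80)
male  <- rbinom(n, 1, 0.5)
X     <- cbind(intercept = 1, age = age, male = male)
limit <- 300                               # ASSUMED detection limit
c0    <- log(limit)
y     <- drop(X %*% c(5.0, 0.012, 0.1)) + rnorm(n, sd = 0.3)
cens  <- y < c0                            # these would be recorded as "Score = 0"
y_obs <- ifelse(cens, NA, y)               # log(Score) where observed

# Starting values from the non-zero scores only (biased, but a usable start)
beta  <- lm.fit(X[!cens, ], y_obs[!cens])$coefficients
sigma <- sd(y_obs[!cens])

for (iter in 1:500) {
  # E-step: moments of log(true Score) given that it fell below the limit
  mu  <- drop(X %*% beta)
  a   <- (c0 - mu[cens]) / sigma
  lam <- dnorm(a) / pnorm(a)
  ey  <- y_obs
  ey[cens] <- mu[cens] - sigma * lam                  # E[y | y < c0]
  v   <- numeric(n)
  v[cens]  <- sigma^2 * (1 - a * lam - lam^2)         # Var[y | y < c0]

  # M-step: least squares on the completed data, then update the SD
  beta_new  <- lm.fit(X, ey)$coefficients
  sigma_new <- sqrt(mean((ey - drop(X %*% beta_new))^2 + v))

  converged <- max(abs(beta_new - beta), abs(sigma_new - sigma)) < 1e-8
  beta  <- beta_new
  sigma <- sigma_new
  if (converged) break
}
print(round(c(beta, sigma = sigma), 4))
```

For this special case the same model can also be fitted directly as a
left-censored Gaussian regression, for example with survreg in the
survival package, so the hand-written loop is mainly useful as a
template for the more general problem.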

However, it would be quite feasible for people to construct
suggestions for solving your problem along these lines. But before
people get involved in the work needed, it would be very helpful
if you would respond to the comments above in terms of the real
situation you are dealing with, so that we know what sort of thing
we should be thinking about.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Sep-05                                       Time: 12:01:38
------------------------------ XFMail ------------------------------
John Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
Baltimore VA Medical Center GRECC and
University of Maryland School of Medicine Claude Pepper OAIC
John - first I would try a proportional odds model, with zero as its own 
category then treating all other values as continuous or collapsing them 
into 20-tiles.  If the PO assumption happens to hold (look at partial 
residual plots) you have a simple solution.
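
A sketch of the 20-tile version of this suggestion, using polr from the
MASS package as one available proportional odds fitter (an assumption;
the message does not name a function). The data are simulated and the
variable names are made up for illustration.

```r
library(MASS)                               # provides polr()

# Simulated stand-in for the real data (all numbers here are made up)
set.seed(6)
n   <- 1000
age <- runif(n, 40, 80)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
is_zero <- rbinom(n, 1, plogis(4 - 0.07 * age - 0.5 * (sex == "M"))) == 1
score <- ifelse(is_zero, 0,
                exp(5.8 + 0.005 * age + 0.05 * (sex == "M") + rnorm(n, sd = 0.1)))

# Category 0 for zero scores, categories 1..20 for 20-tiles of non-zero scores
pos    <- score > 0
breaks <- quantile(score[pos], probs = seq(0, 1, length.out = 21))
grp    <- integer(n)
grp[pos] <- as.integer(cut(score[pos], breaks = breaks, include.lowest = TRUE))
dat <- data.frame(cat = factor(grp, levels = 0:20, ordered = TRUE), age, sex)

# Proportional odds (ordinal logistic) model
fit_po <- polr(cat ~ age + sex, data = dat)
print(coef(fit_po))
```

Checking the PO assumption itself is a separate step that this sketch
does not cover.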

Frank
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
[Coronary artery calcium by EBCT, I presume]

Our approach to modelling calcium scores is to do it in two parts.  First 
fit something like a logistic regression model where the outcome is zero 
vs non-zero calcium.  Then, for the non-zero use something like a linear 
regression model for log calcium.

You could presumably use such a model for prediction or imputation too, 
and you can work out means, medians etc from the two models.
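
For instance, if the zero part gives a probability p0 of a zero score
and the non-zero part is log-normal, the overall mean and quantiles
follow directly. The two helper functions below are hypothetical names,
and the numbers plugged in are illustrative only; in practice p0,
meanlog and sdlog would come from the two fitted models for a given
subject.

```r
# Quantile function of a two-part model: point mass p0 at zero,
# log-normal(meanlog, sdlog) with probability 1 - p0 (assumes p0 < 1)
q_two_part <- function(tau, p0, meanlog, sdlog) {
  # pmax() keeps the argument of qlnorm() valid where tau <= p0
  u <- pmax((tau - p0) / (1 - p0), 0)
  ifelse(tau <= p0, 0, qlnorm(u, meanlog, sdlog))
}

# Overall mean of the same two-part model
mean_two_part <- function(p0, meanlog, sdlog) {
  (1 - p0) * exp(meanlog + sdlog^2 / 2)
}

# Illustrative values only
p0 <- 0.3; meanlog <- 6; sdlog <- 0.1
print(q_two_part(c(0.25, 0.5, 0.9), p0, meanlog, sdlog))  # lower quartile, median, 90th centile
print(mean_two_part(p0, meanlog, sdlog))
```

Note that the overall median is exactly zero whenever p0 is 0.5 or more,
so the mean and the median can tell quite different stories for the
same subject.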

One particular reason for using this two-part model is that we find 
different predictors of zero/non-zero and of amount. This makes biological 
sense -- a factor that makes arterial plaques calcify might well have no 
impact until you have arterial plaques.

Or you could use smooth quantile regression with the quantreg package
(the home of the rq function).

 	-thomas
John:

1. As George Box long ago emphasized and proved, normality is **NOT** that
important in regression, certainly not for estimation and not even for
inference in balanced designs. Independence of the observations is far more
important. 

2. That said, it sounds like what you have here is a mixture of some sort.
Before running off to do fancy modeling, I would work very hard to look for
some kind of "lurking variable" or experimental aberration -- what was going
on in the experiment or study that might have caused all the zeros? Was
there an instrument problem? -- a bad reagent? -- improper handling of the
samples? It might very well be that you need to throw away part of the data
because it's useless, rather than artificially attempt to model it.

3. And having said that, if a comprehensive model IS called for, one rather
cynical approach to take is just to add a grouping variable as a covariate
that has a value of 1 for all data in the zero group and 2 for all the
nonzero data. Your model is f(age,sex) = 0 for all data in group 1 and your
linear or nonlinear regression for group 2. Of course, this merely cloaks
the cynicism in respectable dress. It's hard for me to believe that it was
Mother Nature and not some kind of experimental problem that you see. 

A slightly less cynical approach might be to use some sort of changepoint
model (in both age and sex) of the form f(age, sex) = g(age,sex) for age>=k1
and sex <=k2 and h(age,sex) otherwise. Well, perhaps **not** less cynical --
the response data are so widely separated that you'll just be using a bunch
of extra (nonlinear, incidentally) parameters to essentially reproduce the
use of a covariate.

So I guess the point is that unless you already have a previously developed
nonlinear model that could explain the behavior you see (perhaps based on
some kind of mechanistic reasoning) it's not a good idea to try to develop
an artificial empirical model that comprehends all the data. The fact is (a
horrible phrase) that no modeling at all is needed for the most important
message the data have to convey: rather, focus on the cause of the message
instead of statistical artifice. Once you have determined that, you may be
able to do something sensible. Clear thinking trumps muddy modeling every
time.

(Hopefully, this is sufficiently inflammatory that others will vigorously
and wisely dispute me).

Cheers,

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch 
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of John Sorkin
Sent: Wednesday, September 07, 2005 9:06 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Prediction with multiple zeros in the dependent variable

University of Maryland School of Medicine
Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524

410-605-7119 
-- NOTE NEW EMAIL ADDRESS:
jsorkin at grecc.umaryland.edu
