predict (PR#2686)
<Bravington wrote:>
`predict' complains about new factor levels, even if the "new" levels are merely levels in the original that didn't occur in the original fit and were sensibly dropped, and that don't occur in the prediction data either.
<Ripley replied:>
This is intentional. The coding for factors is based on the full set of levels, and should be comparable for different prediction sets. If you are using factors with fictitious levels the fix is obvious: improve the design.
<Bravington again:> There is still an inconsistency bug between `lm' and `predict.lm', though.
`lm' intentionally overlooks inactive levels of a factor,
<Ripley again:> Only if an argument is set, and originally lm did not do so.
<Bravington again:> But `lm' always does this now, doesn't it? -- even if it didn't originally. I think you can't not drop unused levels, even if you wanted to.
<Bravington, continuing:> but `predict.lm' doesn't, even when it legitimately could. In particular, it is a bit odd to have no problem predicting without a `newdata' argument even when the original data had inactive factor levels, but then to get an error if `newdata=<<original data>>' is supplied explicitly! (See example.)
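The example referred to is not reproduced in this thread. A minimal sketch of the kind of situation described might look like the following, assuming made-up data: the data frame `d`, the response `y` and the factor `f` are hypothetical. The explicit `newdata` call is wrapped in `tryCatch()` because its outcome is exactly what is under dispute, and it may differ between R versions.

```r
# Hypothetical data: factor `f` declares level "c" but never uses it.
d <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  f = factor(c("a", "a", "a", "b", "b", "b"), levels = c("a", "b", "c"))
)

fit <- lm(y ~ f, data = d)

# The data declare three levels, but the fit records only the active ones.
levels(d$f)
fit$xlevels

# Prediction without `newdata` reuses the fitted model frame.
p_default <- predict(fit)

# Prediction with the original data supplied explicitly.
# The result is either the predictions or the error message, depending on
# how the installed version of predict.lm treats inactive levels.
p_explicit <- tryCatch(
  predict(fit, newdata = d),
  error = function(e) conditionMessage(e)
)
p_explicit
```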
<Ripley:> Read again. predict.lm is consistent across its inputs: unlike lm it can take variable `newdata'. As I said the intention is to be consistent across *prediction sets*. Omitting newdata is not giving a prediction set.
<Bravington again:>
Mmm-- that's getting a bit metaphysical for me-- when is a prediction not a
prediction, and what is ``predict'' actually doing if it is not predicting?!
Anyhow, according to the help page for `predict.lm':
    If the fit is rank-deficient, some of the columns of the design
    matrix will have been dropped. Prediction from such a fit only
    makes sense if `newdata' is contained in the same subspace as the
    original data. That cannot be checked accurately, so a warning is
    issued.
The subspace condition is obviously satisfied if the prediction data is the
same as the original data-- so prediction does "make sense" in that context
according to the documentation (as well as common sense. Normally I am no
fan of slavish adherence to documentation, but in my own interests I'll make
an exception...). And yet there's an error message, not even a warning.
Prediction from the original data was just an example, of course; my general
proposal is that inactive factor levels in the prediction set should be
dropped. I don't see how this could ever cause inconsistent behaviour across
prediction sets-- have I missed something?
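The proposal can be approximated by hand on the caller's side. The following is a minimal sketch with made-up data (`d`, `newd` and `bad` are hypothetical names). It uses `droplevels()` from current base R, which is not mentioned anywhere in this thread. It drops inactive levels from the prediction set before calling `predict`, and then checks that a level which really does occur in the prediction data, but not in the fit, is still rejected.

```r
# Hypothetical data: factor `f` declares level "c" but never uses it.
d <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  f = factor(c("a", "a", "a", "b", "b", "b"), levels = c("a", "b", "c"))
)
fit <- lm(y ~ f, data = d)

# Drop declared-but-unused levels from every factor in the prediction set.
newd <- droplevels(d)
levels(newd$f)

# With inactive levels removed, the predictions match the fitted values.
p <- predict(fit, newdata = newd)
same_as_fitted <- isTRUE(all.equal(unname(p), unname(fitted(fit))))

# A level that actually occurs in the prediction data but was absent from
# the fit is a different matter: there is no coefficient for it.
bad <- data.frame(f = factor(c("a", "c"), levels = c("a", "b", "c")))
res <- tryCatch(predict(fit, newdata = bad), error = function(e) e)
rejected <- inherits(res, "error")
```

Dropping only the levels that do not occur leaves genuinely new levels in place, so they are still caught.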
cheers
Mark
*******************************
Mark Bravington
CSIRO (CMIS)
PO Box 1538
Castray Esplanade
Hobart
TAS 7001
phone (61) 3 6232 5118
fax (61) 3 6232 5012
Mark.Bravington@csiro.au