Skip to content
Prev 15221 / 63421 Next

using poly in a linear regression in the presence of NAf ails (despite subsetting them out)

Dear Andy, Brian, and Markus,

I've moved this to r-devel because the issue is a bit esoteric. I apologize
for joining the discussion so late, but didn't have time earlier in the week
to formulate these ideas.

I believe that I understand and appreciate Brian's point, but think that the
issue isn't entirely clear-cut.

It seems to me that two kinds of questions arise with missing data. The
deeper ones are statistical but there are also nontrivial mechanical issues
such as the one here.

Whenever a term in a model is parametrized with a data-dependent basis,
there's a potential for problems and confusion -- manifested, for example,
in the issue of "safe" prediction. I don't think that these problems are
unique to missing data. On the other hand, the basis selected for the
subspace spanned by a term shouldn't be the most important consideration.
That is, when models are equivalent -- as, for example lm(y ~ x + I(x^2))
and lm(y ~ poly(x, 2)), an argument could be made for treating them
similarly.

Brian's point that NAs, say, in x2, can influence the basis for poly(x1, 2)
is disquieting, but note that this can happen now if there are no NAs in x1.
The point, therefore, doesn't really justify the current behaviour of
poly(). Indeed, if there are NAs in x2 but not in x1, the columns
representing poly(x1, 2) won't be orthogonal in the subset of cases used in
the model fit (though they do provide a correct basis for the term).

Consider another example -- lm(y ~ f, subset = f != "a"), where f is a
factor with levels "a", "b", and "c", and where there are NAs in f. Here the
basis for the f term is data dependent, in that the baseline level is taken
as "b" in the absence of "a", yet NAs don't cause an error.

Having poly(), bs(), and ns() report a more informative error message
certainly is preferable to the current situation, but an alternative would
be to have them work and perhaps report a warning.

If the object is to protect people from stepping into traps because of
missing data, then an argument could be made for having the default
na.action be na.fail, as in S-PLUS. (I wouldn't advocate this, by the way.)
Perhaps I'm missing some consideration here, but isn't it clearest to allow
the data, subset, and na.action arguments to determine the data in which the
formula is evaluated? 

Regards,
John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
--------------------------------