Skip to content

refitting lm() with same x, different y

4 messages · William Valdar, Brian Ripley, Douglas Bates

#
Dear All,

Is there is a fast way of refitting lm() when the design matrix stays constant 
but the response is different? For example,

y1 ~ X
y2 ~ X
y3 ~ X
...etc.

where y1 is the 1st instance of the response vector. Calling lm() every 
time seems rather wasteful since the QR-decomposition of X needs to be 
calculated only once. It would be nice if qr() was called only once and 
then the same QR-factorization used in all subsequent fits. However, I 
can't see a way to do this easily. Can anybody else?

Why do I want to do this? I'm fitting ~1000 different X's to a response 
vector (for biologists: 1000 genetic markers to a measured phenotype with 
2000 cases) and wish to establish global significance thresholds for 
multiple testing. The fits have a complex dependency structure that makes 
the Bonferroni correction inappropriate. So I intend to refit all ~1000 
X's with a shuffled response many times. However, this runs too slow for 
my needs.

Of course, not having to redo QR will only help if QR is a rate limiting 
step in lm(), so if anybody can tell me it's not, then that would be very 
helpful too. I would also like to do this for glm() and lmer() fits. 
Ideally.

Many thanks,

William

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Dr William Valdar               ++44 (0)1865 287 717
Wellcome Trust Centre           valdar at well.ox.ac.uk
for Human Genetics, Oxford      www.well.ox.ac.uk/~valdar
#
William,

As a first shot, use lm with a matrix response.  That fits them all at 
once with one QR-decomposition.  No analogue for glm or lmer, though, 
since for those the iterative fits run do depend on the response.

Brian
On Mon, 18 Apr 2005, William Valdar wrote:

            

  
    
#
Thanks Brian, that's very helpful. Also thanks to Kevin Wright who 
suggested using lsfit(x,Y) as being faster than lm for a Y matrix.

I've since worked out that I can bypass even more lm machinery by basing 
my permutation test significance thresholds on the RSS from qr.resid().
Since,

     y = QRb + e
   Q'y = Rb + Q'e
   RSS = || Q'y - Rb ||

then I can do

   X.qr <- qr(X)

once, and for every instance of y calculate

   e   <- qr.resid(X.qr, y)
   rss <- e %*% e

recording them in

   rss.for.all.fits[i] <- rss

which gives me an empirical distribution of RSS scores. The degrees of 
freedom in my X matrix are constant throughout (I should have said that 
before), so all RSS's are on a level footing and map trivially to the 
p-value. I can therefore take the RSS at, say, the 5th percentile, turn it 
into a p-value and report that as my 5% threshold.

Thanks again,

William

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Dr William Valdar               ++44 (0)1865 287 717
Wellcome Trust Centre           valdar at well.ox.ac.uk
for Human Genetics, Oxford      www.well.ox.ac.uk/~valdar
#
William Valdar wrote:
Actually you do not need to calculate the residuals to be able to 
calculate RSS.  If you write Q = [Q1 Q2] where Q1 is the first p columns 
and Q2 is the remaining n - p columns and R1 for the first p rows of R 
then your expression for RSS can be extended as

   RSS = || Q'y - Rb || = || Q1'y - R1 b || + || Q2'y ||

because the last n - p rows of R are zero.  At the least squares value 
of b the first term is zero when X has full column rank.  Thus

   rss <- sum(qr.qty(X.qr, y)[-(1:p)]^2)

qr.qty should be slightly faster than qr.resid because it performs only 
one (virtual) multiplication by Q or Q'.  I doubt that the difference 
would be noticeable in practice.