Dear All,

Is there a fast way of refitting lm() when the design matrix stays constant but the response is different? For example,

y1 ~ X
y2 ~ X
y3 ~ X
...etc.

where y1 is the 1st instance of the response vector. Calling lm() every time seems rather wasteful, since the QR decomposition of X needs to be calculated only once. It would be nice if qr() was called only once and then the same QR factorization used in all subsequent fits. However, I can't see a way to do this easily. Can anybody else?

Why do I want to do this? I'm fitting ~1000 different X's to a response vector (for biologists: 1000 genetic markers to a measured phenotype with 2000 cases) and wish to establish global significance thresholds for multiple testing. The fits have a complex dependency structure that makes the Bonferroni correction inappropriate, so I intend to refit all ~1000 X's with a shuffled response many times. However, this runs too slowly for my needs.

Of course, not having to redo the QR will only help if QR is a rate-limiting step in lm(), so if anybody can tell me it's not, that would be very helpful too. Ideally, I would also like to do this for glm() and lmer() fits.

Many thanks,

William

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Dr William Valdar                 ++44 (0)1865 287 717
Wellcome Trust Centre             valdar at well.ox.ac.uk
for Human Genetics, Oxford        www.well.ox.ac.uk/~valdar
refitting lm() with same x, different y
4 messages · William Valdar, Brian Ripley, Douglas Bates
William,

As a first shot, use lm with a matrix response. That fits them all at once with one QR decomposition. There is no analogue for glm or lmer, though, since for those the iterative fits do depend on the response.

Brian
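A minimal sketch of the matrix-response approach, with invented toy data (the sizes n, p, m and all values below are illustrative, not from the thread):

```r
## Fit many responses against one design matrix in a single lm() call.
## All data here are simulated purely for illustration.
set.seed(1)
n <- 100; p <- 4; m <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * m), n, m)   # each column of Y is one response vector

fit.all <- lm(Y ~ X)              # one QR decomposition serves every column
fit.one <- lm(Y[, 1] ~ X)         # the same model fitted to a single column

## Column 1 of the multi-response coefficients equals the single fit:
all.equal(unname(coef(fit.all)[, 1]), unname(coef(fit.one)))
```

Internally lm() factorises X once and applies the factorisation to every column of Y, which is exactly the saving asked for above.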
On Mon, 18 Apr 2005, William Valdar wrote:
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
From: Brian Ripley
As a first shot, use lm with a matrix response. That fits them all at once with one QR decomposition. No analogue for glm or lmer, though, since for those the iterative fits do depend on the response.
Thanks Brian, that's very helpful. Also thanks to Kevin Wright who
suggested using lsfit(x,Y) as being faster than lm for a Y matrix.
I've since worked out that I can bypass even more lm machinery by basing
my permutation test significance thresholds on the RSS from qr.resid().
Since,
y = QRb + e
Q'y = Rb + Q'e
RSS = || Q'y - Rb ||^2
then I can do
X.qr <- qr(X)
once, and for every instance of y calculate
e <- qr.resid(X.qr, y)
rss <- e %*% e
recording them in
rss.for.all.fits[i] <- rss
which gives me an empirical distribution of RSS scores. The degrees of
freedom in my X matrix are constant throughout (I should have said that
before), so all RSS's are on a level footing and map trivially to the
p-value. I can therefore take the RSS at, say, the 5th percentile, turn it
into a p-value and report that as my 5% threshold.
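Putting the steps above together, a minimal sketch of the permutation loop for a single design matrix (the data, sizes, and number of permutations here are invented for illustration):

```r
## Permutation loop reusing one QR factorisation of X.
## All data below are simulated purely for illustration.
set.seed(1)
n <- 200; p <- 5; n.perm <- 1000
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # design with intercept
y <- rnorm(n)

X.qr <- qr(X)                      # factorise the design matrix once
rss.for.all.fits <- numeric(n.perm)
for (i in seq_len(n.perm)) {
  e <- qr.resid(X.qr, sample(y))   # residuals for a shuffled response
  rss.for.all.fits[i] <- sum(e^2)  # RSS for this permutation
}

## RSS at, say, the 5th percentile of the empirical distribution:
quantile(rss.for.all.fits, 0.05)
```

As a sanity check, qr.resid(X.qr, y) agrees with residuals(lm(y ~ X + 0)) here, so sum(e^2) is the usual RSS.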
Thanks again,
William
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Dr William Valdar ++44 (0)1865 287 717
Wellcome Trust Centre valdar at well.ox.ac.uk
for Human Genetics, Oxford www.well.ox.ac.uk/~valdar
William Valdar wrote:
Actually you do not need to calculate the residuals to be able to calculate RSS. If you write Q = [Q1 Q2], where Q1 is the first p columns and Q2 is the remaining n - p columns, and write R1 for the first p rows of R, then your expression for RSS can be extended as

RSS = || Q'y - Rb ||^2 = || Q1'y - R1 b ||^2 + || Q2'y ||^2

because the last n - p rows of R are zero. At the least squares value of b the first term is zero when X has full column rank. Thus

rss <- sum(qr.qty(X.qr, y)[-(1:p)]^2)

qr.qty should be slightly faster than qr.resid because it performs only one (virtual) multiplication by Q or Q', but I doubt that the difference would be noticeable in practice.
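A quick numerical check of this identity on toy data (the sizes and values are invented here, not from the thread):

```r
## Verify RSS = || Q2'y ||^2, i.e. the sum of squares of the last n - p
## elements of Q'y, against the explicit-residual computation.
## All data below are simulated purely for illustration.
set.seed(1)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
X.qr <- qr(X)

rss.qty   <- sum(qr.qty(X.qr, y)[-(1:p)]^2)  # drop the first p elements of Q'y
rss.resid <- sum(qr.resid(X.qr, y)^2)        # RSS via explicit residuals

all.equal(rss.qty, rss.resid)
```

Both routes reuse the same stored factorisation, so either fits naturally into the permutation loop above; qr.qty simply skips forming the residual vector.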