
Diag "Hat" matrix

4 messages · Kenneth Cabrera, Brian Ripley, Peter Dalgaard +1 more

#
Hi R users:

What is the difference between the diagonal of the "hat" matrix as
computed by lm.influence() and as computed directly with the matrix
operations solve() and t()?

I mean, this is my X matrix

        x1    x2    x3    x4     x5
 [1,] 0.297 0.310 0.290 0.220 0.1560
 [2,] 0.360 0.390 0.369 0.297 0.2050
 [3,] 0.075 0.058 0.047 0.034 0.0230
 [4,] 0.114 0.100 0.081 0.058 0.0420
 [5,] 0.229 0.213 0.198 0.142 0.0102
 [6,] 0.315 0.304 0.267 0.202 0.1470
 [7,] 0.477 0.518 0.496 0.395 0.2850
 [8,] 0.072 0.063 0.047 0.036 0.0240
 [9,] 0.099 0.092 0.074 0.056 0.0038
[10,] 0.420 0.452 0.425 0.332 0.2350
[11,] 0.189 0.178 0.153 0.107 0.0760
[12,] 0.369 0.391 0.364 0.286 0.2000
[13,] 0.142 0.124 0.105 0.077 0.0560
[14,] 0.094 0.087 0.072 0.049 0.0320
[15,] 0.171 0.161 0.145 0.094 0.0680
[16,] 0.378 0.420 0.380 0.281 0.2000

If I use:
diag(X%*%solve(t(X)%*%X)%*%t(X))
I obtain:
 [1] 0.15248181 0.27102872 0.11476375 0.12941386 0.90455886 0.32246292
 [7] 0.43858581 0.16533854 0.37415984 0.19100227 0.17023090 0.15125134
[13] 0.17855019 0.06023773 0.52137996 0.85455350

But when I use the lm.influence() function
lm.influence(mt)$hat
I obtain:
 [1] 0.1735989 0.2999146 0.2334095 0.1455117 0.9216644 0.7553856 0.4486403
 [8] 0.2755802 0.4188349 0.1914242 0.1790093 0.1573939 0.1787553 0.1975511
[15] 0.5664988 0.8568274
mt is a model of the type y~x1+x2+x3+x4+x5, where y is:
y
[1]  17  17  35  69  69 173 173  17  17  73  17  35  69  35  35  52

As you can see, the differences are not small.
Where is the problem? Is it only a numerical stability problem?

Thank you very much for your help


Kenneth Cabrera
krcabrer at perseus.unalmed.edu.co
krcabrer at epm.net.co
Universidad Nacional de Colombia
Sede Medellin

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
On Thu, 7 Jun 2001, Kenneth Cabrera wrote:

None.  You have forgotten the intercept: the design matrix for

y~x1+x2+x3+x4+x5

has a first column of ones.  Try using cbind(1, X).
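The fix can be checked directly. A minimal sketch with simulated data (the X and y below are made-up stand-ins, not the poster's values): once the column of ones is prepended, the explicit formula matches lm.influence().

```r
# Sketch with simulated data (not the original poster's X and y):
# prepending the intercept column makes the explicit hat-matrix
# formula agree with lm.influence().
set.seed(1)
X   <- matrix(rnorm(16 * 5), 16, 5)
y   <- rnorm(16)
X1  <- cbind(1, X)                                 # add the intercept column
h   <- diag(X1 %*% solve(t(X1) %*% X1) %*% t(X1))  # explicit formula
fit <- lm(y ~ X)                                   # intercept is implicit
all.equal(unname(h), unname(lm.influence(fit)$hat))  # TRUE
```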

[...]
#
Kenneth Cabrera <krcabrer at perseus.unalmed.edu.co> writes:
...
The intercept? What if you use X2 <- cbind(1, X), or y~x1+x2+x3+x4+x5-1?

#
Kenneth Cabrera <krcabrer at perseus.unalmed.edu.co> writes:
If your model formula is y~x1+x2+x3+x4+x5 then the model matrix would
have a column of 1's followed by what you wrote.  Try defining X as
 X <- model.matrix(y~x1+x2+x3+x4+x5, data = <whatever>)
and re-doing your calculation.
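A sketch of that recipe, using a hypothetical data frame `dat` in place of the poster's data (the values are simulated, not his):

```r
# Sketch: model.matrix() builds the design matrix from the formula,
# so the intercept column cannot be forgotten.  'dat' is a made-up
# stand-in for the poster's data frame.
set.seed(2)
dat <- as.data.frame(matrix(rnorm(16 * 6), 16, 6))
names(dat) <- c("y", "x1", "x2", "x3", "x4", "x5")
X  <- model.matrix(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
h  <- diag(X %*% solve(crossprod(X)) %*% t(X))  # crossprod(X) == t(X) %*% X
mt <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
all.equal(unname(h), unname(lm.influence(mt)$hat))  # TRUE
```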

In general, please note that

    diag(X %*% solve(t(X) %*% X) %*% t(X))
is a *very* bad way of calculating the influences.  Whenever you find
yourself writing something like solve(t(X)%*%X) you're doing the
numerical linear algebra badly.  The rule of thumb in numerical
linear algebra is that if the best way you can think of doing a
calculation involves taking the inverse of a matrix then you need to
think more about the calculation.

The way that lm.influence calculates the influences is through an
orthogonal-triangular decomposition which will be both faster and more
accurate than what you have written.

On today's machines "faster" doesn't matter that much for small or
moderate sized data sets.  On very large data sets faster still does
matter. "Accurate" should always be a concern.
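The QR idea can be sketched as follows (simulated data; this mirrors, rather than reproduces, what lm.influence does internally): if X = QR with Q having orthonormal columns, then H = QQ', so diag(H) is just the row sums of the squared entries of Q, and no matrix is ever inverted.

```r
# Sketch of the QR route on simulated data: no explicit inverse needed.
set.seed(3)
X <- cbind(1, matrix(rnorm(16 * 5), 16, 5))
Q <- qr.Q(qr(X))                 # 16 x 6, orthonormal columns
h_qr    <- rowSums(Q^2)          # diag(Q %*% t(Q))
h_naive <- diag(X %*% solve(t(X) %*% X) %*% t(X))
all.equal(h_qr, h_naive)         # TRUE here; the QR route is the stable one
```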