
Diag "Hat" matrix

4 messages · Kenneth Cabrera, Brian Ripley, Peter Dalgaard +1 more

#
Hi R users:

What is the difference between the diagonal of the "hat" matrix as
computed by lm.influence() and as computed directly with the matrix
operations solve() and t()?

I mean, this is my X matrix

        x1    x2    x3    x4     x5
 [1,] 0.297 0.310 0.290 0.220 0.1560
 [2,] 0.360 0.390 0.369 0.297 0.2050
 [3,] 0.075 0.058 0.047 0.034 0.0230
 [4,] 0.114 0.100 0.081 0.058 0.0420
 [5,] 0.229 0.213 0.198 0.142 0.0102
 [6,] 0.315 0.304 0.267 0.202 0.1470
 [7,] 0.477 0.518 0.496 0.395 0.2850
 [8,] 0.072 0.063 0.047 0.036 0.0240
 [9,] 0.099 0.092 0.074 0.056 0.0038
[10,] 0.420 0.452 0.425 0.332 0.2350
[11,] 0.189 0.178 0.153 0.107 0.0760
[12,] 0.369 0.391 0.364 0.286 0.2000
[13,] 0.142 0.124 0.105 0.077 0.0560
[14,] 0.094 0.087 0.072 0.049 0.0320
[15,] 0.171 0.161 0.145 0.094 0.0680
[16,] 0.378 0.420 0.380 0.281 0.2000

If I use:
diag(X%*%solve(t(X)%*%X)%*%t(X))
I obtain:
 [1] 0.15248181 0.27102872 0.11476375 0.12941386 0.90455886 0.32246292
 [7] 0.43858581 0.16533854 0.37415984 0.19100227 0.17023090 0.15125134
[13] 0.17855019 0.06023773 0.52137996 0.85455350

But when I use the lm.influence() function
lm.influence(mt)$hat
I obtain:
 [1] 0.1735989 0.2999146 0.2334095 0.1455117 0.9216644 0.7553856 0.4486403
 [8] 0.2755802 0.4188349 0.1914242 0.1790093 0.1573939 0.1787553 0.1975511
[15] 0.5664988 0.8568274
mt is a model of the type y~x1+x2+x3+x4+x5, where y is:
y
[1]  17  17  35  69  69 173 173  17  17  73  17  35  69  35  35  52

As you can see, the differences are not small.
Where is the problem? Is it only a numerical stability problem?

Thank you very much for your help


Kenneth Cabrera
krcabrer at perseus.unalmed.edu.co
krcabrer at epm.net.co
Universidad Nacional de Colombia
Sede Medellin

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
On Thu, 7 Jun 2001, Kenneth Cabrera wrote:

None.  You have forgotten the intercept: the design matrix for

y~x1+x2+x3+x4+x5

has a first column of ones.  Try using cbind(1, X).
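The fix can be checked directly. A minimal sketch with simulated data (the X and y below are made-up stand-ins, not the poster's values): once the column of ones is prepended, the explicit formula matches lm.influence().

```r
# Sketch with simulated data (not the original poster's X and y):
# prepending the intercept column makes the explicit hat-matrix
# formula agree with lm.influence().
set.seed(1)
X   <- matrix(rnorm(16 * 5), 16, 5)
y   <- rnorm(16)
X1  <- cbind(1, X)                                 # add the intercept column
h   <- diag(X1 %*% solve(t(X1) %*% X1) %*% t(X1))  # explicit formula
fit <- lm(y ~ X)                                   # intercept is implicit
all.equal(unname(h), unname(lm.influence(fit)$hat))  # TRUE
```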

[...]
#
Kenneth Cabrera <krcabrer at perseus.unalmed.edu.co> writes:
...
The intercept? What if you use X2 <- cbind(1, X), or y~x1+x2+x3+x4+x5-1?

#
Kenneth Cabrera <krcabrer at perseus.unalmed.edu.co> writes:
If your model formula is y~x1+x2+x3+x4+x5 then the model matrix would
have a column of 1's followed by what you wrote.  Try defining X as
 X <- model.matrix(y~x1+x2+x3+x4+x5, data = <whatever>)
and re-doing your calculation.
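A sketch of that recipe, using a hypothetical data frame `dat` in place of the poster's data (the values are simulated, not his):

```r
# Sketch: model.matrix() builds the design matrix from the formula,
# so the intercept column cannot be forgotten.  'dat' is a made-up
# stand-in for the poster's data frame.
set.seed(2)
dat <- as.data.frame(matrix(rnorm(16 * 6), 16, 6))
names(dat) <- c("y", "x1", "x2", "x3", "x4", "x5")
X  <- model.matrix(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
h  <- diag(X %*% solve(crossprod(X)) %*% t(X))  # crossprod(X) == t(X) %*% X
mt <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
all.equal(unname(h), unname(lm.influence(mt)$hat))  # TRUE
```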

In general, please note that

    diag(X %*% solve(t(X) %*% X) %*% t(X))
is a *very* bad way of calculating the influences.  Whenever you find
yourself writing something like solve(t(X)%*%X) you're doing the
numerical linear algebra badly.  The rule of thumb in numerical
linear algebra is that if the best way you can think of doing a
calculation involves taking the inverse of a matrix then you need to
think more about the calculation.

The way that lm.influence calculates the influences is through an
orthogonal-triangular decomposition which will be both faster and more
accurate than what you have written.

On today's machines "faster" doesn't matter that much for small or
moderate sized data sets.  On very large data sets faster still does
matter. "Accurate" should always be a concern.
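The QR idea can be sketched as follows (simulated data; this mirrors, rather than reproduces, what lm.influence does internally): if X = QR with Q having orthonormal columns, then H = QQ', so diag(H) is just the row sums of the squared entries of Q, and no matrix is ever inverted.

```r
# Sketch of the QR route on simulated data: no explicit inverse needed.
set.seed(3)
X <- cbind(1, matrix(rnorm(16 * 5), 16, 5))
Q <- qr.Q(qr(X))                 # 16 x 6, orthonormal columns
h_qr    <- rowSums(Q^2)          # diag(Q %*% t(Q))
h_naive <- diag(X %*% solve(t(X) %*% X) %*% t(X))
all.equal(h_qr, h_naive)         # TRUE here; the QR route is the stable one
```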