Matrix mulitplication

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-help/attachments/20040216/05033c40/attachment.pl
ABCD are four matrix.
A * Inverse((Transpose(A)*Tranpose(B)*B*A+C)) * Transpose(A) *
Transpose(B) * D

how to write in R in an efficient way?
The only "efficiency saving" I can see here is to evaluate transposes
only once:

  At <- t(A)
  Bt <- t(B)
  A%*%solve(At%*%Bt%*%B%*%A + C)%*%At%*%Bt%*%D

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 16-Feb-04                                       Time: 22:35:46
------------------------------ XFMail ------------------------------
On 16-Feb-04 Grace Conlon wrote:
ABCD are four matrix.
A * Inverse((Transpose(A)*Tranpose(B)*B*A+C)) * Transpose(A) *
Transpose(B) * D

how to write in R in an efficient way?
The only "efficiency saving" I can see here is to evaluate transposes
only once:

  At <- t(A)
  Bt <- t(B)
  A%*%solve(At%*%Bt%*%B%*%A + C)%*%At%*%Bt%*%D
Sorry! Missed a trick here:

    At <- t(A)
    Bt <- t(B)
    E  <- B%*%A
    Et <- t(E)
    A%*%solve(Et%*%E + C)%*%Et%*%D

(saves 2 multiplications at the relatively cheap cost of 1 transpose)

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 16-Feb-04                                       Time: 23:05:21
------------------------------ XFMail ------------------------------
(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:
Sorry! Missed a trick here:

    At <- t(A)
    Bt <- t(B)
    E  <- B%*%A
    Et <- t(E)
    A%*%solve(Et%*%E + C)%*%Et%*%D

(saves 2 multiplications at the relatively cheap cost of 1 transpose)
Well, you might consider getting rid of the first two transposes since
you're not actually using them for anything....
O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:

Sorry! Missed a trick here:

    At <- t(A)
    Bt <- t(B)
    E  <- B%*%A
    Et <- t(E)
    A%*%solve(Et%*%E + C)%*%Et%*%D

(saves 2 multiplications at the relatively cheap cost of 1 transpose)
Well, you might consider getting rid of the first two transposes since
you're not actually using them for anything....
Touch? ... !
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 16-Feb-04                                       Time: 23:41:57
------------------------------ XFMail ------------------------------
One can also use "crossprod" AND use "solve" to actually "solve" 
the system of linear equations rather than just get the inverse, which 
is later multiplied by t(BA)%*%D.  However, the difference seems very 
small: 

 > set.seed(1)
 > n <- 500
 > A <- array(rnorm(n^2), dim=c(n,n))
 > B <- array(rnorm(n^2), dim=c(n,n))
 > C. <- array(rnorm(n^2), dim=c(n,n))
 > D <- array(rnorm(n^2), dim=c(n,n))
 >
 > BA <- B%*%A
 >
 > start.time <- proc.time()
 > A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D
 > proc.time()-start.time
[1] 4.75 0.03 5.13   NA   NA
 >
 > start.time <- proc.time()
 > A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D))
 > proc.time()-start.time
[1] 4.19 0.01 4.49   NA   NA
 > all.equal(A1, A2)
[1] TRUE

      This was in R 1.8.1 under Windows 2000 on an IBM Thinkpad T30 with 
a Mobile Intel Pentium 4-M, 1.8Ghz, 1Gbyte RAM.  The same script under 
S-Plus 6.2 produced the following elapsed times: 

[1] 3.325 0.121 3.815 0.000 0.000

[1] 2.934 0.070 3.355 0.000 0.000

      Thus, roughly, using "crossprod" plus "solving with solve" gave a 
10% speed improvement and S-Plus 6.2 gave a 25% speed improvement in 
this one small benchmark.  Using smaller matrices did not show as big a 
difference just because the time was too short to measure accurately, I 
think. 

      Enough trivia for now. 
      Best Wishes,
      spencer graves

On 16-Feb-04 Peter Dalgaard wrote:

(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:

Sorry! Missed a trick here:

   At <- t(A)
   Bt <- t(B)
   E  <- B%*%A
   Et <- t(E)
   A%*%solve(Et%*%E + C)%*%Et%*%D

(saves 2 multiplications at the relatively cheap cost of 1 transpose)

Well, you might consider getting rid of the first two transposes since
you're not actually using them for anything....

Touch? ... !
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 16-Feb-04                                       Time: 23:41:57
------------------------------ XFMail ------------------------------

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Spencer Graves <spencer.graves at pdf.com> writes:
      One can also use "crossprod" AND use "solve" to actually "solve"
      the system of linear equations rather than just get the inverse,
      which is later multiplied by t(BA)%*%D.  However, the difference
      seems very small: 
Thanks for pointing that out Spencer.  I was about to do the same.
set.seed(1)

 > n <- 500
 > A <- array(rnorm(n^2), dim=c(n,n))
 > B <- array(rnorm(n^2), dim=c(n,n))
 > C. <- array(rnorm(n^2), dim=c(n,n))
 > D <- array(rnorm(n^2), dim=c(n,n))
 >
 > BA <- B%*%A
 >
 > start.time <- proc.time()
 > A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D
 > proc.time()-start.time
[1] 4.75 0.03 5.13   NA   NA
 >
 > start.time <- proc.time()
 > A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D))
 > proc.time()-start.time
[1] 4.19 0.01 4.49   NA   NA
A minor point on the methodology.  You can do this in one step as

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

Also, in R the second and subsequent timings tend to be a bit faster
than the first.  I think this is due to heap storage being allocated
the first time that large chunks of memory are used and not needing to
be allocated for subsequent uses.
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.78 0.09 0.87 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.71 0.05 0.76 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.79 0.08 0.87 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.72 0.04 0.76 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.52 0.07 0.59 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.53 0.06 0.59 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.56 0.03 0.59 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.54 0.05 0.59 0.00 0.00
 > all.equal(A1, A2)
[1] TRUE

      This was in R 1.8.1 under Windows 2000 on an IBM Thinkpad T30
      with a Mobile Intel Pentium 4-M, 1.8Ghz, 1Gbyte RAM.  The same
      script under S-Plus 6.2 produced the following elapsed times:
      [1] 3.325 0.121 3.815 0.000 0.000
This is using R-devel (to be 1.9.0) on a 2.0 GHz Pentium-4 desktop
computer running Linux and with Goto's BLAS.  I'm not sure exactly
which of the changes from your system are resulting in the much faster
execution time but it is definitely not all due to the processor speed.
My guess is that most of the gain is due to the optimized BLAS.
Goto's BLAS are a big win on a Pentium-4 under Linux.  (Thanks to
Brian Ripley for modifying the configure script for R to accept
--with-blas=-lgoto .)

Corresponding timings on a Athlon XP 2500+ (1.83 GHz) running Linux
with Atlas are
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 1.29 0.04 1.34 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.88 0.06 0.95 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.79 0.05 0.85 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.82 0.04 0.87 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.61 0.06 0.67 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.66 0.02 0.69 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.51 0.10 0.61 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.59 0.10 0.71 0.00 0.00

There you can see the faster execution of the second and subsequent
timings.

I completely agree with you that using crossprod and the non-inverse
form of solve, where appropriate, helps.  However, one of the best
optimizations for numerical linear algebra calculations is the use of
optimized BLAS.  (I will avoid going in to the Linux vs Windows
comparisons :-)
Dear Doug: 

      Thanks for pointing out "system.time".  I considered using that 
but didn't because it doesn't work under S-Plus 6.2.  I could write my 
own, but ... . 

      Regarding Gabor Grothendieck suggestion to use the 
Sherman-Morrison-Woodbury formula, this can also be used in recursive 
computations, and often is in Kalman filtering and other applications 
where BA is of reduced dimensionality. 

      Best Wishes,
      spencer graves

Spencer Graves <spencer.graves at pdf.com> writes:

     One can also use "crossprod" AND use "solve" to actually "solve"
     the system of linear equations rather than just get the inverse,
     which is later multiplied by t(BA)%*%D.  However, the difference
     seems very small: 

Thanks for pointing that out Spencer.  I was about to do the same.

set.seed(1)

n <- 500
A <- array(rnorm(n^2), dim=c(n,n))
B <- array(rnorm(n^2), dim=c(n,n))
C. <- array(rnorm(n^2), dim=c(n,n))
D <- array(rnorm(n^2), dim=c(n,n))

BA <- B%*%A

start.time <- proc.time()
A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D
proc.time()-start.time
[1] 4.75 0.03 5.13   NA   NA
start.time <- proc.time()
A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D))
proc.time()-start.time
[1] 4.19 0.01 4.49   NA   NA

A minor point on the methodology.  You can do this in one step as

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

Also, in R the second and subsequent timings tend to be a bit faster
than the first.  I think this is due to heap storage being allocated
the first time that large chunks of memory are used and not needing to
be allocated for subsequent uses.

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)

[1] 0.78 0.09 0.87 0.00 0.00

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)

[1] 0.71 0.05 0.76 0.00 0.00

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)

[1] 0.79 0.08 0.87 0.00 0.00

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)

[1] 0.72 0.04 0.76 0.00 0.00

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

[1] 0.52 0.07 0.59 0.00 0.00

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

[1] 0.53 0.06 0.59 0.00 0.00

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

[1] 0.56 0.03 0.59 0.00 0.00

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

[1] 0.54 0.05 0.59 0.00 0.00

all.equal(A1, A2)
[1] TRUE

     This was in R 1.8.1 under Windows 2000 on an IBM Thinkpad T30
     with a Mobile Intel Pentium 4-M, 1.8Ghz, 1Gbyte RAM.  The same
     script under S-Plus 6.2 produced the following elapsed times:
     [1] 3.325 0.121 3.815 0.000 0.000

This is using R-devel (to be 1.9.0) on a 2.0 GHz Pentium-4 desktop
computer running Linux and with Goto's BLAS.  I'm not sure exactly
which of the changes from your system are resulting in the much faster
execution time but it is definitely not all due to the processor speed.
My guess is that most of the gain is due to the optimized BLAS.
Goto's BLAS are a big win on a Pentium-4 under Linux.  (Thanks to
Brian Ripley for modifying the configure script for R to accept
--with-blas=-lgoto .)

Corresponding timings on a Athlon XP 2500+ (1.83 GHz) running Linux
with Atlas are

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)

[1] 1.29 0.04 1.34 0.00 0.00

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)

[1] 0.88 0.06 0.95 0.00 0.00

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)

[1] 0.79 0.05 0.85 0.00 0.00

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)

[1] 0.82 0.04 0.87 0.00 0.00

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

[1] 0.61 0.06 0.67 0.00 0.00

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

[1] 0.66 0.02 0.69 0.00 0.00

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

[1] 0.51 0.10 0.61 0.00 0.00

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

[1] 0.59 0.10 0.71 0.00 0.00

There you can see the faster execution of the second and subsequent
timings.

I completely agree with you that using crossprod and the non-inverse
form of solve, where appropriate, helps.  However, one of the best
optimizations for numerical linear algebra calculations is the use of
optimized BLAS.  (I will avoid going in to the Linux vs Windows
comparisons :-)

For the record, the next version of R will allow Goto's BLAS to be used 
under Windows for real (but not complex) calculations.

For fair timings, you want to run gc() before starting the run, or you 
end up paying to clear up the current state of the session.  That is a 
large part of the difference in the first run.  Another issue is that the 
first call to LAPACK in a session (not used here) has to load the DLL.

Spencer Graves <spencer.graves at pdf.com> writes:

      One can also use "crossprod" AND use "solve" to actually "solve"
      the system of linear equations rather than just get the inverse,
      which is later multiplied by t(BA)%*%D.  However, the difference
      seems very small: 
Thanks for pointing that out Spencer.  I was about to do the same.

set.seed(1)

 > n <- 500
 > A <- array(rnorm(n^2), dim=c(n,n))
 > B <- array(rnorm(n^2), dim=c(n,n))
 > C. <- array(rnorm(n^2), dim=c(n,n))
 > D <- array(rnorm(n^2), dim=c(n,n))
 >
 > BA <- B%*%A
 >
 > start.time <- proc.time()
 > A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D
 > proc.time()-start.time
[1] 4.75 0.03 5.13   NA   NA
 >
 > start.time <- proc.time()
 > A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D))
 > proc.time()-start.time
[1] 4.19 0.01 4.49   NA   NA
A minor point on the methodology.  You can do this in one step as

system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))

Also, in R the second and subsequent timings tend to be a bit faster
than the first.  I think this is due to heap storage being allocated
the first time that large chunks of memory are used and not needing to
be allocated for subsequent uses.
I don't think so.  As I understand it, large objects are allocated
separately (not really out the the heap as it used to be), and the storage
is not reused by R (but it may well be by the malloc system.).
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.78 0.09 0.87 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.71 0.05 0.76 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.79 0.08 0.87 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.72 0.04 0.76 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.52 0.07 0.59 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.53 0.06 0.59 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.56 0.03 0.59 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.54 0.05 0.59 0.00 0.00

 > all.equal(A1, A2)
[1] TRUE

      This was in R 1.8.1 under Windows 2000 on an IBM Thinkpad T30
      with a Mobile Intel Pentium 4-M, 1.8Ghz, 1Gbyte RAM.  The same
      script under S-Plus 6.2 produced the following elapsed times:
      [1] 3.325 0.121 3.815 0.000 0.000
This is using R-devel (to be 1.9.0) on a 2.0 GHz Pentium-4 desktop
computer running Linux and with Goto's BLAS.  I'm not sure exactly
which of the changes from your system are resulting in the much faster
execution time but it is definitely not all due to the processor speed.
My guess is that most of the gain is due to the optimized BLAS.
Goto's BLAS are a big win on a Pentium-4 under Linux.  (Thanks to
Brian Ripley for modifying the configure script for R to accept
--with-blas=-lgoto .)

Corresponding timings on a Athlon XP 2500+ (1.83 GHz) running Linux
with Atlas are

system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 1.29 0.04 1.34 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.88 0.06 0.95 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.79 0.05 0.85 0.00 0.00
system.time(A1 <- A%*%solve(t(BA)%*%BA+C.)%*%BA%*%D)
[1] 0.82 0.04 0.87 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.61 0.06 0.67 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.66 0.02 0.69 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.51 0.10 0.61 0.00 0.00
system.time(A2 <- A%*%solve(crossprod(BA)+C., crossprod(t(BA), D)))
[1] 0.59 0.10 0.71 0.00 0.00

There you can see the faster execution of the second and subsequent
timings.

I completely agree with you that using crossprod and the non-inverse
form of solve, where appropriate, helps.  However, one of the best
optimizations for numerical linear algebra calculations is the use of
optimized BLAS.  (I will avoid going in to the Linux vs Windows
comparisons :-)

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595