[Rcpp-devel] performance profile from Rcpp or RcppArmadillo code? Matrix subviews, etc

I received a big pile of Rcpp code from some folks and am using it as
a way to learn about Armadillo and Rcpp.

From gazing at the code, and guessing about style and usage, I some
ways to make a program run faster.  But it is all by guessing and
testing.

I wish I could get a profile (via gcc) to find out where the Rcpp part
is spending its time.

Has anybody made a list of "dos and don'ts" for making faster
RcppArmadillo code?

Mostly the usual C++'s dos and dont's
So far, I'm absolutely sure that RcppArmadillo follows many of the
usual rules about fast code. Don't allocate huge matrices over and
over for no good reason, for example. Armadillo has the delayed
processing feature that automatically accelerates some calculations.
That's quite amazing to me.
I think you are talking about the expression template feature of Armadillo. It does not accelerate things, just avoid non-necessary duplication of calculations.
But some of the practical details of writing code are uncertain. In
particular, I wonder about the slowdown in making calculations on
submatrices and the relative penalty you pay for re-calculating things
versus allocating space to store them. (I once saw a demonstration
that it is faster to recalculate multiplication problems than it is to
store the results in some cases....)
It obviously depends on how often you need to access the results an how long it take to calculate it. Modern CPUs are capable of compute one multiplication and one addition in one cycle. In the case of linear algebra, four floats or two doubles can be done in with one SIMD instruction. In contrast, accessing the results can take up to hundreds of cycles if a cache miss happens. So if you only access the calculated results a few times, it is not allocated on stack, and the calculation is relatively fast, then recalculation can be faster than store and access later.

In contrast, say your calculation results in a scalar or small fixed size matrix, then storing it in automatic memory (on stack) and accessing it  later is almost surely faster than calculating it again if the later accessing happens close to the site of storage.

For more advices, I would suggest Agner's optimization manual (http://www.agner.org/optimize/), just the first section is relevant to most practitioners.
Here are some of the things I would like to read about. I have noticed
in R that accessing subsections of a matrix is slow, compared to
analyzing the whole thing. For example. It is SLOWER to repeatedly
work on a subsection of a matrix x than it is to extract that piece
one time and work on the copy. I'm pasting in the R code to
demonstrate this at the end of this message.

Does Armadillo follow that same logic, so that repeatedly accessing a
subview of a matrix is slower than creating a new copy of the
submatrix and then working on it?
It really depends on how the subview is accessed. Two main factors that affect performance is if the matrix is accessed linearly and if the accessing to subview results in a temporary copy of it. The later is even worse than the former as it leads to free store. Matrix is really only a one dimensional array in disguise. And dynamic array is really pointers. Armadillo is mostly smart in avoiding temporary copies, thanks to expression template. But when it cannot determine that it is safe to avoid it, due to possible pointer aliasing, it does make copies. Non-contiguous access is slow because of inefficient use of cache and shall be avoided. There should be no performance difference when access a contiguous block (whole columns) of a matrix when compared to accessing a whole matrix. They are exactly same, accessed through a pointer linearly. When accessed non-linearly, the worst case is cache contention, which will be very slow.

In your example below, what you wanted to access are two rows, g(k1, k2), etc. Things cannot be much worse than this if they need to be accessed repeated even a temporary copy is not created. If you need to access a block row by row, then consider using row major storage.

Yan
Along those lines, has the performance of the Armadillo non-contiguous
matrix access been assessed? Are we well advised to just allocate new
matrices and access the non-contiguous subviews one time, than to
access them over and over?

Given a matrix g and vectors k1 and k2 that contain row numbers, we
want to do math on these sub matrices.

g(k1, k1)
g(k1, k2)
g(k2, k2)

I am *guessing* that repeated access to a non-contiguous matrix
subview is slower than accessing a contiguous block out of the matrix.
And accessing a whole matrix would be faster still. So maybe a person
ought to allocate.

arma::mat g11 = g(k1, k1);

rather than grabbing g(k1, k1) over and over. Know what I mean?

Oh, well. here's the example about R submatrix access

## fasterR-matrix-1.R
## Paul Johnson
## 2012-12-23

nReps <- 5000
nVars <- 100
nRows <- 200

## fn1 with elements is the fastest case scenario
##
fn1 <- function(x){
   solve(x)
}

## extract 2:(nVars+1) rows and columns, returns "anonymously"
fn2 <- function(x){
   y <- x[2:(nVars+1), 2:(nVars+1)]
   solve(y)
}

## names new matrix to return
fn3 <- function(x){
   y <- x[2:(nVars+1), 2:(nVars+1)]
   yinv <- solve(y)
}

## manufacture new matrix, copy results to it
fn4 <- function(x){
   y <- x[2:(nVars+1), 2:(nVars+1)]
   yinv <- solve(y)
   z <- matrix(0, nVars, nVars)
   z <- yinv
}

## manufacuture bigger matrix, copy results into part of it
fn5 <- function(x){
   y <- x[2:(nVars+1), 2:(nVars+1)]
   yinv <- solve(y)
   z <- matrix(0, (nVars+1), (nVars+1))
   z[2:(nVars+1), 2:(nVars+1)] <- yinv
}

## Now make up test data.
## Manufacture symmetric matrices
Xlist1 <- vector("list", nReps)
for(i in 1:nReps){
   x <- matrix(rnorm(nRows * nVars), nRows, nVars)
   Xlist1[[i]] <- crossprod(x)
}

## create padded theta-like matrix
Xlist2 <- vector("list", nReps)
for(i in 1:nReps){
   y <- Xlist1[[i]]
   y <- rbind(c(rep(1, nVars)), y)
   y <- cbind(c(-1, rep(1,nVars)), y)
   Xlist2[[i]] <- y
}

system.time(
   fn1l <- lapply(Xlist1, fn1)
   )

system.time(
   fn2l <- lapply(Xlist2, fn2)
   )

system.time(
   fn3l <- lapply(Xlist2, fn3)
   )

system.time(
   fn4l <- lapply(Xlist2, fn4)
   )

system.time(
   fn5l <- lapply(Xlist2, fn5)
   )

## Conclusion. It IS slower to repeatedly extract rows & columns 2:(nVars+1)
## than to just work with a whole matrix of nVars x nVars values.

-- 
Paul E. Johnson
Professor, Political Science      Assoc. Director
1541 Lilac Lane, Room 504      Center for Research Methods
University of Kansas                 University of Kansas
http://pj.freefaculty.org               http://quant.ku.edu
_______________________________________________
Rcpp-devel mailing list
Rcpp-devel at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel

[Rcpp-devel] performance profile from Rcpp or RcppArmadillo code? Matrix subviews, etc

Thread (2 messages)