
[Rcpp-devel] performance profile from Rcpp or RcppArmadillo code? Matrix subviews, etc

On Jan 6, 2013, at 2:50 AM, Paul Johnson <pauljohn32 at gmail.com> wrote:

            
Mostly the usual C++ dos and don'ts.
I think you are talking about the expression template feature of Armadillo. It does not accelerate things by itself; it just avoids unnecessary duplication of calculations.
It obviously depends on how often you need to access the results and how long it takes to calculate them. Modern CPUs are capable of computing one multiplication and one addition in one cycle. In the case of linear algebra, four floats or two doubles can be processed with one SIMD instruction. In contrast, accessing the results can take up to hundreds of cycles if a cache miss happens. So if you only access the calculated result a few times, it is not allocated on the stack, and the calculation is relatively fast, then recalculating can be faster than storing the result and accessing it later.

In contrast, if your calculation results in a scalar or a small fixed-size matrix, then storing it in automatic memory (on the stack) and accessing it later is almost surely faster than calculating it again, provided the later access happens close to where the result was stored.

For more advice, I would suggest Agner's optimization manual (http://www.agner.org/optimize/); just the first section is relevant to most practitioners.
It really depends on how the subview is accessed. The two main factors that affect performance are whether the matrix is accessed linearly and whether accessing the subview results in a temporary copy of it. The latter is even worse than the former, as it leads to an allocation on the free store. A matrix is really only a one-dimensional array in disguise, and a dynamic array is really just a pointer. Armadillo is mostly smart in avoiding temporary copies, thanks to expression templates. But when it cannot determine that it is safe to avoid a copy, due to possible pointer aliasing, it does make one. Non-contiguous access is slow because it uses the cache inefficiently, and it should be avoided. There should be no performance difference between accessing a contiguous block (whole columns) of a matrix and accessing the whole matrix. They are exactly the same: both are accessed linearly through a pointer. When a matrix is accessed non-linearly, the worst case is cache contention, which will be very slow.
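To make the "one-dimensional array in disguise" point concrete, here is a minimal sketch in plain C++ (no Armadillo) of a column-major matrix, which is the layout R and Armadillo use. The names ColMajor, sum_cols and sum_rows are made up for illustration and are not part of any library. A block of whole columns is a single contiguous range, while walking along a row jumps n_rows elements at every step.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical column-major n_rows x n_cols matrix kept in one flat buffer.
// Element (i, j) lives at index i + j * n_rows.
struct ColMajor {
    std::size_t n_rows;
    std::size_t n_cols;
    std::vector<double> data;

    ColMajor(std::size_t r, std::size_t c) : n_rows(r), n_cols(c), data(r * c, 0.0) {}

    double& at(std::size_t i, std::size_t j) { return data[i + j * n_rows]; }
    double at(std::size_t i, std::size_t j) const { return data[i + j * n_rows]; }
};

// Sum of whole columns c0..c1 (inclusive): one contiguous range, stride 1.
double sum_cols(const ColMajor& m, std::size_t c0, std::size_t c1) {
    const double* p = m.data.data() + c0 * m.n_rows;
    const std::size_t n = (c1 - c0 + 1) * m.n_rows;
    double s = 0.0;
    for (std::size_t k = 0; k < n; ++k) s += p[k];
    return s;
}

// Sum of rows r0..r1 (inclusive): neighbouring elements of a row are
// n_rows doubles apart in memory, so every step is a strided jump.
double sum_rows(const ColMajor& m, std::size_t r0, std::size_t r1) {
    double s = 0.0;
    for (std::size_t i = r0; i <= r1; ++i)
        for (std::size_t j = 0; j < m.n_cols; ++j)
            s += m.at(i, j);
    return s;
}
```

For a large matrix, sum_cols touches each cache line once, whereas sum_rows may touch a different cache line for every element it reads. How big the difference is depends on the matrix size and the machine, so it is worth timing on your own data.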

In your example below, what you want to access are two rows, g(k1, k2), etc. Things cannot get much worse than this if the rows need to be accessed repeatedly, even if no temporary copy is created. If you need to access a block row by row, then consider using row-major storage.
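A rough sketch of what row-major storage buys you, again in plain C++ with made-up helper names (to_row_major and sum_row_block are not library functions). The data is reordered once, and after that a block of rows is one contiguous range. With a column-major library, keeping the transpose of the matrix and reading its columns has the same effect.

```cpp
#include <cstddef>
#include <vector>

// Reorder a column-major n_rows x n_cols buffer into row-major order.
// Source index: i + j * n_rows. Destination index: i * n_cols + j.
std::vector<double> to_row_major(const std::vector<double>& col_major,
                                 std::size_t n_rows, std::size_t n_cols) {
    std::vector<double> out(n_rows * n_cols);
    for (std::size_t j = 0; j < n_cols; ++j)
        for (std::size_t i = 0; i < n_rows; ++i)
            out[i * n_cols + j] = col_major[i + j * n_rows];
    return out;
}

// Sum of rows r0..r1 (inclusive) of a row-major buffer: the rows sit
// back to back, so the whole block is read with stride 1.
double sum_row_block(const std::vector<double>& row_major, std::size_t n_cols,
                     std::size_t r0, std::size_t r1) {
    const double* p = row_major.data() + r0 * n_cols;
    const std::size_t n = (r1 - r0 + 1) * n_cols;
    double s = 0.0;
    for (std::size_t k = 0; k < n; ++k) s += p[k];
    return s;
}
```

The one-time reordering is itself a strided pass over the data, so it only pays off when the rows are read many times afterwards.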

Yan