-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Thursday, July 20, 2006 8:56 AM
To: Doran, Harold
Cc: r-help at stat.math.ethz.ch; Phil Spector
Subject: Re: Timing benefits of mapply() vs. for loop was: [R] Wrap a loop inside a function
Note that if you use mapply in the way I suggested, which is not
the same as in your post, then it's just as fast. (Also, the version
of mapply in your post gives different numerical results than the
for loop, whereas mine gives the same.) like.mat is the for loop
version, like.mat2 is your mapply version, and like.mat3 is my
mapply version.
like.mat <- function(score, items, theta){
+ like.mat <- matrix(numeric(length(items) * length(theta)), ncol =
+ length(theta))
+ for(i in 1:length(items)) like.mat[i, ] <- pcm(theta, items[[i]], score[[i]])
+ like.mat
+ }
system.time(for(i in 1:100) like.mat(score,items,theta))
like.mat2 <- function(score, items, theta)
+ matrix(mapply(pcm, rep(theta,length(items)), items, score),
+ ncol = length(theta), byrow = TRUE)
system.time(for(i in 1:100) like.mat2(score,items,theta))
all.equal(like.mat(score, items, theta), like.mat2(score, items, theta))
[1] "Mean relative difference: 1.268095"
like.mat3 <- function(score, items, theta)
+ t(mapply(pcm, items, score, MoreArgs = list(theta = theta)))
system.time(for(i in 1:100) like.mat3(score,items,theta))
m3 <- like.mat3(score, items, theta)
dimnames(m3) <- NULL
all.equal(like.mat(score, items, theta), m3)
[1] TRUE
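The reason the numbers differ: in like.mat2, rep(theta, length(items))
is walked elementwise against the recycled items and score, so pcm
receives scalar theta values paired with items that fall out of step
after the first cycle. In like.mat3, MoreArgs hands the whole theta
vector to every call, one call per item, exactly as the for loop does.
A minimal sketch with a toy function f (used here only to make the
pairings visible, not part of the problem above):
f <- function(x, y) paste(x, y, sep = "-")
x <- 1:3          # stands in for theta
y <- c("a", "b")  # stands in for items
# elementwise recycling: x and y are walked in lockstep, so the
# pairings interleave rather than grouping by y
mapply(f, rep(x, length(y)), y)
[1] "1-a" "2-b" "3-a" "1-b" "2-a" "3-b"
# MoreArgs: all of x goes to every call, one call per element of y,
# which is what the explicit loop computes
mapply(f, y, MoreArgs = list(x = x))
     a     b
[1,] "1-a" "1-b"
[2,] "2-a" "2-b"
[3,] "3-a" "3-b"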
On 7/20/06, Doran, Harold <HDoran at air.org> wrote:
List:
Thank you for the replies to my post yesterday. Gabor and Phil also
gave useful replies on how to improve the function by relying on
mapply rather than the explicit for loop. In general, I try to use
the family of apply functions rather than looping constructs such
as for and while as a matter of practice.
However, it seems the mapply function in this case is slower (in
terms of CPU speed) than the for loop. Here is an example.
# data needed for example
items <- list(item1 = c(0,1,2), item2 = c(0,1), item3 = c(0,1,2,3),
item4 = c(0,1), item5 = c(0,1,2,3,4),
item6 = c(0,1,2,3))
score <- c(2,1,3,1,3,2)
theta <- c(-1,-.5,0,.5,1)
# My old function using the for loop
like.mat <- function(score, items, theta){
like.mat <- matrix(numeric(length(items) * length(theta)), ncol =
length(theta))
for(i in 1:length(items)) like.mat[i, ] <- pcm(theta, items[[i]],
score[[i]])
like.mat
}
system.time(like.mat(score,items,theta))
[1] 0 0 0 NA NA
# Revised using mapply
like.mat <- function(score, items, theta){
matrix(mapply(pcm,rep(theta,length(items)),items,score),ncol=length(theta),byrow=TRUE)
}
system.time(like.mat(score,items,theta))
[1] 0.03 0.00 0.03 NA NA
It is obviously slower to use mapply, but only nominally. So, now
look at this within the context of the full program I am working on.
For context, I am evaluating an integral using Gaussian quadrature.
This is a psychometric problem where the function 'pcm' is Masters'
partial credit model and 'score' is the student's score on item i.
When an item has two categories (0,1), pcm reduces to the Rasch
model for dichotomous data (a quick check of this reduction appears
after the program listing below).
The dnorm is set at N(0,1) by default, but the parameters of the
population distribution are estimated from a separate procedure and
are normally input into the function; the default is used for this
example.
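Before the listing, a minimal sketch of the quadrature idea (an
illustration, not part of the program): gauss.quad.prob from statmod
returns nodes and weights such that sum(weights * f(nodes))
approximates E[f(theta)] under the chosen distribution. With
f(theta) = theta^2 under N(0,1) the answer should be the variance, 1.
library(statmod)
# 49-point quadrature rule for the N(0,1) weight function
gq <- gauss.quad.prob(49, dist = 'normal', mu = 0, sigma = 1)
# the weighted sum of nodes^2 approximates E[theta^2] = 1
sum(gq$weights * gq$nodes^2)
[1] 1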
Here is the full program.
library(statmod)
# Masters' partial credit model: probability of the observed score
# category for one item, evaluated at each value of theta
pcm <- function(theta,d,score){
exp(rowSums(outer(theta,d[1:score],'-')))/
apply(exp(apply(outer(theta,d, '-'), 1, cumsum)), 2, sum)
}
like.mat <- function(score, items, theta){
like.mat <- matrix(numeric(length(items) * length(theta)), ncol =
length(theta))
for(i in 1:length(items)) like.mat[i, ] <- pcm(theta, items[[i]],
score[[i]])
like.mat
}
# turn this off for now
#like.mat <- function(score, items, theta){
#matrix(mapply(pcm,rep(theta,length(items)),items,score),ncol=length(theta),byrow=TRUE)
#}
class.numer <- function(score, items, prof_cut, mu=0, sigma=1, aboveQ=TRUE){
gauss_numer <- gauss.quad(49,kind="laguerre")
if(aboveQ==FALSE){
mat <- rbind(like.mat(score,items,
(prof_cut-gauss_numer$nodes)), dnorm(prof_cut-gauss_numer$nodes,
mean=mu, sd=sigma))
} else { mat <- rbind(like.mat(score,items,
(gauss_numer$nodes+prof_cut)), dnorm(gauss_numer$nodes+prof_cut,
mean=mu,
sd=sigma))
}
f_y <- rbind(apply(mat, 2, prod), exp(gauss_numer$nodes),
gauss_numer$weights)
sum(apply(f_y,2,prod))
}
class.denom <- function(score,items, mu=0, sigma=1){
gauss_denom <- gauss.quad.prob(49, dist='normal', mu=mu, sigma=sigma)
mat <- rbind(like.mat(score,items,gauss_denom$nodes), gauss_denom$weights)
sum(apply(mat, 2, prod))
}
class.acc <- function(score, items, prof_cut, mu=0, sigma=1, aboveQ=TRUE){
result <- class.numer(score,items,prof_cut, mu,sigma,
aboveQ)/class.denom(score,items, mu, sigma)
return(result)
}
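As a quick check of the Rasch reduction mentioned above (again an
illustration, not part of the program): for a dichotomous item with
thresholds c(0, b), the PCM probability of a correct response
(score = 2) works out to exp(theta - b)/(1 + exp(theta - b)), which
is plogis(theta - b). The value b = 0.3 is arbitrary.
b <- 0.3
theta <- c(-1, -.5, 0, .5, 1)
all.equal(pcm(theta, c(0, b), 2), plogis(theta - b))
[1] TRUE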
# Test the function "class.acc"
items <- list(item1 = c(0,1,2), item2 = c(0,1), item3 = c(0,1,2,3),
item4 = c(0,1), item5 = c(0,1,2,3,4),
item6 = c(0,1,2,3))
score <- c(2,1,3,1,3,2)
# This is the system time when using the for loop for the like.mat function
system.time(class.acc(score,items,1,aboveQ=T))
[1] 0.04 0.00 0.04 NA NA
# This is the system time when using mapply for the like.mat function
system.time(class.acc(score,items,1,aboveQ=T))
[1] 0.70 0.00 0.73 NA NA
There is a substantial improvement in CPU seconds when the for loop
is applied rather than the mapply function. I experimented with
adding more items and varying the quadrature points, and it always
turned out that the for loop was faster.
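A caveat on the timings: a single call at these speeds sits near the
resolution of the timer, so repeating the call (as Gabor did above)
gives steadier per-call numbers, e.g.:
# repeat the call to average out clock granularity; divide the
# elapsed time by reps for a per-call figure
reps <- 100
system.time(for (i in 1:reps) class.acc(score, items, 1, aboveQ = TRUE))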
Given this result, I wonder what advice might be offered. Is there
an inherent reason one might opt for the mapply function (such as
reliability, etc.) even when it compromises computational speed? Or
should the issue of computational speed be considered ahead of the
common advice to rely on the family of apply functions instead of
explicit loops?
Thanks for your consideration of my question, Harold
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          3.0
year           2006
month          04
day            24
svn rev        37909
language       R
version.string Version 2.3.0 (2006-04-24)