Skip to content

Vectorizing a for-loop for cross-validation in R

4 messages · Aleksandre Gavashelishvili, Charles C. Berry, Eric Berger

#
I'm trying to speed up a script that otherwise takes days to handle larger
data sets. So, is there a way to completely vectorize or paralellize the
following script:

                *# k-fold cross validation*

df <- trees # a data frame 'trees' from R.
df <- df[sample(nrow(df)), ] # randomly shuffles the data.
k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
validation.
folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
unique numbers for k equally size folds.
df$ID <- folds # adds fold IDs.
df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1"
"pred2" "pred3" to speed up the following loop.

library(mgcv)

for(i in 1:k) {
  # looping for different models:
  m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
  m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
  m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))

  # looping for predictions:
  df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response")
  df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response")
  df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response")
}

# calculating residuals:
df$res1 <- with(df, Volume - pred1)
df$res2 <- with(df, Volume - pred2)
df$res3 <- with(df, Volume - pred3)

Model <- paste("m", 1:3, sep="") # creates a vector of model names.

# creating a vector of mean-square errors (MSE):
MSE <- with(df, c(
  sum(res1^2) / nrow(df),
  sum(res2^2) / nrow(df),
  sum(res3^2) / nrow(df)
))

model.mse <- data.frame(Model, MSE) # creates a data frame of model names
and mean-square errors.
model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous
data frame in order of increasing mean-square errors.

I'd appreciate any help. This code takes several days if run on >=30,000
different GAM models and 3 predictors. Could you please help with
re-writing the script into sapply() or foreach()/doParallel format?

Thanks
Lexo
#
See inline.
Rprof()

replicate(100, {
})

Rprof(NULL)

summaryRprof()

## read ?Rprof to get a sense of what it does

## read the summary to determine where time is being spent.

## the result was surprising to me. YMMV.

## there may be redundancies that you can eliminate by 
##  - doing the setup within gam() one time and saving it
##  - calling the worker functions by modifying the setup 
##    in a loop or function and saving the results
This is something you should learn to do. It is pretty standard practice. Use the body of your for loop as the body of a function, add arguments, and create a suitable return value. The something like

	lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)

should work.  If it does, then parallel::mclapply(...) should also work.

HTH,

Chuck
#
Charles writes about saving execution time by eliminating redundancies.
If you see redundancies related to calling a time-consuming function
multiple times with the same arguments, a very easy way to speed up your
program is to memoise the functions using the package memoise.

HTH,
Eric
On Wed, Jan 23, 2019 at 8:34 PM Berry, Charles <ccberry at ucsd.edu> wrote:

            

  
  
#
Before posting on the r-help list I did run Rprof(). In my posting I asked
for help with re-writing the specific script into sapply() or
foreach()/doParallel format.

Thanks anyway for your time and suggestions,
Lexo
On Thu, Jan 24, 2019 at 12:38 AM Eric Berger <ericjberger at gmail.com> wrote: