I'm trying to speed up a script that otherwise takes days to handle larger
data sets. So, is there a way to completely vectorize or paralellize the
following script:
*# k-fold cross validation*
df <- trees # a data frame 'trees' from R.
df <- df[sample(nrow(df)), ] # randomly shuffles the data.
k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
validation.
folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
unique numbers for k equally size folds.
df$ID <- folds # adds fold IDs.
df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1"
"pred2" "pred3" to speed up the following loop.
library(mgcv)
for(i in 1:k) {
# looping for different models:
m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
# looping for predictions:
df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response")
df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response")
df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response")
}
# calculating residuals:
df$res1 <- with(df, Volume - pred1)
df$res2 <- with(df, Volume - pred2)
df$res3 <- with(df, Volume - pred3)
Model <- paste("m", 1:3, sep="") # creates a vector of model names.
# creating a vector of mean-square errors (MSE):
MSE <- with(df, c(
sum(res1^2) / nrow(df),
sum(res2^2) / nrow(df),
sum(res3^2) / nrow(df)
))
model.mse <- data.frame(Model, MSE) # creates a data frame of model names
and mean-square errors.
model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous
data frame in order of increasing mean-square errors.
I'd appreciate any help. This code takes several days if run on >=30,000
different GAM models and 3 predictors. Could you please help with
re-writing the script into sapply() or foreach()/doParallel format?
Thanks
Lexo
Vectorizing a for-loop for cross-validation in R
4 messages · Aleksandre Gavashelishvili, Charles C. Berry, Eric Berger
See inline.
On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <aleksandre.gavashelishvili at iliauni.edu.ge> wrote:
I'm trying to speed up a script that otherwise takes days to handle larger
data sets. So, is there a way to completely vectorize or paralellize the
following script:
*# k-fold cross validation*
df <- trees # a data frame 'trees' from R.
df <- df[sample(nrow(df)), ] # randomly shuffles the data.
k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
validation.
folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
unique numbers for k equally size folds.
df$ID <- folds # adds fold IDs.
df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1"
"pred2" "pred3" to speed up the following loop.
library(mgcv)
Rprof()
replicate(100, {
for(i in 1:k) {
# looping for different models:
m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
# looping for predictions:
df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response")
df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response")
df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response")
}
}) Rprof(NULL) summaryRprof() ## read ?Rprof to get a sense of what it does ## read the summary to determine where time is being spent. ## the result was surprising to me. YMMV. ## there may be redundancies that you can eliminate by ## - doing the setup within gam() one time and saving it ## - calling the worker functions by modifying the setup ## in a loop or function and saving the results
# calculating residuals:
df$res1 <- with(df, Volume - pred1)
df$res2 <- with(df, Volume - pred2)
df$res3 <- with(df, Volume - pred3)
Model <- paste("m", 1:3, sep="") # creates a vector of model names.
# creating a vector of mean-square errors (MSE):
MSE <- with(df, c(
sum(res1^2) / nrow(df),
sum(res2^2) / nrow(df),
sum(res3^2) / nrow(df)
))
model.mse <- data.frame(Model, MSE) # creates a data frame of model names
and mean-square errors.
model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous
data frame in order of increasing mean-square errors.
I'd appreciate any help. This code takes several days if run on >=30,000
different GAM models and 3 predictors. Could you please help with
re-writing the script into sapply() or foreach()/doParallel format?
This is something you should learn to do. It is pretty standard practice. Use the body of your for loop as the body of a function, add arguments, and create a suitable return value. The something like lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...) should work. If it does, then parallel::mclapply(...) should also work. HTH, Chuck
Charles writes about saving execution time by eliminating redundancies. If you see redundancies related to calling a time-consuming function multiple times with the same arguments, a very easy way to speed up your program is to memoise the functions using the package memoise. HTH, Eric
On Wed, Jan 23, 2019 at 8:34 PM Berry, Charles <ccberry at ucsd.edu> wrote:
See inline.
On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <
aleksandre.gavashelishvili at iliauni.edu.ge> wrote:
I'm trying to speed up a script that otherwise takes days to handle
larger
data sets. So, is there a way to completely vectorize or paralellize the
following script:
*# k-fold cross validation*
df <- trees # a data frame 'trees' from R.
df <- df[sample(nrow(df)), ] # randomly shuffles the data.
k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
validation.
folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
unique numbers for k equally size folds.
df$ID <- folds # adds fold IDs.
df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1"
"pred2" "pred3" to speed up the following loop.
library(mgcv)
Rprof()
replicate(100, {
for(i in 1:k) {
# looping for different models:
m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
# looping for predictions:
df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response")
df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response")
df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response")
}
}) Rprof(NULL) summaryRprof() ## read ?Rprof to get a sense of what it does ## read the summary to determine where time is being spent. ## the result was surprising to me. YMMV. ## there may be redundancies that you can eliminate by ## - doing the setup within gam() one time and saving it ## - calling the worker functions by modifying the setup ## in a loop or function and saving the results
# calculating residuals:
df$res1 <- with(df, Volume - pred1)
df$res2 <- with(df, Volume - pred2)
df$res3 <- with(df, Volume - pred3)
Model <- paste("m", 1:3, sep="") # creates a vector of model names.
# creating a vector of mean-square errors (MSE):
MSE <- with(df, c(
sum(res1^2) / nrow(df),
sum(res2^2) / nrow(df),
sum(res3^2) / nrow(df)
))
model.mse <- data.frame(Model, MSE) # creates a data frame of model names
and mean-square errors.
model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous
data frame in order of increasing mean-square errors.
I'd appreciate any help. This code takes several days if run on >=30,000
different GAM models and 3 predictors. Could you please help with
re-writing the script into sapply() or foreach()/doParallel format?
This is something you should learn to do. It is pretty standard practice.
Use the body of your for loop as the body of a function, add arguments, and
create a suitable return value. The something like
lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)
should work. If it does, then parallel::mclapply(...) should also work.
HTH,
Chuck
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Before posting on the r-help list I did run Rprof(). In my posting I asked for help with re-writing the specific script into sapply() or foreach()/doParallel format. Thanks anyway for your time and suggestions, Lexo
On Thu, Jan 24, 2019 at 12:38 AM Eric Berger <ericjberger at gmail.com> wrote:
Charles writes about saving execution time by eliminating redundancies. If you see redundancies related to calling a time-consuming function multiple times with the same arguments, a very easy way to speed up your program is to memoise the functions using the package memoise. HTH, Eric On Wed, Jan 23, 2019 at 8:34 PM Berry, Charles <ccberry at ucsd.edu> wrote:
See inline.
On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <
aleksandre.gavashelishvili at iliauni.edu.ge> wrote:
I'm trying to speed up a script that otherwise takes days to handle
larger
data sets. So, is there a way to completely vectorize or paralellize the
following script:
*# k-fold cross validation*
df <- trees # a data frame 'trees' from R.
df <- df[sample(nrow(df)), ] # randomly shuffles the data.
k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
validation.
folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
unique numbers for k equally size folds.
df$ID <- folds # adds fold IDs.
df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1"
"pred2" "pred3" to speed up the following loop.
library(mgcv)
Rprof()
replicate(100, {
for(i in 1:k) {
# looping for different models:
m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
# looping for predictions:
df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response")
df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response")
df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response")
}
}) Rprof(NULL) summaryRprof() ## read ?Rprof to get a sense of what it does ## read the summary to determine where time is being spent. ## the result was surprising to me. YMMV. ## there may be redundancies that you can eliminate by ## - doing the setup within gam() one time and saving it ## - calling the worker functions by modifying the setup ## in a loop or function and saving the results
# calculating residuals:
df$res1 <- with(df, Volume - pred1)
df$res2 <- with(df, Volume - pred2)
df$res3 <- with(df, Volume - pred3)
Model <- paste("m", 1:3, sep="") # creates a vector of model names.
# creating a vector of mean-square errors (MSE):
MSE <- with(df, c(
sum(res1^2) / nrow(df),
sum(res2^2) / nrow(df),
sum(res3^2) / nrow(df)
))
model.mse <- data.frame(Model, MSE) # creates a data frame of model
names
and mean-square errors. model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous data frame in order of increasing mean-square errors. I'd appreciate any help. This code takes several days if run on >=30,000 different GAM models and 3 predictors. Could you please help with re-writing the script into sapply() or foreach()/doParallel format?
This is something you should learn to do. It is pretty standard practice.
Use the body of your for loop as the body of a function, add arguments, and
create a suitable return value. The something like
lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)
should work. If it does, then parallel::mclapply(...) should also work.
HTH,
Chuck
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.