My code seems to be spending most of its time in assignment statements,
in some cases simple assignment of a model frame or model matrix.
Can anyone provide any insights into what's going on, or how to speed
things up?
For starters, is it possible that the reports are not accurate, or that
I am misreading them. In R 3.0.1 (running under ESS):
> Rprof(line.profiling=TRUE)
> system.time(r <- totalEffect(dodata[[1]], dodata[[2]], 1:3, 4))
user system elapsed
21.629 0.756 22.469
!> Rprof(NULL)
> summaryRprof(lines="both")
$by.self
self.time self.pct total.time total.pct
box.R#158 6.74 29.56 13.06 57.28
simulator.multinomial.R#64 2.92 12.81 2.96 12.98
simulator.multinomial.R#63 2.76 12.11 2.76 12.11
box.R#171 2.54 11.14 5.08 22.28
simulator.d1.R#70 0.98 4.30 0.98 4.30
simulator.d1.R#71 0.98 4.30 0.98 4.30
densMap.R#42 0.72 3.16 0.86 3.77
"standardGeneric" 0.52 2.28 11.30 49.56
......
Here's some of the code, with comments at the line numbers
box.R:
sp <- merge(sexpartner, data, by="studyidx")
sp$y <- numFactor(sp$pEthnic) #I think y is not used but must be present
data(sims.c1[[k]]) <- sp ###<<<<< line 158
sp0 <- sp
sp <- sim(sims.c1[[k]], i)
ctable[[k]] <- update.c1(ctable[[k]], sp)
if (is.null(i.c1.in)) {
i.c1.in <- match("pEthnic", colnames(sp0))
i.c1.out <- match(c("studyidx", "n", "pEthnic"), colnames(sp))
}
sp0 <- merge(sp0[,-i.c1.in], sp[,i.c1.out], by=c("studyidx", "n"))
# d1
sp0 <- sp0[sp0$pIsMale == 1,]
# avoid lots of conversion warnings
sp0$pEthnic <- factor(sp0$pEthnic, levels=partRaceLevels)
data(sims.d1[[k]]) <- sp0 ###<<<<< line 171
sp <- sim(sims.d1[[k]], i)
dtable[[k]] <- update.d1(dtable[[k]], sp)
rngstate[[k]] <- .Random.seed
The timing seems odd since it doesn't appear there's anything to do at
the 2 lines except invoke data<-, but if that's slow I would expect the
time to go to the data<- function (in a different file) and not to the
call.
In fact the other big time items are inside the data<- functions.
simulator.multinomial.R:
setMethod("data<-", c("simulator.multinomial", "data.frame"),
function(obj, value) {
mf <- model.frame(obj at dataFormula, data=value)
mf$iCluster <- fromOrig(obj at idmap, as.character(mf$studyidx))
if (any(is.na(mf$iCluster)))
stop("New studyidx--need to draw from meta distn")
mm <- model.matrix(obj at modelFormula, data=mf)
obj at data <- mf ##<<< line 63
obj at mm <- mm ##<<< line 64
return(obj)
})
The mm and data slots have type restrictions, but no other validation
tests.
setClass("simulator.multinomial",
representation(fit="stanfit", idmap="sIDMap",
modelFormula="formula",
categories="ANY", # could be factor or character
# categories should be in the order of their numeric codes in y
# cached results
coef="list",
data="data.frame",
dataFormula="formula",
mm="matrix"))
Does it matter that, e.g., a model frame is more than a vanilla data frame?
I thought assignment, given R's lazy copying behavior, was essentially
resetting a pointer, and so should be fast.
Or maybe the time is going to garbage collecting the previous contents
of the slots?
Ross Boylan
spending most of my time in assignments?
3 messages · Duncan Murdoch, Ross Boylan
On 13-12-19 6:37 PM, Ross Boylan wrote:
My code seems to be spending most of its time in assignment statements, in some cases simple assignment of a model frame or model matrix. Can anyone provide any insights into what's going on, or how to speed things up?
You are seeing a lot of time being spent on complex assignments. For example, line 158 is data(sims.c1[[k]]) <- sp That makes a function call to `data<-` to do the assignment, and that could be slow. Since it's an S4 method there's a bunch of machinery involved in dispatching it; most of that would not have line number information, so it'll be charged to that line. I can't really suggest how to speed it up. Duncan Murdoch
For starters, is it possible that the reports are not accurate, or that I am misreading them. In R 3.0.1 (running under ESS):
> Rprof(line.profiling=TRUE) > system.time(r <- totalEffect(dodata[[1]], dodata[[2]], 1:3, 4))
user system elapsed 21.629 0.756 22.469 !> Rprof(NULL)
> summaryRprof(lines="both")
$by.self
self.time self.pct total.time total.pct
box.R#158 6.74 29.56 13.06 57.28
simulator.multinomial.R#64 2.92 12.81 2.96 12.98
simulator.multinomial.R#63 2.76 12.11 2.76 12.11
box.R#171 2.54 11.14 5.08 22.28
simulator.d1.R#70 0.98 4.30 0.98 4.30
simulator.d1.R#71 0.98 4.30 0.98 4.30
densMap.R#42 0.72 3.16 0.86 3.77
"standardGeneric" 0.52 2.28 11.30 49.56
......
Here's some of the code, with comments at the line numbers
box.R:
sp <- merge(sexpartner, data, by="studyidx")
sp$y <- numFactor(sp$pEthnic) #I think y is not used but must be present
data(sims.c1[[k]]) <- sp ###<<<<< line 158
sp0 <- sp
sp <- sim(sims.c1[[k]], i)
ctable[[k]] <- update.c1(ctable[[k]], sp)
if (is.null(i.c1.in)) {
i.c1.in <- match("pEthnic", colnames(sp0))
i.c1.out <- match(c("studyidx", "n", "pEthnic"), colnames(sp))
}
sp0 <- merge(sp0[,-i.c1.in], sp[,i.c1.out], by=c("studyidx", "n"))
# d1
sp0 <- sp0[sp0$pIsMale == 1,]
# avoid lots of conversion warnings
sp0$pEthnic <- factor(sp0$pEthnic, levels=partRaceLevels)
data(sims.d1[[k]]) <- sp0 ###<<<<< line 171
sp <- sim(sims.d1[[k]], i)
dtable[[k]] <- update.d1(dtable[[k]], sp)
rngstate[[k]] <- .Random.seed
The timing seems odd since it doesn't appear there's anything to do at
the 2 lines except invoke data<-, but if that's slow I would expect the
time to go to the data<- function (in a different file) and not to the
call.
In fact the other big time items are inside the data<- functions.
simulator.multinomial.R:
setMethod("data<-", c("simulator.multinomial", "data.frame"),
function(obj, value) {
mf <- model.frame(obj at dataFormula, data=value)
mf$iCluster <- fromOrig(obj at idmap, as.character(mf$studyidx))
if (any(is.na(mf$iCluster)))
stop("New studyidx--need to draw from meta distn")
mm <- model.matrix(obj at modelFormula, data=mf)
obj at data <- mf ##<<< line 63
obj at mm <- mm ##<<< line 64
return(obj)
})
The mm and data slots have type restrictions, but no other validation
tests.
setClass("simulator.multinomial",
representation(fit="stanfit", idmap="sIDMap",
modelFormula="formula",
categories="ANY", # could be factor or character
# categories should be in the order of their numeric codes in y
# cached results
coef="list",
data="data.frame",
dataFormula="formula",
mm="matrix"))
Does it matter that, e.g., a model frame is more than a vanilla data frame?
I thought assignment, given R's lazy copying behavior, was essentially
resetting a pointer, and so should be fast.
Or maybe the time is going to garbage collecting the previous contents
of the slots?
Ross Boylan
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
6 days later
On Thu, 2013-12-19 at 20:37 -0500, Duncan Murdoch wrote:
On 13-12-19 6:37 PM, Ross Boylan wrote:
My code seems to be spending most of its time in assignment statements, in some cases simple assignment of a model frame or model matrix. Can anyone provide any insights into what's going on, or how to speed things up?
You are seeing a lot of time being spent on complex assignments. For example, line 158 is data(sims.c1[[k]]) <- sp That makes a function call to `data<-` to do the assignment, and that could be slow. Since it's an S4 method there's a bunch of machinery involved in dispatching it; most of that would not have line number information, so it'll be charged to that line. I can't really suggest how to speed it up. Duncan Murdoch
Simply reexpressing the same computations without assigning to S4 slots
or dispatching using S4 cut the execution time to under 50% of what it
had been.
I attempted to recreate this with a small bit of code, but got nothing
like the effect in the original.
Here's the example; v1 is like my original code and v2 is like the
restructured code:
slow <- setClass("slow",
representation(form="formula", data="data.frame", mm="matrix"))
slowb <- setClass("slowb", representation(form="formula", data="data.frame", mm="matrix"))
slowc <- setClass("slowc", representation(form="formula", data="data.frame", mm="matrix"))
# original had only 3 classes
# fake data. Using my real data didn't seem to affect relative times much.
# real data had ~ 100 columns in model matrix
mydata <- data.frame(a=rnorm(1500), b=rnorm(1500), c=rnorm(1500),d=as.factor(rep(c("x", "y", "z"), 500)))
myformula <- ~a*d + b*d + c*d
mymm <- model.matrix(myformula, mydata)
if (!isGeneric("putData<-"))
setGeneric("putData<-", function(obj, value) standardGeneric("putData<-"))
setMethod("putData<-", c("slow", "data.frame"),
function(obj, value){
myf <- model.frame(obj at form, value)
obj at data <- myf
obj at mm <- model.matrix(obj at form, myf)
obj
}
)
setMethod("putData<-", c("slowb", "data.frame"),
function(obj, value){
myf <- model.frame(obj at form, value)
obj at data <- myf
obj at mm <- model.matrix(obj at form, myf)
obj
}
)
setMethod("putData<-", c("slowc", "data.frame"),
function(obj, value){
myf <- model.frame(obj at form, value)
obj at data <- myf
obj at mm <- model.matrix(obj at form, myf)
obj
}
)
v1 <- function(n) {
s <- list(slow(form=myformula, mm=mymm), 1:5)
for (i in 1:n) {
mydata$b <- rnorm(nrow(mydata))
putData(s[[1]]) <- mydata
}
mm <- s[[1]]@mm
}
# v2 eliminates the dispatch on putData and the assignment
# to the S4 slot
v2 <- function(n) {
s <- slow(form=myformula, mm=mymm)
for (i in 1:n) {
mydata$b <- rnorm(nrow(mydata))
myf <- model.frame(s at form, mydata)
mm <- model.matrix(s at form, myf)
}
mm
}
system.time(r <- v1(100))
user system elapsed 0.304 0.000 0.307
system.time(r <- v2(100))
user system elapsed
0.26 0.00 0.26
Ross Boylan
For starters, is it possible that the reports are not accurate, or that I am misreading them. In R 3.0.1 (running under ESS):
> Rprof(line.profiling=TRUE) > system.time(r <- totalEffect(dodata[[1]], dodata[[2]], 1:3, 4))
user system elapsed 21.629 0.756 22.469 !> Rprof(NULL)
> summaryRprof(lines="both")
$by.self
self.time self.pct total.time total.pct
box.R#158 6.74 29.56 13.06 57.28
simulator.multinomial.R#64 2.92 12.81 2.96 12.98
simulator.multinomial.R#63 2.76 12.11 2.76 12.11
box.R#171 2.54 11.14 5.08 22.28
simulator.d1.R#70 0.98 4.30 0.98 4.30
simulator.d1.R#71 0.98 4.30 0.98 4.30
densMap.R#42 0.72 3.16 0.86 3.77
"standardGeneric" 0.52 2.28 11.30 49.56
......
Here's some of the code, with comments at the line numbers
box.R:
sp <- merge(sexpartner, data, by="studyidx")
sp$y <- numFactor(sp$pEthnic) #I think y is not used but must be present
data(sims.c1[[k]]) <- sp ###<<<<< line 158
sp0 <- sp
sp <- sim(sims.c1[[k]], i)
ctable[[k]] <- update.c1(ctable[[k]], sp)
if (is.null(i.c1.in)) {
i.c1.in <- match("pEthnic", colnames(sp0))
i.c1.out <- match(c("studyidx", "n", "pEthnic"), colnames(sp))
}
sp0 <- merge(sp0[,-i.c1.in], sp[,i.c1.out], by=c("studyidx", "n"))
# d1
sp0 <- sp0[sp0$pIsMale == 1,]
# avoid lots of conversion warnings
sp0$pEthnic <- factor(sp0$pEthnic, levels=partRaceLevels)
data(sims.d1[[k]]) <- sp0 ###<<<<< line 171
sp <- sim(sims.d1[[k]], i)
dtable[[k]] <- update.d1(dtable[[k]], sp)
rngstate[[k]] <- .Random.seed
The timing seems odd since it doesn't appear there's anything to do at
the 2 lines except invoke data<-, but if that's slow I would expect the
time to go to the data<- function (in a different file) and not to the
call.
In fact the other big time items are inside the data<- functions.
simulator.multinomial.R:
setMethod("data<-", c("simulator.multinomial", "data.frame"),
function(obj, value) {
mf <- model.frame(obj at dataFormula, data=value)
mf$iCluster <- fromOrig(obj at idmap, as.character(mf$studyidx))
if (any(is.na(mf$iCluster)))
stop("New studyidx--need to draw from meta distn")
mm <- model.matrix(obj at modelFormula, data=mf)
obj at data <- mf ##<<< line 63
obj at mm <- mm ##<<< line 64
return(obj)
})
The mm and data slots have type restrictions, but no other validation
tests.
setClass("simulator.multinomial",
representation(fit="stanfit", idmap="sIDMap",
modelFormula="formula",
categories="ANY", # could be factor or character
# categories should be in the order of their numeric codes in y
# cached results
coef="list",
data="data.frame",
dataFormula="formula",
mm="matrix"))
Does it matter that, e.g., a model frame is more than a vanilla data frame?
I thought assignment, given R's lazy copying behavior, was essentially
resetting a pointer, and so should be fast.
Or maybe the time is going to garbage collecting the previous contents
of the slots?
Ross Boylan
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.