This is the Rprof() report sorted by self time.
Is it possible that these routines, which account for most of the
self.time, are what is making optim() slow?
$by.self
                        self.time self.pct total.time total.pct
"FUN"                       94.16     16.5      94.16      16.5
"unlist"                    80.46     14.1     120.54      21.1
"lapply"                    76.94     13.5     255.48      44.7
"match"                     60.76     10.6      60.88      10.7
"as.matrix.data.frame"      31.00      5.4      51.12       8.9
"as.character"              29.28      5.1      29.28       5.1
"unique.default"            24.36      4.3      24.40       4.3
"data.frame"                21.06      3.7      55.78       9.8
"split.default"             20.42      3.6      84.38      14.8
"tapply"                    13.84      2.4     414.28      72.5
"structure"                 11.32      2.0      22.36       3.9
"factor"                    11.08      1.9     127.68      22.3
"attributes<-"              11.00      1.9      11.00       1.9
"=="                        10.56      1.8      10.56       1.8
"%*%"                       10.30      1.8      10.30       1.8
"as.vector"                 10.22      1.8      10.22       1.8
"as.integer"                 9.86      1.7       9.86       1.7
"list"                       9.64      1.7       9.64       1.7
"exp"                        7.12      1.2       7.12       1.2
"as.data.frame.integer"      5.98      1.0       8.10       1.4
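For reference, a report like the one above comes from wrapping the slow call in Rprof()/summaryRprof(); a minimal sketch (the objective f below is a busy-work stand-in, not the poster's model):

```r
## Sketch (not from the thread): producing a profiling report like the above.
## 'f' is a toy objective standing in for the real likelihood; the rnorm()
## call is only there so the sampling profiler collects enough samples.
f <- function(p) {
    z <- sum(sqrt(abs(rnorm(2e5))))
    sum((p - c(1, 2))^2) + 0 * z
}

Rprof("optim.out")                  # start the sampling profiler
res <- optim(c(0, 0), f)
Rprof(NULL)                         # stop profiling

rep <- summaryRprof("optim.out")    # list with $by.self and $by.total
head(rep$by.self)
```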
To: bbom419 at hotmail.com
CC: jholtman at gmail.com; r-help at r-project.org
Subject: Re: [R] avoiding loop
From: mtmorgan at fhcrc.org
Date: Sun, 1 Nov 2009 22:14:09 -0800
parkbomee <bbom419 at hotmail.com> writes:
Thank you all.
What Chuck has suggested might not be applicable since the number of
different times is around 40,000.
The quantity being varied in my function is "value", which is
basically data * parameter; the "parameter" is what optim() is
searching over.
And from the R profiling with a subset of data,
I got this report... any idea what "<Anonymous>" is?
$by.total
              total.time total.pct self.time self.pct
"<Anonymous>"     571.56     100.0      0.02      0.0
"optim"           571.56     100.0      0.00      0.0
"fn"              571.54     100.0      0.98      0.2
You're giving us 'by.total', so these lines say that all the time was
spent in these functions or in the functions they called. Probably all
are in 'optim' and its arguments; since little self.time is spent
here, there isn't much to work with.
"eval"            423.74      74.1      0.00      0.0
"with.default"    423.74      74.1      0.00      0.0
"with"            423.74      74.1      0.00      0.0
These are probably in the internals of optim, where the function
you're trying to optimize is being set up for evaluation. Again
there's little self.time, and all these say is that a big piece of the
time is being spent in code called by this code.
"tapply"          414.28      72.5     13.84      2.4
"lapply"          255.48      44.7     76.94     13.5
"factor"          127.68      22.3     11.08      1.9
"unlist"          120.54      21.1     80.46     14.1
"FUN"              94.16      16.5     94.16     16.5
These look like tapply-related calls (looking at the code for
tapply, it calls lapply, factor, and unlist, and FUN is the function
argument to tapply), perhaps from the function you're optimizing (did
you implement this as suggested below? It would really help to have a
possibly simplified version of the code you're calling).
There is material to work with here, as apparently a fairly large
amount of self.time is being spent in each of these functions. So
here's a sample data set:
n <- 100000
set.seed(123)
df <- data.frame(time=sort(as.integer(ceiling(runif(n)*n/5))),
                 value=ceiling(runif(n)*5))
It would have been helpful for you to provide reproducible code like
that above, so that the characteristics of your data were easy to
reproduce. Let's time tapply:
> replicate(5, {
+     system.time(x0 <<- tapply0(df$value, df$time, sum), gcFirst=TRUE)[[1]]
+ })
[1] 0.316 0.316 0.308 0.320 0.304
tapply is quite general, but in your case I think you'd be happy with
tapply1 <- function(X, INDEX, FUN)
    unlist(lapply(split(X, INDEX), FUN), use.names=FALSE)
> replicate(5, {
+     system.time(x1 <<- tapply1(df$value, df$time, sum), gcFirst=TRUE)[[1]]
+ })
[1] 0.156 0.148 0.152 0.144 0.152
so about twice the speed (timing depends quite a bit on what 'time' is,
integer or numeric or character or factor). The vector values of the
two calculations are identical, though tapply presents the data as an
array with column names
> identical(as.vector(x0), x1)
[1] TRUE
tapply allows FUN to be anything, but if the interest is in the sum of
each time interval, and the time intervals can be assumed to be sorted
(sorting is not expensive, so could be done on the fly), then
tapply2 <- function(X, INDEX)
{
    csum <- cumsum(c(0, X))
    idx <- diff(INDEX) != 0
    csum[c(FALSE, idx, TRUE)] - csum[c(TRUE, idx, FALSE)]
}
calculates the cumulative sum and the points in INDEX where the time
intervals change. It then takes the difference over the appropriate
interval.
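A tiny worked example (not from the original post) makes the cumsum/diff trick concrete:

```r
## Group sums via cumulative sums, assuming INDEX is sorted.
X     <- c(3, 2, 3, 3, 4, 4, 2)   # values
INDEX <- c(1, 1, 1, 1, 2, 2, 2)   # sorted group labels

csum <- cumsum(c(0, X))           # 0 3 5 8 11 15 19 21
idx  <- diff(INDEX) != 0          # TRUE only at the 1 -> 2 boundary
out  <- csum[c(FALSE, idx, TRUE)] - csum[c(TRUE, idx, FALSE)]
out                               # 11 10, the per-group sums
```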
> replicate(5, {
+     system.time(x2 <<- tapply2(df$value, df$time), gcFirst=TRUE)[[1]]
+ })
[1] 0.024 0.024 0.024 0.024 0.024
> identical(as.vector(x0), x2)
[1] TRUE
This approach could be subject to rounding error (if csum gets very
large and the intervals remain small). To calculate values where
choice == 1 I think you'd want to
tapply2(df$value * (df$choice==1), df$time)
rather than subsetting, so that the result of tapply2 is always a
vector of the same length, even when some time intervals never have
choice==1.
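A small illustration (toy data, not from the thread) of why multiplying rather than subsetting keeps the groups aligned:

```r
## Toy data (hypothetical): time group 2 has no choice == 1 row.
d <- data.frame(time   = c(1, 1, 2, 2, 3),
                choice = c(1, 0, 0, 0, 1),
                value  = c(3, 2, 4, 2, 5))

tapply2 <- function(X, INDEX) {
    csum <- cumsum(c(0, X))
    idx  <- diff(INDEX) != 0
    csum[c(FALSE, idx, TRUE)] - csum[c(TRUE, idx, FALSE)]
}

num <- tapply2(d$value * (d$choice == 1), d$time)   # 3 0 5: group 2 kept as 0
den <- tapply2(d$value, d$time)                     # 5 6 5
num / den                                           # 0.6 0 1
```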
Because tapply in these examples seems so fast compared to your
calculation, I wonder whether optim is evaluating your function many
times, and that reformulating the optimization might lead to a very
substantial speed-up?
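One cheap way to check that hunch (a sketch with a toy objective, not the poster's model) is to wrap fn in a counting closure:

```r
## Count how many times optim() evaluates the objective.
## 'f' is a hypothetical stand-in for the real likelihood.
make_counted <- function(f) {
    n <- 0L
    list(fn    = function(...) { n <<- n + 1L; f(...) },
         count = function() n)
}

f  <- function(p) sum((p - c(1, 2))^2)
cf <- make_counted(f)
res <- optim(c(0, 0), cf$fn)
cf$count()   # total number of objective evaluations
```

If the count runs into the hundreds or thousands, even a modest per-evaluation saving multiplies accordingly.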
Martin
Date: Sun, 1 Nov 2009 15:35:41 -0400
Subject: Re: [R] avoiding loop
From: jholtman at gmail.com
To: bbom419 at hotmail.com
CC: dwinsemius at comcast.net; d.rizopoulos at erasmusmc.nl;
r-help at r-project.org
What you need to do is understand how to use Rprof so that you can
determine where the time is being spent. It probably indicates that
this is not the source of slowness in your optimization function. How
much time are we talking about? You may spend more time trying to
optimize the function than just running the current version, even if
it is "slow" (slow is a relative term and does not hold much meaning
without some context around it).
On Sat, Oct 31, 2009 at 11:36 PM, parkbomee <bbom419 at hotmail.com>
wrote:
Thank you both.
However, using tapply() instead of a loop does not seem to improve my
code much.
I am using this inside an optimization function,
and it still takes longer than it should...
CC: bbom419 at hotmail.com; r-help at r-project.org
From: dwinsemius at comcast.net
To: d.rizopoulos at erasmusmc.nl
Subject: Re: [R] avoiding loop
Date: Sat, 31 Oct 2009 22:26:17 -0400
This is pretty much equivalent:
tapply(DF$value[DF$choice==1], DF$time[DF$choice==1], sum) /
tapply(DF$value, DF$time, sum)
And both will probably fail if the number of groups with choice==1 is
different from the number of groups overall.
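A quick illustration of that failure mode (toy data, not from the thread):

```r
## Toy data (hypothetical): no choice == 1 row at time 2.
DF <- data.frame(time   = c(1, 1, 2, 2),
                 choice = c(1, 0, 0, 0),
                 value  = c(3, 2, 4, 2))

num <- tapply(DF$value[DF$choice == 1], DF$time[DF$choice == 1], sum)
den <- tapply(DF$value, DF$time, sum)

length(num)   # 1 -- time 2 has been dropped from the numerator
length(den)   # 2
num / den     # length-1 num recycles over den, giving a misleading result
```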
--
David.
On Oct 31, 2009, at 5:14 PM, Dimitris Rizopoulos wrote:
one approach is the following:
# say 'DF' is your data frame, then
with(DF, {
ind <- choice == 1
n <- tapply(value[ind], time[ind], sum)
d <- tapply(value, time, sum)
n / d
})
I hope it helps.
Best,
Dimitris
parkbomee wrote:
Hi all,
I am trying to figure out a way to improve my code's efficiency by
avoiding the use of a loop.
I want to calculate a conditional mean(?) given time.
For example, from the data below, I want to calculate
sum((value|choice==1)/sum(value)) across time.
Is there a way to do it without using a loop?
time cum_time choice value
   1        4      1     3
   1        4      0     2
   1        4      0     3
   1        4      0     3
   2        6      1     4
   2        6      0     4
   2        6      0     2
   2        6      0     4
   2        6      0     2
   2        6      0     2
   3        4      1     2
   3        4      0     3
   3        4      0     5
   3        4      0     2
My code looks like:
objective[1] = value[1] / sum(value[1:cum_time[1]])
for (i in 2:max(time)) {
    objective[i] = value[cum_time[i-1]+1] /
        sum(value[(cum_time[i-1]+1):cum_time[i]])
}
sum(objective)
Does anyone have an idea how I can do this without using a loop?
Thanks.