Hello,
I've been dealing with a set of values that contain time stamps and part
of my summary needs to look at just weekend data. In trying to limit the
data I've found a large difference in performance in the way I index a
data frame. I've constructed a minimal example here to try to explain my
observation.
is.weekend <- function(x) {
    tm <- as.POSIXlt(x,origin="1970/01/01")
    format(tm,"%a") %in% c("Sat","Sun")
}
use.lapply <- function(data) {
    data[do.call(rbind,lapply(data$TIME,FUN=is.weekend)),]
}
use.sapply <- function(data) {
    data[sapply(data$TIME,FUN=is.weekend),]
}
use.vapply <- function(data) {
    data[vapply(data$TIME,FUN=is.weekend,FALSE),]
}
use.indexing <- function(data) {
    data[is.weekend(data$TIME),]
}
And the results of these methods:
> names(csv.data)
[1] "TIME" "FILE" "RADIAN" "BITS" "DURATION"
> length(csv.data$TIME)
[1] 21471
> system.time(v1 <- use.lapply(csv.data))
   user  system elapsed
 19.562   6.402  25.967
> system.time(v2 <- use.sapply(csv.data))
   user  system elapsed
 19.456   6.492  25.951
> system.time(v3 <- use.vapply(csv.data))
   user  system elapsed
 19.334   6.468  25.808
> system.time(v4 <- use.indexing(csv.data))
   user  system elapsed
  0.032   0.020   0.052
> all(identical(v1,v2),identical(v2,v3),identical(v3,v4))
[1] TRUE
Forgive what is probably a trivial question, but why is there such a
large difference in the *apply functions as opposed to the direct
indexing method? On the surface it seems as though the use.indexing
method uses the entire vector as an argument to the function while the
others /might/ iterate over the values using one at a time as an
argument to the function. In either case all elements must be part of
the calculation...
Thanks for any insight.
Jesse
Data Frame Indexing
2 messages · Jesse Brown, jim holtman
The problem is that the way you are using "*apply" makes a separate call to the function for each item. With direct indexing, you make only a single call with a vector of values. Here is an illustration that shows the number of calls:
# count the calls
f.test <- function(x) callCnt <<- callCnt + 1  # test function; just increment counter
# test vector
x <- 1:100
callCnt <- 0
invisible(sapply(x, f.test))
callCnt  # notice that there were 100 calls made
[1] 100

This again indicates that you need to think about how to vectorize your operations. Also if you have used Rprof, it may have shown where you were spending time.
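To make the point concrete for the weekend filter itself, here is a minimal sketch that runs the poster's is.weekend both ways. It assumes a synthetic vector of timestamps in place of csv.data$TIME; the names times, counted, per.element and vectorized are made up for this illustration and are not from the original post. Timings will differ by machine, so only the call counts and the equality of the results are checked.

```r
# Synthetic stand-in for csv.data$TIME (an assumption, not the poster's data):
# 2,000 hourly timestamps stored as seconds since 1970-01-01.
times <- as.numeric(as.POSIXct("2011-08-01", tz = "UTC")) + 3600 * (0:1999)

# Same weekend test as in the original post.
# Note: format(tm, "%a") is locale-dependent; "Sat"/"Sun" assumes an English locale.
is.weekend <- function(x) {
  tm <- as.POSIXlt(x, origin = "1970/01/01")
  format(tm, "%a") %in% c("Sat", "Sun")
}

# Hypothetical wrapper that counts how often is.weekend is invoked.
n.calls <- 0
counted <- function(x) {
  n.calls <<- n.calls + 1
  is.weekend(x)
}

# Per-element: vapply calls the function once for every timestamp.
t.each <- system.time(per.element <- vapply(times, counted, logical(1)))
calls.each <- n.calls

# Vectorized: one call that receives the whole vector.
n.calls <- 0
t.vec <- system.time(vectorized <- counted(times))
calls.vec <- n.calls

cat("calls, per-element:", calls.each, " vectorized:", calls.vec, "\n")
cat("elapsed seconds, per-element:", t.each[["elapsed"]],
    " vectorized:", t.vec[["elapsed"]], "\n")
cat("same result:", identical(per.element, vectorized), "\n")
```

Both approaches do the same arithmetic on every element, but the per-element version pays the fixed cost of as.POSIXlt, format and %in% once per timestamp instead of once in total, which is where the difference comes from.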
On Mon, Aug 22, 2011 at 8:13 AM, Jesse Brown <jesse.r.brown at lmco.com> wrote:
[...]
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?