merge performance degradation in 2.9.1
Is there a way to avoid the degradation in performance in 2.9.1?
If the example is meant to demonstrate a difference between R versions that you really need to get to the bottom of, then read no further. However, if the example is actually what you want to do, then you can speed it up by using a data.table, reducing the 26 seconds to 1 second. Timings are from my (quite old now!) PC at home:
system.time(Out <- merge(X, Y, by="mon", all=TRUE))
   user  system elapsed
  25.63    0.58   26.98
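As an aside, for this particular lookup (every mon in X matches exactly one row of Y), a base-R workaround that sidesteps merge() entirely is match(). This is a minimal sketch, not from the original thread, shown with a smaller N so it runs quickly:

    # Same shape as the example, smaller N
    N <- 1000
    X <- data.frame(group = rep(12:1, each = N),
                    mon   = rep(rev(month.abb), each = N),
                    stringsAsFactors = FALSE)
    Y <- data.frame(mon = month.abb, letter = letters[1:12],
                    stringsAsFactors = FALSE)

    # match() gives, for each row of X, the index of the matching row of Y;
    # subscripting Y$letter by it adds the joined column without merge()
    X$letter <- Y$letter[match(X$mon, Y$mon)]

    head(X, 3)

This avoids merge()'s row-name bookkeeping (the sort.list/make.unique work visible in the profiles below), but unlike merge(..., all=TRUE) it only works when the join is a simple one-to-one lookup.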
Using a data.table instead:
X <- data.table(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N),
key="mon")
Y <- data.table(mon=month.abb, letter=letters[1:12], key="mon")
tables()
     NAME      NROW COLS       KEY
[1,] X    1,200,000 group,mon  mon
[2,] Y           12 mon,letter mon
system.time(X$letter <- Y[X,letter]) # Y[X] is the syntax for merge of two data.tables
   user  system elapsed
   0.98    0.11    1.10
identical(Out$letter, X$letter)
[1] TRUE
identical(Out$mon, X$mon)
[1] TRUE
identical(Out$group, X$group)
[1] TRUE

To do the multi-column equi-join of X and Z, set a key of two columns. In data.table, 'nomatch' is the equivalent of merge's 'all' and can be set to 0 (inner join) or NA (outer join).

"Adrian Dragulescu" <adrian_d at eskimo.com> wrote in message news:Pine.LNX.4.64.0907090953580.1125 at shell.eskimo.com...
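The two-column join just described might look like the following. This is a sketch assuming the data.table package is installed; X and Z follow the definitions in the quoted message, with a smaller N:

    library(data.table)

    N <- 1000
    X <- data.table(group = rep(12:1, each = N),
                    mon   = rep(rev(month.abb), each = N))
    Z <- data.table(mon = month.abb, letter = letters[1:12], group = 1:12)

    # Key both tables on the two join columns
    setkey(X, mon, group)
    setkey(Z, mon, group)

    # nomatch plays the role of merge's 'all'
    inner <- Z[X, nomatch = 0]    # like all = FALSE
    outer <- Z[X, nomatch = NA]   # like all = TRUE

Here every (mon, group) pair in X exists in Z, so the inner and outer joins return the same 12*N rows; they differ only when some rows of X have no match in Z.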
I have noticed a significant performance degradation using merge in 2.9.1
relative to 2.8.1. Here is what I observed:
N <- 100000
X <- data.frame(group=rep(12:1, each=N),
                mon=rep(rev(month.abb), each=N))
X$mon <- as.character(X$mon)
Y <- data.frame(mon=month.abb, letter=letters[1:12])
Y$mon <- as.character(Y$mon)
Z <- cbind(Y, group=1:12)
system.time(Out <- merge(X, Y, by="mon", all=TRUE))
# R 2.8.1 is 17% faster than R 2.9.1 for N=100000
system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
# R 2.8.1 is 16% faster than R 2.9.1 for N=100000
Here is the head of summaryRprof() for 2.8.1
$by.self
                   self.time self.pct total.time total.pct
sort.list               4.60     56.5       4.60      56.5
make.unique             1.68     20.6       2.18      26.8
as.character            0.50      6.1       0.50       6.1
duplicated.default      0.50      6.1       0.50       6.1
merge.data.frame        0.20      2.5       8.02      98.5
[.data.frame            0.16      2.0       7.10      87.2
and for 2.9.1
$by.self
             self.time self.pct total.time total.pct
sort.list         4.66     39.2       4.66      39.2
nchar             3.28     27.6       3.28      27.6
make.unique       1.42     12.0       1.92      16.2
as.character      0.50      4.2       0.50       4.2
data.frame        0.46      3.9       4.12      34.7
[.data.frame      0.44      3.7       7.28      61.3
As you can see, 2.9.1 has an nchar entry that is quite time-consuming.
Is there a way to avoid the degradation in performance in 2.9.1?
Thank you,
Adrian
As an aside, I got interested in testing merge in 2.9.1 after reading the
r-devel message from 30-May-2009, "Degraded performance with rank()" by Tim
Bergsma, since he mentions doing merges; I only got around to testing today.