[R] Yet another set of codes to optimize
Daren Tan daren76 at hotmail.com
Fri Dec 5 03:41:23 CET 2008
I have problems converting my dataset from long to wide format.
Previous attempts using the reshape package and the aggregate function
were unsuccessful as they took too long. Apparently, my simplified
solution takes just as long.
My complete code is given below. When sample.size = 10000, the
execution takes about 20 seconds, but sample.size = 100000 seems to
take an eternity. My actual sample.size is 15000000, i.e. 15 million.
sample.size <- 10000
m <- data.frame(Name=sample(1:100000, sample.size, T),
Your for loop is tabulating the items in m.ids and m[,3]
so think of using table(). E.g., replace
   res <- matrix(0, nr=length(v1), nc=length(v2), dimnames=list(v1, v2))
   for(i in 1:nrow(m)) {
      x <- m.ids[i]
      y <- m[i,3]
      res[x, y] <- res[x, y] + 1
   }
with
   res <- table(factor(m.ids, levels=v1), factor(m[,3]))
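To see that the two forms give the same counts, here is a small
sketch on made-up data. The vectors ids and item (and their values)
are hypothetical stand-ins for m.ids and m[,3]; they are not from the
original post.

```r
# Hypothetical stand-ins for m.ids and m[,3]
ids  <- c("1.a", "2.b", "1.a", "3.a", "2.b", "1.a")
item <- c(10, 20, 10, 30, 10, 20)
v1 <- sort(unique(ids))
v2 <- sort(unique(item))

# Loop version: one increment per input row
loop.res <- matrix(0, nrow=length(v1), ncol=length(v2),
                   dimnames=list(v1, v2))
for(i in seq_along(ids)) {
   x <- ids[i]
   y <- as.character(item[i])   # index by column name, not by position
   loop.res[x, y] <- loop.res[x, y] + 1
}

# table() version: the same counts from one vectorized call
tab.res <- table(factor(ids, levels=v1), factor(item, levels=v2))

# Compare the counts only; the two objects differ in class
same <- all(loop.res == unclass(tab.res))
```

Passing levels= to both factor() calls pins the row and column order,
so the comparison is cell by cell.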
There is a bit of trickiness in putting this table into
the data.frame. Since as.data.frame(tableObject) works very
differently than as.data.frame(matrixObject), the naive
   data.frame(m.12.unique[,1], m.12.unique[,2], res, row.names=NULL)
fails. You need to convert the table res into a matrix with
the same data, dimensions, and dimnames.
   data.frame(m.12.unique[,1], m.12.unique[,2], as.matrix(res),
              row.names=NULL)
also fails because a "table" object is a "matrix" object so
as.matrix(tableObject) returns its input, unchanged.
as(res,"matrix") seems to work, as does the wordier
but more explicit array(res,dim(res),dimnames(res)).
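The difference is easiest to see on a tiny table. The sketch below
uses made-up factor values (hypothetical, not from the original data)
and only the array() form of the conversion.

```r
# Small hypothetical table: 3 row levels, 2 column levels
res <- table(factor(c("1.a", "2.b", "1.a", "3.a")),
             factor(c("u", "v", "v", "u")))

# as.matrix() hands back the same object, still of class "table"
still.table <- inherits(as.matrix(res), "table")

# so data.frame() takes the table route and returns long format:
# one row per cell, with columns Var1, Var2, Freq
long <- data.frame(res)

# Rebuilding the array drops the class; data.frame() then gives
# one column per column of the matrix
mat <- array(res, dim(res), dimnames(res))
wide <- data.frame(mat)
```

Here long has 6 rows and 3 columns, while wide has 3 rows and 2
columns, which is the shape wanted for the long-to-wide conversion.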
res1 <-
function(m) {
   # One output row per distinct (Name, Type) pair, sorted
   m.12.unique <- unique(m[,1:2])
   m.12.unique <- m.12.unique[order(m.12.unique[,1], m.12.unique[,2]),]
   v1 <- paste(m.12.unique[,1], m.12.unique[,2], sep=".")
   v2 <- c(sort(unique(m[,3])))
   # Left over from the loop version; table() below overwrites it
   res <- matrix(0, nr=length(v1), nc=length(v2), dimnames=list(v1, v2))
   m.ids <- paste(m[,1], m[,2], sep=".")
   # Count every (id, item) combination in one call
   res <- table(factor(m.ids,levels=v1), factor(m[,3]))
   # as(res, "matrix") drops the "table" class so data.frame() stays wide
   res <- data.frame(m.12.unique[,1], m.12.unique[,2],
                     as(res, "matrix"), row.names=NULL)
   colnames(res) <- c("Name", "Type", v2)
   return(res)
}
Here is a table of times for your original function, time0,
and this modified one, time1. It looks like res1 eventually
becomes worse than linear, but for a much larger size than
your original. sort() and unique() cannot have linear time
so they may be becoming factors at size=1e6.
      size    time0   time1
1       10    0.012   0.012
2      100    0.032   0.014
3      200    0.061   0.016
4      400    0.126   0.020
5      800    0.286   0.028
6     1000    0.383   0.033
7     2000    2.337   0.054
8     4000    8.578   0.100
9     8000   39.955   0.214
10   10000   68.767   0.318
11   20000  327.973   1.057
12  100000       NA   3.021
13 1000000       NA  89.881
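A table like this can be collected with system.time(). The sketch
below is only an illustration: make.m() and time.one() are
hypothetical helpers, the Type and Item columns and their value
ranges are assumptions (the original construction of m is not shown
in full), and it times only the table() step rather than all of res1.

```r
# Hypothetical generator; only the Name column follows the original post,
# the Type and Item columns are assumptions
make.m <- function(size) {
   data.frame(Name=sample(1:100000, size, TRUE),
              Type=sample(letters[1:5], size, TRUE),
              Item=sample(1:10, size, TRUE))
}

# Elapsed seconds for the table() step at one size
time.one <- function(size) {
   m <- make.m(size)
   m.ids <- paste(m[,1], m[,2], sep=".")
   system.time(table(factor(m.ids), factor(m[,3])))[["elapsed"]]
}

sizes <- c(10, 100, 1000)
times <- data.frame(size=sizes, time1=sapply(sizes, time.one))
```

Larger entries can be appended to sizes to see where the growth stops
looking linear on a given machine.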
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com