More efficient option to append()?
Hi:
This takes a bit less code, avoids explicit loops (the looping happens
internally in mapply()), and runs in about 10 seconds on my system:
m <- cbind(x = sample(1:15, 2000000, replace = TRUE),
           y = sample((1:10) * 1000, 2000000, replace = TRUE))
sum(m[, 1])
# [1] 16005804
ff <- function(x, y) rep(y, x)
system.time(w <- do.call(c, mapply(ff, m[, 1], m[, 2])))
   user  system elapsed
   9.75    0.00    9.75
length(w)
[1] 16005804
plyr::count(w)
       x    freq
1   1000 1603184
2   2000 1590599
3   3000 1596661
4   4000 1607112
5   5000 1598571
6   6000 1599195
7   7000 1600475
8   8000 1601718
9   9000 1598896
10 10000 1609393

HTH,
Dennis

PS: It would have been a good idea to keep the OP in the loop of this thread.

On Thu, Aug 18, 2011 at 12:46 AM, Timothy Bates
<timothy.c.bates at gmail.com> wrote:
This takes a few seconds for 1 million lines, and keeps the explicit for-loop form:
numberofSalaryBands = 1000000 # 2000000
x        = sample(1:15, numberofSalaryBands, replace = TRUE)
y        = sample((1:10) * 1000, numberofSalaryBands, replace = TRUE)
df       = data.frame(x, y)
finalN   = sum(df$x)
myVar    = rep(NA, finalN)
outIndex = 1
for (i in 1:numberofSalaryBands) {
    kount = df$x[i]
    myVar[outIndex:(outIndex + kount - 1)] = rep(df$y[i], kount) # make x[i] copies of value y[i]
    outIndex = outIndex + kount
}
head(myVar)
plyr::count(myVar)
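As an aside, the rep() call inside that loop is itself vectorized over both arguments, so the whole expansion collapses into a single call. A minimal sketch on a toy sample (hypothetical values matching the head of the OP's matrix):

```r
x <- c(10, 3, 3)          # expansion counts (first column)
y <- c(4000, 1000, 4000)  # household incomes (second column)

# rep() accepts a vector for `times`: each y[i] is repeated x[i] times
myVar <- rep(y, times = x)

length(myVar)  # equals sum(x), here 16
head(myVar)    # 4000 4000 4000 4000 4000 4000
```

The same one-liner applies to the full matrix, e.g. rep(m[, 2], times = m[, 1]).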
On Aug 18, 2011, at 12:17 AM, Alex Ruiz Euler wrote:
Dear R community,
I have a 2 million by 2 matrix that looks like this:
x <- sample(1:15, 2000000, replace = TRUE)
y <- sample((1:10) * 1000, 2000000, replace = TRUE)
      x     y
[1,] 10  4000
[2,]  3  1000
[3,]  3  4000
[4,]  8  6000
[5,]  2  9000
[6,]  3  8000
[7,]  2 10000
(...)
The first column is a population expansion factor for the number in the
second column (household income). I want to expand the second column
with the first so that I end up with a vector beginning with 10
observations of 4000, then 3 observations of 1000 and so on. In my mind
the natural approach would be to create a NULL vector and append the
expansions:
myvar <- NULL
myvar <- append(myvar, replicate(x[1], y[1]), 1)
for (i in 2:length(x)) {
  myvar <- append(myvar, replicate(x[i], y[i]), sum(x[1:i]) + 1)
}
to end with a vector of sum(x), which in my real database corresponds
to 22 million observations.
This works fine -- if I only run it on the first, say, 1000
observations. If I try to perform it on all 2 million observations it
takes far too long to be useful (I left it running for 11 hours
yesterday to no avail).
I know R performs well with operations on relatively large vectors. Why
is this so inefficient? And what would be the smart way to do this?
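One way to see why: append() returns a new vector, so each iteration copies everything accumulated so far, and total work grows roughly quadratically with the output length, whereas preallocating the full result and filling it in place writes each element once. A small sketch contrasting the two patterns (same result, very different scaling):

```r
n <- 2000
x <- sample(1:15, n, replace = TRUE)
y <- sample((1:10) * 1000, n, replace = TRUE)

# Pattern 1: grow with append() -- re-copies the whole vector each time
grow <- NULL
for (i in 1:n) grow <- append(grow, rep(y[i], x[i]))

# Pattern 2: preallocate once, then fill in place
fill <- numeric(sum(x))
pos  <- 1
for (i in 1:n) {
  fill[pos:(pos + x[i] - 1)] <- y[i]
  pos <- pos + x[i]
}

identical(grow, fill)  # TRUE
```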
Thanks in advance.
Alex
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.