Anyone got any hints on how to make this code more efficient? An early
version (which to be fair did more than this one does) ran for 330 hours
and produced no output.

I have a two-column table, Dat, with 12,000,000 rows, and I want to
produce a lookup table, ltable, as a one-dimensional vector with one
copy of each of the values in Dat:
ltable <- c()   # ltable must exist before the loop
for (i in 1:nrow(Dat)) {
  for (j in 1:2) {
    # If the next value is already in ltable, do nothing
    if (is.na(match(Dat[i, j], ltable))) {
      ltable <- rbind(ltable, Dat[i, j])
    }
  }
}
but it takes forever to produce anything.
Any advice gratefully received.
Thomas
Running *slow*
8 messages: R. Michael Weylandt, Patrick Burns, Thomas Friedrichsmeier, and one more
?unique

x <- matrix(c(1:6, 6:1), ncol = 2)
x.temp <- x
dim(x.temp) <- NULL
unique(x.temp)

Michael
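Spelled out with toy data (a small stand-in for the real 12,000,000 x 2 Dat; the values here are assumed numeric for illustration), Michael's suggestion works like this:

```r
# Toy stand-in for the real 12,000,000 x 2 Dat
Dat <- matrix(c(1, 2, 2, 3, 3, 1), ncol = 2)

# Copy, then drop the dim attribute: the matrix becomes a plain vector
ltable <- Dat
dim(ltable) <- NULL

# One vectorised call replaces the whole double loop
ltable <- unique(ltable)
ltable   # 1 2 3
```

Dropping the dim attribute is cheap (no data are copied for the conversion itself), which is why this is so much faster than growing ltable one value at a time.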
On Thu, Oct 6, 2011 at 8:37 AM, Thomas <chesney.alt at gmail.com> wrote:
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Probably most of the time you're waiting for this you are in Circle 2 of 'The R Inferno'. If the values are numbers, you might also be in Circle 1.
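Circle 2 of 'The R Inferno' is about growing objects. A minimal sketch of the difference (the variable names here are illustrative, not from the thread):

```r
# Growing an object copies it on every iteration (Circle 2);
# preallocating and filling in place avoids the copies
n <- 10000

grow <- numeric(0)
for (i in 1:n) grow <- c(grow, i)      # quadratic: copies a growing vector n times

prealloc <- numeric(n)                 # allocate once up front
for (i in 1:n) prealloc[i] <- i        # linear: fills in place

identical(grow, prealloc)              # same result, very different cost
```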
On 06/10/2011 13:37, Thomas wrote:
Patrick Burns pburns at pburns.seanet.com twitter: @portfolioprobe http://www.portfolioprobe.com/blog http://www.burns-stat.com (home of 'Some hints for the R beginner' and 'The R Inferno')
Patrick is right: most of the time is probably taken up for the reasons documented in the (masterful) R Inferno, namely the rbind() calls. There is another problem, though, and it gets at the very core of R and, for that matter, of every interpreted language I'm familiar with. I'll give a fairly elementary explanation and gloss over many of the subtleties that R core worries about so we mere mortals don't have to.

At the end of the day, everything is looped; there's no way to get around it. From a code perspective, however, we have a choice of looping in C or in R. Whenever possible it is better to loop in C than in R, and most of the key built-in functions, like unique(), are designed to do just that. The reason is pretty straightforward: consider what has to happen to run a loop in R:

- the iterator is defined: a sequence of C calls starts this
- the first line of the loop body is hit -> interpreted by R -> sent to C code -> executed -> changed back into an R result -> passed to the next line of the loop
- the iterator is incremented: C again
- the second line of the loop body is hit -> interpreted by R -> sent to C code -> executed -> changed back into an R result -> and so on

Complicated and/or multiple lines of code only compound the problem, because you have to go up and down multiple times at each iteration. Looping at the C level gets rid of all those "translations" between C and R, save two, and thereby mightily increases efficiency. Hence, even if you are using the same (or, heaven forbid, a faster!) algorithm at the R level, it can look super slow because of all the moving up and down the ladder. I don't know how the C code behind unique() is implemented, but my guess is it's more or less like what you have now, with more efficient memory usage/preallocation; it just looks *much* faster because of the C architecture.

DISCLAIMER: there are quite a few inaccuracies in here, most small, maybe a few large, and I am probably only aware of a small fraction of them; this wasn't intended to be a super accurate explanation.

On another note, I should explain my solution a little more clearly. A straight call to unique() would check for unique ROWS, not unique values of x. So I take x, make a copy so as not to harm the original object, strip it of its dimensionality (thereby converting it to a vector efficiently), and then apply unique(), which will now find unique values. It's not a huge thing, but it's not immediately apparent from what I did.

Hope this helps,

Michael
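A rough, scaled-down timing sketch of the point (20,000 values drawn with sample() instead of the real 24 million; the seed and value range are assumptions for illustration):

```r
# Compare the grow-by-rbind lookup loop to one vectorised unique() call
set.seed(1)
Dat <- matrix(sample(1:500, 20000, replace = TRUE), ncol = 2)

t_loop <- system.time({
  ltable <- c()
  for (i in 1:nrow(Dat)) {
    for (j in 1:2) {
      if (is.na(match(Dat[i, j], ltable))) {
        ltable <- rbind(ltable, Dat[i, j])
      }
    }
  }
})

t_vec <- system.time({
  ltable2 <- unique(as.vector(Dat))
})

# Both approaches find the same set of values
setequal(as.vector(ltable), ltable2)
```

Even at this small scale the vectorised version is orders of magnitude quicker, and the gap only widens as the data grow.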
On Thu, Oct 6, 2011 at 11:59 AM, Patrick Burns <pburns at pburns.seanet.com> wrote:
Thank you Michael and Patrick for your responses. Michael - your code ran in
under 5 minutes, which I find stunning, and Patrick I have sent the Inferno
doc to the copier for printing and reading this weekend.
I now have 8 million values in my lookup table and want to replace each
value in Dat with the index of that value in the lookup table. In line with
Chapter 2 in the Inferno doc, I created a list of appropriate size first,
rather than growing it, but still couldn't figure out how to do it without
looping in R, so it still runs extremely slowly, even just to process the
first 1000 values in Dat. My original code (before I tried specifying the
size of Dat2) was:
Dat2 <- c()
for (i in 1:nrow(Dat)) {
  for (j in 1:2) {
    Dat2 <- c(Dat2, match(Dat[i, j], ltable))
  }
}
write(t(edgelist), "EL.txt", ncolumns = 2)
Can anyone suggest a way of doing this without looping in R? Or is the
bottleneck the c() function? I am looking at apply() this morning, but
Gentleman (2009) suggests apply isn't very efficient.
--
View this message in context: http://r.789695.n4.nabble.com/Running-slow-tp3878093p3881365.html
Sent from the R help mailing list archive at Nabble.com.
Hi Thomas,

if I'm not completely mistaken,

Dat2 <- match(t(Dat), ltable)

should do what you want.

Hth
--
Gerrit
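A toy illustration of Gerrit's one-liner (Dat and ltable here are small stand-ins for the real objects):

```r
# Small stand-ins for Dat and ltable
Dat <- matrix(c(10, 30, 20, 10), ncol = 2)   # rows: (10, 20) and (30, 10)
ltable <- unique(as.vector(Dat))             # 10 30 20

# t(Dat) walks the matrix row by row, so Dat2 holds the lookup index
# of each value in row-major order
Dat2 <- match(t(Dat), ltable)                # 1 3 2 1

# Round trip: indexing ltable by Dat2 recovers the original values
all(ltable[Dat2] == as.vector(t(Dat)))       # TRUE
```

One vectorised match() call over t(Dat) does all 24 million lookups at the C level, replacing the double loop and the repeated c() copies in one go.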
On Fri, 7 Oct 2011, thomas.chesney wrote:
Gerrit,

Looks like it does and in less than--an incredible--one minute! Thank you!
Making a bit more sense now:

"If you are translating code into R that has a double for loop, think." The R Inferno, page 18.