practical to loop over 2 million rows?
Hi Jay,

A few comments:

1) As you know, vectorize when possible. Even if you must have a loop, perhaps you can avoid nested loops or at least speed up each iteration.

2) Write your loop in a function and then byte-compile it using cmpfun() from the compiler package. This can help dramatically (though still not to the extent of vectorization).

3) If you really need to speed up some aspect and are stuck with a loop, check out the R + Rcpp + inline + C++ tool chain, which lets you write inline C++ code, compile it fairly easily, and move data to and from it. Here is an example of a question I answered on SO where the OP had an algorithm to implement in R; I ran through the pure R implementation, the byte-compiled R implementation, and one using Rcpp, and compared timings. It should give you a sense of what you are dealing with, at least.

You are correct that some things can help speed up R loops, such as preallocation, and also, depending on what you are doing, some classes are faster than others. If you are working with a vector of integers, don't store them as doubles in a data frame (that is a silly extreme, but hopefully you get the point).

Good luck,

Josh
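A minimal sketch of point 2, byte-compiling a loop with cmpfun() from the compiler package (the function names and data here are illustrative, not from the thread):

```r
library(compiler)

# A deliberately loop-heavy function: summing a vector without vectorization
slow_sum <- function(x) {
  total <- 0
  for (i in seq_along(x)) total <- total + x[i]
  total
}

# cmpfun() returns a byte-compiled version of the same function
fast_sum <- cmpfun(slow_sum)

x <- as.numeric(1:1e5)
stopifnot(slow_sum(x) == fast_sum(x))  # identical results
# system.time(slow_sum(x)); system.time(fast_sum(x))  # compare timings yourself
```

The compiled version computes exactly the same answer; only the evaluation of the loop body is faster.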
On Wed, Oct 10, 2012 at 1:31 PM, Jay Rice <jsrice18 at gmail.com> wrote:
New to R and having issues with loops. I am aware that I should use
vectorization whenever possible and use the apply functions, however,
sometimes a loop seems necessary.
I have a data set of 2 million rows and have tried running a couple of loops of
varying complexity to test efficiency. If I do a very simple loop such as
add every item in a column I get an answer quickly.
If I use a nested ifelse statement in a loop it takes me 13 minutes to get
an answer on just 50,000 rows. I am aware of a few methods to speed up
loops. Preallocating memory space and compute as much outside of the loop
as possible (or use create functions and just loop over the function) but
it seems that even with these speed ups I might have too much data to run
loops. Here is the loop I ran that took 13 minutes. I realize I can
accomplish the same goal using vectorization (and in fact did so).
y <- numeric(length(x))
for (i in seq_along(x)) {
  if (!is.na(x[i])) y[i] <- x[i]
  else if (strataID[i + 1] == strataID[i]) y[i] <- x[i + 1]
  else y[i] <- x[i - 1]
}
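For reference, the vectorized version Jay alludes to might look like this sketch, which builds shifted copies of x so the whole loop collapses into one ifelse() call (the toy data below are made up for illustration):

```r
# toy data (illustrative only)
x <- c(1, NA, 3, NA, 5)
strataID <- c(1, 1, 1, 2, 2)

n <- length(x)
x_next <- c(x[-1], NA)                               # x[i + 1], NA past the end
x_prev <- c(NA, x[-n])                               # x[i - 1], NA before the start
same_stratum <- c(strataID[-1] == strataID[-n], NA)  # strataID[i + 1] == strataID[i]

y <- ifelse(!is.na(x), x, ifelse(same_stratum, x_next, x_prev))
y  # 1 3 3 5 5
```

This does the same work as the loop but in a handful of whole-vector operations, which is where R is fast.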
Presumably, complicated loops would be more intensive than the nested if
statement above. If I write more efficient loops time will come down but I
wonder if I will ever be able to write efficient enough code to perform a
complicated loop over 2 million rows in a reasonable time.
Is it useless for me to try to do any complicated loops on 2 million rows,
or if I get much better at programming in R will it be manageable even for
complicated situations?
Jay
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/