
practical to loop over 2million rows?

6 messages · Jay, Joshua Wiley, David Winsemius +2 more

#
Hi Jay,

A few comments.

1) As you know, vectorize when possible.  Even if you must have a
loop, perhaps you can avoid nested loops or at least speed each
iteration.
2) Write your loop in a function and then byte compile it using the
cmpfun() function from the compiler package.  This can help
dramatically (though still not to the extent of vectorization).
3) If you really need to speed up some aspect and are stuck with a
loop, check out the R + Rcpp + inline + C++ tool chain, which allows
you to write inline C++ code, compile it fairly easily, and move data
to and from it.
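
As a rough illustration of point 2, here is a minimal sketch of byte compiling a loop with cmpfun(). The function running_sum and its test data are made up for the example. In recent versions of R, where functions are byte compiled just in time by default, the explicit step may make little difference, so time it on your own setup.

```r
library(compiler)  # ships with R, no extra install needed

# Hypothetical example: a running total written as an explicit loop
running_sum <- function(v) {
  out <- numeric(length(v))   # preallocate the result
  total <- 0
  for (i in seq_along(v)) {
    total <- total + v[i]
    out[i] <- total
  }
  out
}

running_sum_c <- cmpfun(running_sum)  # byte-compiled copy of the same function

v <- as.numeric(1:10)
res_plain <- running_sum(v)
res_comp  <- running_sum_c(v)

# Both versions should agree with the vectorized cumsum()
isTRUE(all.equal(res_plain, cumsum(v)))
isTRUE(all.equal(res_comp, cumsum(v)))
```

Note that cumsum() does the same job with no loop at all, which is point 1 again.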

Here is an example of a question I answered on SO where the OP had an
algorithm to implement in R. I ran through the R implementation,
the compiled R implementation, and one using Rcpp, and compared timings.
It should give you a bit of a sense for what you are dealing with at
least.

You are correct that some things can help speed in R loops, such as
preallocation, and also depending what you are doing, some classes are
faster than others.  If you are working with a vector of integers,
don't store them as doubles in a data frame (that is a silly extreme,
but hopefully you get the point).
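
To make the storage point concrete, here is a small sketch comparing the same values held as integers and as doubles. The vector names and the size n are arbitrary choices for the example.

```r
n <- 1e5
ids_int <- rep(1:3, length.out = n)   # integer vector
ids_dbl <- as.numeric(ids_int)        # same values stored as doubles

typeof(ids_int)   # "integer"
typeof(ids_dbl)   # "double"

# Integers use 4 bytes per element and doubles 8,
# so the double copy is roughly twice as large
size_int <- object.size(ids_int)
size_dbl <- object.size(ids_dbl)
size_int
size_dbl
```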

Good luck,

Josh
On Wed, Oct 10, 2012 at 1:31 PM, Jay Rice <jsrice18 at gmail.com> wrote:

  
    
#
On Oct 10, 2012, at 1:31 PM, Jay Rice wrote:

            
You should describe what you want to do, and you should learn to use the vectorized capabilities of R and leave the for-loops for processes that really need them.
Instead:

y[!is.na(x)] <- x[!is.na(x)]  # No loop.
When you index outside the range of the length of x you get NA as a result. Furthermore you are setting y to be only a single element. So I think 'y' will be a single NA at the end of all this.
[1] 1 1 2 2 1 2 2 2 2 1
[1] NA

There is no implicit indexing of the LHS of an assignment operation. How long is strataID? And why not do this inside a data frame?
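
A small sketch of the two behaviours just described, using a made-up three-element x (not the original poster's data):

```r
x <- c(10, NA, 30)        # made-up data

past_end <- x[length(x) + 1]
past_end                  # NA: reading past the end of a vector gives NA, not an error

y <- numeric(length(x))
y <- x[2]                 # no index on the left: ALL of y is replaced
len_after <- length(y)    # y now has length 1

y <- numeric(length(x))
keep <- !is.na(x)
y[keep] <- x[keep]        # indexed assignment touches only the selected elements
y                         # 10 0 30
```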
You will gain efficiency when you learn vectorization. And when you learn to test your code for correct behavior.
David Winsemius, MD
Alameda, CA, USA
#
This is a classic example from my tag line:

Tell me what  you want to do, not how you want to do it.

For example you provided no information as to what the objects were.
I hope that 'stratID' is at least of length one greater than 'x' based
on your loops.  Also on the last iteration you are trying to access an
element outside of x (x[length(x) + 1]).

The first part is easy for setting 'y'

indx <- !is.na(x)
y[indx] <- x[indx]

For the second part you can do something like:

indx <- head(stratID, -1) == tail(stratID, -1)  # get the comparison

but since you did not provide any data, the rest is left to the reader.
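
For anyone unsure what that comparison produces, here is a sketch on a made-up 'stratID' vector (the values are invented for illustration only):

```r
stratID <- c("a", "a", "b", "b", "b", "c")   # made-up example data

# Element i is TRUE when stratID[i] equals stratID[i + 1]
indx <- head(stratID, -1) == tail(stratID, -1)
indx           # TRUE FALSE TRUE TRUE FALSE
length(indx)   # one shorter than stratID
```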
On Wed, Oct 10, 2012 at 5:16 PM, David Winsemius <dwinsemius at comcast.net> wrote:

  
    
#
On Oct 10, 2012, at 6:45 PM, jim holtman wrote:

            
That's perhaps faster than the approach I offered because it only uses is.na(x) once.
Jay; if you have not figured it out yet, Jim Holtman is one of the premier data-meisters around here. You could probably write an excellent book simply by going to the Archives and pasting together all the elegant solutions he has provided over the years. His moniker 'Data Munger Guru' is well deserved.
"You" meaning Jay. (At least I hope that is what Jim meant.) 

Jim; Perhaps your tagline should say: "Tell me what you have, and only then, what you want to do with it."
#
Maybe take a closer look at the ifelse help page and the examples?

First, ifelse is intended to be vectorized. If you nest it in a loop, you're effectively nesting a loop inside a loop. And by putting ifelse inside ifelse, you've done that twice. And then you've run the loops on vectors of length one, so 'twas all in vain...
Second, the two things after the condition in ifelse are not instructions, they are arguments to the function. Putting y<-something in as an argument means '(promise to) store something in a variable called y, and then pass y to the function'. You probably didn't mean that.
Third, ifelse returns a vector of the results; you're not using the return value for anything.
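
A minimal sketch of how ifelse is meant to be called, on made-up data: one call covers every element, and the result only survives if you assign it.

```r
x <- c(-3, 0.5, NA, 2)       # made-up data

# One call handles the whole vector; no surrounding loop is needed
res <- ifelse(x < 0, 0, x)   # the return value must be assigned to be kept
res                          # 0.0 0.5 NA 2.0
```

The NA in x gives an NA in the result, because the test x < 0 is itself NA there.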

For a single 'if' that takes some action, you want 'if' and 'else' _separately_, not 'ifelse'.
y<-length(x) #length() already returns a numeric value.
So if you must do this with a loop, it would look more like
 
for(i in 1:(length(x)-1)) { #stop one short, because x[i+1] won't be there for the last i otherwise
	if (!is.na(x[i])) y[i]<-x[i]
	if(strataID[i+1]==strataID[i]) y[i]<-x[i+1] else y[i]<-x[i] #I changed the second x index because I can't see why it differed from the strataID index
	#or, using the fact that 'if' also returns something:
	# y[i] <- if(strataID[i+1]==strataID[i]) x[i+1] else x[i]
}

Finally, if you don't preallocate y at the length you want, R will have to move the whole of y to a new memory location with one more space every time you append something to it. There's a section on that in The R Inferno. Growing a vector one element at a time is a really good way of slowing R down.
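
A rough sketch of the difference, with two made-up helper functions (grow and prealloc) that fill the same vector. The timings depend heavily on your R version and machine, so treat them as something to try rather than a benchmark.

```r
n <- 1e4

# Hypothetical helper: starts empty, so y is extended on every pass
grow <- function(n) {
  y <- numeric(0)
  for (i in seq_len(n)) y[i] <- i^2
  y
}

# Hypothetical helper: y is allocated once at its final length
prealloc <- function(n) {
  y <- numeric(n)
  for (i in seq_len(n)) y[i] <- i^2
  y
}

y_grow <- grow(n)
y_pre  <- prealloc(n)
identical(y_grow, y_pre)   # same result either way

system.time(grow(n))
system.time(prealloc(n))
```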

So let's try something else.
strataID <- sample(letters[1:3], 2000000, replace=TRUE) #a nice long strata identifier with some matches likely
x <- rnorm(2000000) #some random numbers
x <- ifelse(x < -2, NA, x) #a few NA's now in x, though it does take a few seconds for the 2 million observations

i <- 1:(length(x)-1)  #A long indexing vector with space for the last x[i+1]
y <- x  #That puts all the NA's in the right place in y, allocates y and happens to put all the current values of x into y too.
system.time( y[i]<-ifelse( strataID[i+1]==strataID[i], x[i+1], x[i]  ) )
                              #does the whole loop and stores it in the 'right' places in y -
                              # though it will foul up those NA's because of your x indexing. And incidentally it doesn't change the last y either
                              #On my allegedly 2GHz machine the system.time result was 2.87 seconds for the 2 million 'rows'


#Incidentally, a look at what we ended up with:
data.frame(s=strataID, y=y)[1:30,]
#says you probably aren't getting anything useful from the exercise other than a feel for what can go wrong with loops.
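
If you want to convince yourself that the vectorized line does the same thing as an explicit loop, a check on a small made-up data set is cheap. This is only a sketch: the seed, the size of 20 and the NA positions are arbitrary, and the loop is a one-step-short version of the rule used above.

```r
set.seed(1)   # arbitrary seed, only so the example is repeatable
strataID <- sample(letters[1:3], 20, replace = TRUE)
x <- rnorm(20)
x[c(4, 11)] <- NA   # a couple of made-up NA positions

i <- 1:(length(x) - 1)

# Vectorized form
y_vec <- x
y_vec[i] <- ifelse(strataID[i + 1] == strataID[i], x[i + 1], x[i])

# Explicit loop applying the same rule one element at a time
y_loop <- x
for (k in i) {
  y_loop[k] <- if (strataID[k + 1] == strataID[k]) x[k + 1] else x[k]
}

identical(y_vec, y_loop)   # should be TRUE if the two forms agree
```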