
slow computation of functions over large datasets

11 messages · Caroline Faisst, ONKELINX, Thierry, jim holtman +3 more

#
Dear Caroline,

Here is a faster and more elegant solution.
> system.time({
+ 	ddply(exampledata, .(orderID), function(x){
+ 		data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+ 	})
+ })
   user  system elapsed 
   1.67    0.00    1.69
For comparison, the original loop-based version:

> system.time(for (i in 2:nrow(exampledata))
+ {exampledata[i,"orderAmount"]<-ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],exampledata[i,"itemPrice"])})
   user  system elapsed 
  11.94    0.02   11.97

Best regards,

Thierry
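For readers without the thread's data, a minimal stand-in (hypothetical values; column names orderID/itemPrice match those used throughout the thread), with a base-R way to get the same per-order running total that both snippets above compute:

```r
# Hypothetical stand-in for the thread's exampledata: line items grouped
# by order; orderAmount is the running total within each order.
exampledata <- data.frame(
  orderID   = c(1, 1, 2, 2, 2, 3),
  itemPrice = c(10, 20, 5, 5, 15, 40)
)

# Per-order running total with split()/cumsum() (base R, no packages).
# This works here because the rows are already sorted by orderID.
exampledata$orderAmount <- unlist(
  lapply(split(exampledata$itemPrice, exampledata$orderID), cumsum),
  use.names = FALSE
)
exampledata$orderAmount  # 10 30 5 10 25 40
```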
#
On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:

Ouch. Schools really should stop teaching SAS and BASIC as a first language.
Try instead using 'ave' to calculate a cumulative 'sum' within "orderID":

exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))

I assure you this will be more reproducible, faster, and understandable.
A "medium" dataset, really. Barely nudges the RAM dial on my machine.

> system.time( exampledata2$orderAmt <- with(exampledata2, ave(itemPrice, orderID, FUN = cumsum)) )
    user  system elapsed
  35.106   0.811  35.822

On a three-year-old machine. Not as fast as I expected, but not long enough to require refilling the coffee cup either.

-- 
David.
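David's ave() idiom, sketched on a tiny hypothetical data frame (names assumed, not the OP's real data). The point is that ave() returns a vector aligned with the input rows, which is why it can be assigned straight back into the same data.frame:

```r
exampledata <- data.frame(
  orderID   = c(1, 1, 2, 2, 2),
  itemPrice = c(10, 20, 5, 5, 15)
)

# ave() applies FUN within each orderID group and returns the results
# in the original row order:
exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
exampledata$orderAmt  # 10 30 5 10 25
```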

  
    
#
On Aug 3, 2011, at 9:59 AM, ONKELINX, Thierry wrote:

I tried running this method on the "large dataset" (2MM rows) the OP offered, and eventually had to interrupt it to get my console back:

> system.time({
+ 	ddply(exampledata2, .(orderID), function(x){
+ 		data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+ 	})
+ })

Timing stopped at: 808.473 1013.749 1816.125

The same task with ave() took 35 seconds.
#
This takes about 2 secs for 1M rows:
+                         , list(total = sum(itemPrice))
+                         , by = orderID
+                         ]
+            )
   user  system elapsed
   1.30    0.05    1.34
Classes 'data.table' and 'data.frame':  198708 obs. of  2 variables:
 $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76
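A self-contained sketch of the data.table approach (the first line of the call above was lost from the archive; the data and names here are hypothetical). sum() gives one total per order, as in the output above, while cumsum() reproduces the OP's running total:

```r
library(data.table)

dt <- data.table(
  orderID   = c(1, 1, 2, 2, 2),
  itemPrice = c(10, 20, 5, 5, 15)
)

# One row per orderID, as in the timing above:
totals <- dt[, list(total = sum(itemPrice)), by = orderID]
totals$total  # 30 25

# The OP's per-order running total, added as a column by reference:
dt[, orderAmount := cumsum(itemPrice), by = orderID]
dt$orderAmount  # 10 30 5 10 25
```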
On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
<caroline.faisst at gmail.com> wrote:

#
On Aug 3, 2011, at 12:20 PM, jim holtman wrote:

Interesting. Impressive. And I noted that the OP wanted what cumsum would provide, and for some reason creating that longer result is even faster on my machine than the shorter result using sum.
#
Hello,
  Perhaps transpose the table, attach(as.data.frame(t(data))), and use the colSums() function with orderID as header.
             -Ken Hutchison
On Aug 3, 2554 BE, at 1:12 PM, David Winsemius <dwinsemius at comcast.net> wrote:

#
On Aug 3, 2011, at 2:01 PM, Ken wrote:

Got any code? The OP offered a reproducible example, after all.
#
Sorry about the lack of code, but using David's example, would:
tapply(itemPrice, INDEX=orderID, FUN=sum)
work?
  -Ken Hutchison
On Aug 3, 2554 BE, at 2:09 PM, David Winsemius <dwinsemius at comcast.net> wrote:

#
On Aug 3, 2011, at 3:05 PM, Ken wrote:

Doesn't do the cumulative sums or the assignment into a column of the same data.frame. That's why I used ave, because it keeps the sequence correct.
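The distinction David is drawing, sketched on hypothetical data: tapply() collapses to one value per group, while ave() keeps one value per row, in the original row order:

```r
orderID   <- c(1, 1, 2, 2, 2)
itemPrice <- c(10, 20, 5, 5, 15)

# tapply: one summary per group -- the shape no longer matches the rows
totals <- tapply(itemPrice, INDEX = orderID, FUN = sum)
as.numeric(totals)  # 30 25

# ave: same length as the input, so it drops into the same data.frame
ave(itemPrice, orderID, FUN = cumsum)  # 10 30 5 10 25
```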