
slow computation of functions over large datasets

11 messages · Caroline Faisst, ONKELINX, Thierry, jim holtman +3 more

#
Dear Caroline,

Here is a faster and more elegant solution.
> system.time({
+ 	ddply(exampledata, .(orderID), function(x){
+ 		data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+ 	})
+ })
   user  system elapsed 
   1.67    0.00    1.69
For comparison, the original loop-based version:

> system.time(for (i in 2:nrow(exampledata))
+ {exampledata[i,"orderAmount"]<-ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],exampledata[i,"itemPrice"])})
   user  system elapsed 
  11.94    0.02   11.97

Best regards,

Thierry
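For readers without the thread's data, a minimal stand-in (hypothetical values; column names orderID/itemPrice match those used throughout the thread), with a base-R way to get the same per-order running total that both snippets above compute:

```r
# Hypothetical stand-in for the thread's exampledata: line items grouped
# by order; orderAmount is the running total within each order.
exampledata <- data.frame(
  orderID   = c(1, 1, 2, 2, 2, 3),
  itemPrice = c(10, 20, 5, 5, 15, 40)
)

# Per-order running total with split()/cumsum() (base R, no packages).
# This works here because the rows are already sorted by orderID.
exampledata$orderAmount <- unlist(
  lapply(split(exampledata$itemPrice, exampledata$orderID), cumsum),
  use.names = FALSE
)
exampledata$orderAmount  # 10 30 5 10 25 40
```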
#
On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:

Ouch. Schools really should stop teaching SAS and BASIC as a first language.
Try instead using 'ave' to calculate a cumulative 'sum' within "orderID":

exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))

I assure you this will be more reproducible, faster, and understandable.
A "medium" dataset, really. Barely nudges the RAM dial on my machine.

> system.time( exampledata2$orderAmt <- with(exampledata2, ave(itemPrice, orderID, FUN = cumsum)) )
    user  system elapsed
  35.106   0.811  35.822

On a three-year-old machine. Not as fast as I expected, but not long enough to require refilling the coffee cup either.

-- 
David.
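David's ave() idiom, sketched on a tiny hypothetical data frame (names assumed, not the OP's real data). The point is that ave() returns a vector aligned with the input rows, which is why it can be assigned straight back into the same data.frame:

```r
exampledata <- data.frame(
  orderID   = c(1, 1, 2, 2, 2),
  itemPrice = c(10, 20, 5, 5, 15)
)

# ave() applies FUN within each orderID group and returns the results
# in the original row order:
exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
exampledata$orderAmt  # 10 30 5 10 25
```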

  
    
#
On Aug 3, 2011, at 9:59 AM, ONKELINX, Thierry wrote:

I tried running this method on the "large dataset" (2MM rows) the OP offered, and eventually had to interrupt it to get my console back:

> system.time({
+ 	ddply(exampledata2, .(orderID), function(x){
+ 		data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+ 	})
+ })

Timing stopped at: 808.473 1013.749 1816.125

The same task with ave() took 35 seconds.
#
This takes about 2 secs for 1M rows:
+                         , list(total = sum(itemPrice))
+                         , by = orderID
+                         ]
+            )
   user  system elapsed
   1.30    0.05    1.34
Classes 'data.table' and 'data.frame':  198708 obs. of  2 variables:
 $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76
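A self-contained sketch of the data.table approach (the first line of the call above was lost from the archive; the data and names here are hypothetical). sum() gives one total per order, as in the output above, while cumsum() reproduces the OP's running total:

```r
library(data.table)

dt <- data.table(
  orderID   = c(1, 1, 2, 2, 2),
  itemPrice = c(10, 20, 5, 5, 15)
)

# One row per orderID, as in the timing above:
totals <- dt[, list(total = sum(itemPrice)), by = orderID]
totals$total  # 30 25

# The OP's per-order running total, added as a column by reference:
dt[, orderAmount := cumsum(itemPrice), by = orderID]
dt$orderAmount  # 10 30 5 10 25
```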
On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
<caroline.faisst at gmail.com> wrote:

#
On Aug 3, 2011, at 12:20 PM, jim holtman wrote:

Interesting. Impressive. And I noted that the OP wanted what cumsum would provide, and for some reason creating that longer result is even faster on my machine than the shorter result using sum.
#
Hello,
  Perhaps transpose the table, attach(as.data.frame(t(data))), and use the colSums() function with orderID as header.
             -Ken Hutchison
On Aug 3, 2554 BE, at 1:12 PM, David Winsemius <dwinsemius at comcast.net> wrote:

#
On Aug 3, 2011, at 2:01 PM, Ken wrote:

Got any code? The OP offered a reproducible example, after all.
#
Sorry about the lack of code, but using David's example, would:
tapply(itemPrice, INDEX=orderID, FUN=sum)
work?
  -Ken Hutchison
On Aug 3, 2554 BE, at 2:09 PM, David Winsemius <dwinsemius at comcast.net> wrote:

#
On Aug 3, 2011, at 3:05 PM, Ken wrote:

Doesn't do the cumulative sums or the assignment into a column of the same data.frame. That's why I used ave, because it keeps the sequence correct.
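The distinction David is drawing, sketched on hypothetical data: tapply() collapses to one value per group, while ave() keeps one value per row, in the original row order:

```r
orderID   <- c(1, 1, 2, 2, 2)
itemPrice <- c(10, 20, 5, 5, 15)

# tapply: one summary per group -- the shape no longer matches the rows
totals <- tapply(itemPrice, INDEX = orderID, FUN = sum)
as.numeric(totals)  # 30 25

# ave: same length as the input, so it drops into the same data.frame
ave(itemPrice, orderID, FUN = cumsum)  # 10 30 5 10 25
```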