slow computation of functions over large datasets
11 messages · Caroline Faisst, ONKELINX, Thierry, jim holtman +3 more
Dear Caroline, Here is a faster and more elegant solution.
n <- 10000
exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE), itemPrice = rpois(n, 10))
library(plyr)
system.time({
  ddply(exampledata, .(orderID), function(x) {
    data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
  })
})
   user  system elapsed
   1.67    0.00    1.69
exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[, 1])) {
  exampledata[i, "orderAmount"] <- ifelse(
    exampledata[i, "orderID"] == exampledata[i - 1, "orderID"],
    exampledata[i - 1, "orderAmount"] + exampledata[i, "itemPrice"],
    exampledata[i, "itemPrice"])
})
user system elapsed
11.94 0.02 11.97
Best regards,
Thierry
-----Original message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On behalf of: Caroline Faisst
Sent: Wednesday, August 3, 2011 3:26 PM
To: r-help at r-project.org
Subject: [R] slow computation of functions over large datasets
Hello there,
I'm computing the total value of an order from the price of the order items using a "for" loop and the "ifelse" function. I do this on a large data frame (close to 1m lines). The computation of this function is painfully slow: in one minute only about 90 rows are calculated.

The computation time taken for a given number of rows increases with the size of the dataset; see the example with my function below:
# small dataset: function performs well
exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[, 1])) {
  exampledata[i, "orderAmount"] <- ifelse(
    exampledata[i, "orderID"] == exampledata[i - 1, "orderID"],
    exampledata[i - 1, "orderAmount"] + exampledata[i, "itemPrice"],
    exampledata[i, "itemPrice"])
})
# large dataset: the very same computational task takes much longer
exampledata2 <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4,5:2000000),
                           itemPrice = c(10,17,9,12,25,10,1,9,7,25:2000020))
exampledata2[1, "orderAmount"] <- exampledata2[1, "itemPrice"]
system.time(for (i in 2:9) {
  exampledata2[i, "orderAmount"] <- ifelse(
    exampledata2[i, "orderID"] == exampledata2[i - 1, "orderID"],
    exampledata2[i - 1, "orderAmount"] + exampledata2[i, "itemPrice"],
    exampledata2[i, "itemPrice"])
})
Does someone know a way to increase the speed?
Thank you very much!
Caroline
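The slowdown comes largely from per-row indexing and assignment on a data.frame, which is expensive on every iteration; the same recurrence over plain atomic vectors runs in a fraction of the time. A minimal, self-contained sketch on the small example data (the intermediate vector names are illustrative, not from the thread):

```r
# Sketch: extract plain vectors once, loop over those, assign the
# finished column back in a single step.
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))
orderID   <- exampledata$orderID
itemPrice <- exampledata$itemPrice
orderAmount <- numeric(length(itemPrice))
orderAmount[1] <- itemPrice[1]
for (i in 2:length(itemPrice)) {
  # restart the running sum whenever the order ID changes
  orderAmount[i] <- if (orderID[i] == orderID[i - 1]) orderAmount[i - 1] + itemPrice[i] else itemPrice[i]
}
exampledata$orderAmount <- orderAmount
orderAmount
# 10 27 36 12 37 10 11 20  7
```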
On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:
Hello there, I'm computing the total value of an order from the price of the order items using a "for" loop and the "ifelse" function.
Ouch. Schools really should stop teaching SAS and BASIC as a first language.
I do this on a large dataframe (close to 1m lines). The computation of this function is painfully slow: in 1min only about 90 rows are calculated.
The computation time taken for a given number of rows increases with the size of the dataset, see the example with my function below:
# small dataset: function performs well
exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[, 1])) {
  exampledata[i, "orderAmount"] <- ifelse(
    exampledata[i, "orderID"] == exampledata[i - 1, "orderID"],
    exampledata[i - 1, "orderAmount"] + exampledata[i, "itemPrice"],
    exampledata[i, "itemPrice"])
})
Try instead using 'ave' to calculate a cumulative 'sum' within "orderID":

exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))

I assure you this will be more reproducible, faster, and understandable.
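On the OP's small example, the suggested one-liner fills the column in a single vectorized step; a minimal, self-contained sketch (the column name orderAmount is chosen here to match the OP's loop):

```r
# ave() applies cumsum within each orderID group and returns a vector
# aligned with the original rows, so it can be assigned back directly.
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))
exampledata$orderAmount <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
exampledata$orderAmount
# 10 27 36 12 37 10 11 20  7
```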
# large dataset:
"medium" dataset really. Barely nudges the RAM dial on my machine.
the very same computational task takes much longer
exampledata2 <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4,5:2000000),
                           itemPrice = c(10,17,9,12,25,10,1,9,7,25:2000020))
exampledata2[1, "orderAmount"] <- exampledata2[1, "itemPrice"]
system.time(for (i in 2:9) {
  exampledata2[i, "orderAmount"] <- ifelse(
    exampledata2[i, "orderID"] == exampledata2[i - 1, "orderID"],
    exampledata2[i - 1, "orderAmount"] + exampledata2[i, "itemPrice"],
    exampledata2[i, "itemPrice"])
})
> system.time(exampledata2$orderAmt <- with(exampledata2, ave(itemPrice, orderID, FUN = cumsum)))
   user  system elapsed
 35.106   0.811  35.822
On a three year-old machine. Not as fast as I expected, but not long
enough to require refilling the coffee cup either.
--
David.
David Winsemius, MD West Hartford, CT
On Aug 3, 2011, at 9:59 AM, ONKELINX, Thierry wrote:
I tried running this method on the "large dataset" (2M rows) the OP offered, and needed to eventually interrupt it so I could get my console back:
> system.time({
+   ddply(exampledata2, .(orderID), function(x) {
+     data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+   })
+ })
Timing stopped at: 808.473 1013.749 1816.125
The same task with ave() took 35 seconds.
david.
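Part of the explanation is the group count: in the OP's exampledata2 nearly every orderID occurs exactly once, so a split-per-group approach like ddply must materialize almost two million one-row data.frames. A quick check, assuming the OP's construction:

```r
# Rebuild the OP's "large" example and count the groups.
exampledata2 <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5:2000000),
                           itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7, 25:2000020))
nrow(exampledata2)                    # 2000005 rows
length(unique(exampledata2$orderID))  # 2000000 distinct orders
```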
David Winsemius, MD
West Hartford, CT
This takes about 2 secs for 1M rows:
> n <- 1000000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
+                           itemPrice = rpois(n, 10))
> require(data.table)
> # convert to data.table
> ed.dt <- data.table(exampledata)
> system.time(result <- ed.dt[
+   , list(total = sum(itemPrice))
+   , by = orderID
+ ])
   user  system elapsed
   1.30    0.05    1.34
> str(result)
Classes 'data.table' and 'data.frame':  198708 obs. of  2 variables:
 $ orderID: int  1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num  49 37 72 92 50 76 34 22 65 39 ...
> head(result)
     orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
On Aug 3, 2011, at 12:20 PM, jim holtman wrote:
This takes about 2 secs for 1M rows:
> n <- 1000000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
+                           itemPrice = rpois(n, 10))
> require(data.table)
> ed.dt <- data.table(exampledata)
> system.time(result <- ed.dt[, list(total = sum(itemPrice)), by = orderID])
   user  system elapsed
   1.30    0.05    1.34
Interesting. Impressive. And I noted that the OP wanted what cumsum would provide and for some reason creating that longer result is even faster on my machine than the shorter result using sum.
David.
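For completeness, the per-order cumulative sum David refers to can also be computed inside data.table itself; a sketch, assuming a data.table version that supports grouped `:=` assignment (later releases do; the 2011-era version used above may not):

```r
# Grouped := assigns the per-order cumulative sum by reference,
# keeping the original row order intact.
library(data.table)
ed.dt <- data.table(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                    itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))
ed.dt[, orderAmount := cumsum(itemPrice), by = orderID]
ed.dt$orderAmount
# 10 27 36 12 37 10 11 20  7
```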
David Winsemius, MD
West Hartford, CT
Hello,
Perhaps transpose the table, attach(as.data.frame(t(data))), and use the colSums() function with order ID as header.
-Ken Hutchison
On Aug 3, 2011, at 2:01 PM, Ken wrote:
Hello,
Perhaps transpose the table, attach(as.data.frame(t(data))), and use the colSums() function with order ID as header.
-Ken Hutchison
Got any code? The OP offered a reproducible example, after all.
David.
David Winsemius, MD
West Hartford, CT
Sorry about the lack of code, but using David's example, would tapply(itemPrice, INDEX = orderID, FUN = sum) work?
-Ken Hutchison
On Aug 3, 2011, at 3:05 PM, Ken wrote:
Sorry about the lack of code, but using David's example, would tapply(itemPrice, INDEX = orderID, FUN = sum) work?
That doesn't do the cumulative sums or the assignment into a column of the same data.frame. That's why I used ave(): it keeps the sequence correct.
David.
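The difference is easy to see on the small example: tapply() collapses the data to one total per order, while ave(..., FUN = cumsum) returns a vector aligned with the original rows, ready to assign back. A minimal sketch:

```r
# tapply() yields one value per group (named by orderID);
# ave() preserves the row-for-row alignment the OP wanted.
exampledata <- data.frame(orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                          itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7))
with(exampledata, tapply(itemPrice, orderID, sum))
#  1  2  3  4
# 36 37 20  7
with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
# 10 27 36 12 37 10 11 20  7
```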
David Winsemius, MD
West Hartford, CT