Hi,
I need to perform calculations on subsets of a data frame:
DF = data.frame(read.table(textConnection(" A B C D E F
1 a 1995 0 4 1
2 a 1997 1 1 3
3 b 1995 3 7 0
4 b 1996 1 2 3
5 b 1997 1 2 3
6 b 1998 6 0 0
7 b 1999 3 7 0
8 c 1997 1 2 3
9 c 1998 1 2 3
10 c 1999 6 0 0
11 d 1999 3 7 0
12 e 1995 1 2 3
13 e 1998 1 2 3
14 e 1999 6 0 0"),head=TRUE,stringsAsFactors=FALSE))
I'd like to create new dataframes for each unique year in which for each
value of A, the values of D, E and F are summed over the last 3 years (e.g.
1998 = 1998, 1997, 1996):
Question 1: How do I go from DF to newDFyear?
Examples:
newDF1995
B D E F
a 0 4 1
b 3 7 0
e 1 2 3
newDF1998
B D E F
a 1 1 3
b 8 4 6
c 2 4 6
e 1 2 3
Then, for each new DF I need to generate a square matrix after doing the
following:
newDF1998$G<-newDF1998$D + newDF1998$E + newDF1998$F
newDF1998$D<-newDF1998$D/newDF1998$G
newDF1998$E<-newDF1998$E/newDF1998$G
newDF1998$F<-newDF1998$F/newDF1998$G
newDF1998<-NewDF1998[,c(-5)]
newDF1998
B D E F
a 0.2 0.2 0.6
b 0.4 0.2 0.3
c 0.2 0.3 0.5
e 0.2 0.3 0.5
Question 2: How do I go from newDF1998 to a matrix
a b c e
a
b
c
e
in which Cell ab = (0.2*0.4 + 0.2*0.2 + 0.6*0.3)/((0.2*0.2 + 0.2*0.2 +
0.6*0.6)^0.5) * ((0.4*0.4 + 0.2*0.2 + 0.3*0.3)^0.5) = 0.84
Thanks a lot for your help!
--
View this message in context: http://r.789695.n4.nabble.com/Yearly-aggregates-and-matrices-tp3438140p3438140.html
Sent from the R help mailing list archive at Nabble.com.
Yearly aggregates and matrices
8 messages · mathijsdevaan, Gabor Grothendieck
On Sat, Apr 9, 2011 at 5:14 AM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:
Hi,
I need to perform calculations on subsets of a data frame:
DF = data.frame(read.table(textConnection(" ? ?A ?B ?C ?D ?E ?F
1 a ?1995 ?0 ?4 ?1
2 a ?1997 ?1 ?1 ?3
3 b ?1995 ?3 ?7 ?0
4 b ?1996 ?1 ?2 ?3
5 b ?1997 ?1 ?2 ?3
6 b ?1998 ?6 ?0 ?0
7 b ?1999 ?3 ?7 ?0
8 c ?1997 ?1 ?2 ?3
9 c ?1998 ?1 ?2 ?3
10 c ?1999 ?6 ?0 ?0
11 d ?1999 ?3 ?7 ?0
12 e ?1995 ?1 ?2 ?3
13 e ?1998 ?1 ?2 ?3
14 e ?1999 ?6 ?0 ?0"),head=TRUE,stringsAsFactors=FALSE))
I'd like to create new dataframes for each unique year in which for each
value of A, the values of D, E and F are summed over the last 3 years (e.g.
1998 = 1998, 1997, 1996):
Question 1: How do I go from DF to newDFyear?
Examples:
newDF1995
B ?D ?E ?F
a ?0 ?4 ?1
b ?3 ?7 ?0
e ?1 ?2 ?3
newDF1998
B ?D ?E ?F
a ?1 ?1 ?3
b ?8 ?4 ?6
c ?2 ?4 ?6
e ?1 ?2 ?3
Then, for each new DF I need to generate a square matrix after doing the
following:
newDF1998$G<-newDF1998$D + newDF1998$E + newDF1998$F
newDF1998$D<-newDF1998$D/newDF1998$G
newDF1998$E<-newDF1998$E/newDF1998$G
newDF1998$F<-newDF1998$F/newDF1998$G
newDF1998<-NewDF1998[,c(-5)]
newDF1998
B ?D ?E ?F
a ?0.2 ?0.2 ?0.6
b ?0.4 ?0.2 ?0.3
c ?0.2 ?0.3 ?0.5
e ?0.2 ?0.3 ?0.5
Question 2: How do I go from newDF1998 to a matrix
?a ?b ?c ?e
a
b
c
e
in which Cell ab = (0.2*0.4 + 0.2*0.2 + 0.6*0.3)/((0.2*0.2 + 0.2*0.2 +
0.6*0.6)^0.5) * ((0.4*0.4 + 0.2*0.2 + 0.3*0.3)^0.5) = 0.84
First we use read.zoo to reform DF into a multivariate time series and
use rollapply (where we have used the devel version of zoo since it
supports the partial= argument on rollapply). We then reform each
resulting row into a matrix converting each row of each matrix to
proportions. Finally we form the desired scaled cross product.
# devel version of zoo
install.packages("zoo", repos = "http://r-forge.r-project.org")
library(zoo)
z <- read.zoo(DF, split = 2, index = 3, FUN = identity)
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE)
newDF <- lapply(1:nrow(r), function(i)
prop.table(na.omit(matrix(r[i,], nc = 4, byrow = TRUE,
dimnames = list(unique(DF$B), names(DF)[-2:-3]))[, -1]), 1))
names(newDF) <- time(z)
lapply(mats, function(mat) tcrossprod(mat / sqrt(rowSums(mat^2))))
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
On Sat, Apr 9, 2011 at 11:45 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
On Sat, Apr 9, 2011 at 5:14 AM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:
Hi,
I need to perform calculations on subsets of a data frame:
DF = data.frame(read.table(textConnection(" ? ?A ?B ?C ?D ?E ?F
1 a ?1995 ?0 ?4 ?1
2 a ?1997 ?1 ?1 ?3
3 b ?1995 ?3 ?7 ?0
4 b ?1996 ?1 ?2 ?3
5 b ?1997 ?1 ?2 ?3
6 b ?1998 ?6 ?0 ?0
7 b ?1999 ?3 ?7 ?0
8 c ?1997 ?1 ?2 ?3
9 c ?1998 ?1 ?2 ?3
10 c ?1999 ?6 ?0 ?0
11 d ?1999 ?3 ?7 ?0
12 e ?1995 ?1 ?2 ?3
13 e ?1998 ?1 ?2 ?3
14 e ?1999 ?6 ?0 ?0"),head=TRUE,stringsAsFactors=FALSE))
I'd like to create new dataframes for each unique year in which for each
value of A, the values of D, E and F are summed over the last 3 years (e.g.
1998 = 1998, 1997, 1996):
Question 1: How do I go from DF to newDFyear?
Examples:
newDF1995
B ?D ?E ?F
a ?0 ?4 ?1
b ?3 ?7 ?0
e ?1 ?2 ?3
newDF1998
B ?D ?E ?F
a ?1 ?1 ?3
b ?8 ?4 ?6
c ?2 ?4 ?6
e ?1 ?2 ?3
Then, for each new DF I need to generate a square matrix after doing the
following:
newDF1998$G<-newDF1998$D + newDF1998$E + newDF1998$F
newDF1998$D<-newDF1998$D/newDF1998$G
newDF1998$E<-newDF1998$E/newDF1998$G
newDF1998$F<-newDF1998$F/newDF1998$G
newDF1998<-NewDF1998[,c(-5)]
newDF1998
B ?D ?E ?F
a ?0.2 ?0.2 ?0.6
b ?0.4 ?0.2 ?0.3
c ?0.2 ?0.3 ?0.5
e ?0.2 ?0.3 ?0.5
Question 2: How do I go from newDF1998 to a matrix
?a ?b ?c ?e
a
b
c
e
in which Cell ab = (0.2*0.4 + 0.2*0.2 + 0.6*0.3)/((0.2*0.2 + 0.2*0.2 +
0.6*0.6)^0.5) * ((0.4*0.4 + 0.2*0.2 + 0.3*0.3)^0.5) = 0.84
First we use read.zoo to reform DF into a multivariate time series and
use rollapply (where we have used the devel version of zoo since it
supports the partial= argument on rollapply). ?We then reform each
resulting row into a matrix converting each row of each matrix to
proportions. ?Finally we form the desired scaled cross product.
# devel version of zoo
install.packages("zoo", repos = "http://r-forge.r-project.org")
library(zoo)
z <- read.zoo(DF, split = 2, index = 3, FUN = identity)
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(z, 3, ?sum.na, align = "right", partial = TRUE)
newDF <- lapply(1:nrow(r), function(i)
? ? ? ?prop.table(na.omit(matrix(r[i,], nc = 4, byrow = TRUE,
? ? ? ? ? ? ? ?dimnames = list(unique(DF$B), names(DF)[-2:-3]))[, -1]), 1))
names(newDF) <- time(z)
lapply(mats, function(mat) tcrossprod(mat / sqrt(rowSums(mat^2))))
mats in the last line should be newDF: lapply(newDF, function(mat) tcrossprod(mat / sqrt(rowSums(mat^2))))
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
17 days later
Hi,
Is there an alternative to "z <- read.zoo(DF, split = 2, index = 3, FUN =
identity)" and "r <- rollapply(z, 3, sum.na, align = "right", partial =
TRUE)"? I am trying to use the following script in which the split data (B)
contains about 300000 unique cases and obviously I am getting an allocation
error. Thanks!
# devel version of zoo
install.packages("zoo", repos = "http://r-forge.r-project.org")
DF = data.frame(read.table(textConnection(" ?A ?B ?C ?D ?E ?F
1 a ?1995 ?0 ?4 ?1
2 a ?1997 ?1 ?1 ?3
3 b ?1995 ?3 ?7 ?0
4 b ?1996 ?1 ?2 ?3
5 b ?1997 ?1 ?2 ?3
6 b ?1998 ?6 ?0 ?0
7 b ?1999 ?3 ?7 ?0
8 c ?1997 ?1 ?2 ?3
9 c ?1998 ?1 ?2 ?3
10 c ?1999 ?6 ?0 ?0
11 d ?1999 ?3 ?7 ?0
12 e ?1995 ?1 ?2 ?3
13 e ?1998 ?1 ?2 ?3
14 e ?1999 ?6 ?0 ?0"),head=TRUE,stringsAsFactors=FALSE))
library(zoo)
z <- read.zoo(DF, split = 2, index = 3, FUN = identity)
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE)
newDF <- lapply(1:nrow(r), function(i)
prop.table(na.omit(matrix(r[i,], nc = 4, byrow = TRUE,
dimnames = list(unique(DF$B), names(DF)[-2:-3]))[, -1]), 1))
names(newDF) <- time(z)
lapply(newDF, function(mat) tcrossprod(mat / sqrt(rowSums(mat^2))))
Gabor Grothendieck wrote:
On Sat, Apr 9, 2011 at 5:14 AM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:
Hi,
I need to perform calculations on subsets of a data frame:
DF = data.frame(read.table(textConnection(" ? ?A ?B ?C ?D ?E ?F
1 a ?1995 ?0 ?4 ?1
2 a ?1997 ?1 ?1 ?3
3 b ?1995 ?3 ?7 ?0
4 b ?1996 ?1 ?2 ?3
5 b ?1997 ?1 ?2 ?3
6 b ?1998 ?6 ?0 ?0
7 b ?1999 ?3 ?7 ?0
8 c ?1997 ?1 ?2 ?3
9 c ?1998 ?1 ?2 ?3
10 c ?1999 ?6 ?0 ?0
11 d ?1999 ?3 ?7 ?0
12 e ?1995 ?1 ?2 ?3
13 e ?1998 ?1 ?2 ?3
14 e ?1999 ?6 ?0 ?0"),head=TRUE,stringsAsFactors=FALSE))
I'd like to create new dataframes for each unique year in which for each
value of A, the values of D, E and F are summed over the last 3 years
(e.g.
1998 = 1998, 1997, 1996):
Question 1: How do I go from DF to newDFyear?
Examples:
newDF1995
B ?D ?E ?F
a ?0 ?4 ?1
b ?3 ?7 ?0
e ?1 ?2 ?3
newDF1998
B ?D ?E ?F
a ?1 ?1 ?3
b ?8 ?4 ?6
c ?2 ?4 ?6
e ?1 ?2 ?3
Then, for each new DF I need to generate a square matrix after doing the
following:
newDF1998$G<-newDF1998$D + newDF1998$E + newDF1998$F
newDF1998$D<-newDF1998$D/newDF1998$G
newDF1998$E<-newDF1998$E/newDF1998$G
newDF1998$F<-newDF1998$F/newDF1998$G
newDF1998<-NewDF1998[,c(-5)]
newDF1998
B ?D ?E ?F
a ?0.2 ?0.2 ?0.6
b ?0.4 ?0.2 ?0.3
c ?0.2 ?0.3 ?0.5
e ?0.2 ?0.3 ?0.5
Question 2: How do I go from newDF1998 to a matrix
?a ?b ?c ?e
a
b
c
e
in which Cell ab = (0.2*0.4 + 0.2*0.2 + 0.6*0.3)/((0.2*0.2 + 0.2*0.2 +
0.6*0.6)^0.5) * ((0.4*0.4 + 0.2*0.2 + 0.3*0.3)^0.5) = 0.84
First we use read.zoo to reform DF into a multivariate time series and
use rollapply (where we have used the devel version of zoo since it
supports the partial= argument on rollapply). We then reform each
resulting row into a matrix converting each row of each matrix to
proportions. Finally we form the desired scaled cross product.
# devel version of zoo
install.packages("zoo", repos = "http://r-forge.r-project.org")
library(zoo)
z <- read.zoo(DF, split = 2, index = 3, FUN = identity)
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE)
newDF <- lapply(1:nrow(r), function(i)
prop.table(na.omit(matrix(r[i,], nc = 4, byrow = TRUE,
dimnames = list(unique(DF$B), names(DF)[-2:-3]))[, -1]), 1))
names(newDF) <- time(z)
lapply(mats, function(mat) tcrossprod(mat / sqrt(rowSums(mat^2))))
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- View this message in context: http://r.789695.n4.nabble.com/Yearly-aggregates-and-matrices-tp3438140p3478997.html Sent from the R help mailing list archive at Nabble.com.
On Wed, Apr 27, 2011 at 2:03 PM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:
Hi, Is there an alternative to "z <- read.zoo(DF, split = 2, index = 3, FUN = identity)" and "r <- rollapply(z, 3, ?sum.na, align = "right", partial = TRUE)"? I am trying to use the following script in which the split data (B) contains about 300000 unique cases and obviously I am getting an allocation error. Thanks!
You could test the speed of this to see if its faster:
library(reshape2)
library(zoo)
mm <- melt(DF, id = c("B", "C"))
aa <- acast(mm, C ~ B + variable, FUN = sum)
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(aa, 3, sum.na, align = "right", partial = TRUE)
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
Thanks, but it did not really improve the speed. Why is it that when I change
the layout of the matrix (which does not give the required results), the
speed increases tremendously? So:
library(reshape2)
library(zoo)
z <- read.zoo(DF, split = 3, index = 2, FUN = identity) # Split on 3 and
index on 2 instead of vice versa
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE)
or
mm <- melt(DF, id = c("B", "C"))
aa <- acast(mm, B ~ C + variable, FUN = sum) # B ~ C instead of C ~ B
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(aa, 3, sum.na, align = "right", partial = TRUE)
Thanks!
Gabor Grothendieck wrote:
On Wed, Apr 27, 2011 at 2:03 PM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:
Hi, Is there an alternative to "z <- read.zoo(DF, split = 2, index = 3, FUN = identity)" and "r <- rollapply(z, 3, ?sum.na, align = "right", partial = TRUE)"? I am trying to use the following script in which the split data (B) contains about 300000 unique cases and obviously I am getting an allocation error. Thanks!
You could test the speed of this to see if its faster:
library(reshape2)
library(zoo)
mm <- melt(DF, id = c("B", "C"))
aa <- acast(mm, C ~ B + variable, FUN = sum)
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(aa, 3, sum.na, align = "right", partial = TRUE)
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- View this message in context: http://r.789695.n4.nabble.com/Yearly-aggregates-and-matrices-tp3438140p3482174.html Sent from the R help mailing list archive at Nabble.com.
On Thu, Apr 28, 2011 at 4:49 PM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:
Thanks, but it did not really improve the speed. Why is it that when I change
the layout of the matrix (which does not give the required results), the
speed increases tremendously? So:
library(reshape2)
library(zoo)
z <- read.zoo(DF, split = 3, index = 2, FUN = identity) # Split on 3 and
index on 2 instead of vice versa
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(z, 3, ?sum.na, align = "right", partial = TRUE)
or
mm <- melt(DF, id = c("B", "C"))
aa <- acast(mm, B ~ C + variable, FUN = sum) # B ~ C instead of C ~ B
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(aa, 3, ?sum.na, align = "right", partial = TRUE)
For me it makes little difference:
system.time(for(i in 1:100) read.zoo(DF, split = 3, index = 2, FUN = identity))
user system elapsed 1.72 0.00 1.71
system.time(for(i in 1:100) read.zoo(DF, split = 2, index = 3, FUN = identity))
user system elapsed 1.75 0.00 1.74
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
On Thu, Apr 28, 2011 at 10:13 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
On Thu, Apr 28, 2011 at 4:49 PM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:
Thanks, but it did not really improve the speed. Why is it that when I change
the layout of the matrix (which does not give the required results), the
speed increases tremendously? So:
library(reshape2)
library(zoo)
z <- read.zoo(DF, split = 3, index = 2, FUN = identity) # Split on 3 and
index on 2 instead of vice versa
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(z, 3, ?sum.na, align = "right", partial = TRUE)
or
mm <- melt(DF, id = c("B", "C"))
aa <- acast(mm, B ~ C + variable, FUN = sum) # B ~ C instead of C ~ B
sum.na <- function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA
r <- rollapply(aa, 3, ?sum.na, align = "right", partial = TRUE)
For me it makes little difference:
system.time(for(i in 1:100) read.zoo(DF, split = 3, index = 2, FUN = identity))
? user ?system elapsed ? 1.72 ? ?0.00 ? ?1.71
system.time(for(i in 1:100) read.zoo(DF, split = 2, index = 3, FUN = identity))
? user ?system elapsed ? 1.75 ? ?0.00 ? ?1.74
and here it is with rollapply included:
system.time(for(i in 1:100) { read.zoo(DF, split = 3, index = 2, FUN = identity)
+ r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE)}) user system elapsed 14.74 0.00 14.76 .
system.time(for(i in 1:100) {read.zoo(DF, split = 2, index = 3, FUN = identity)
+ r <- rollapply(z, 3, sum.na, align = "right", partial = TRUE)}) user system elapsed 14.72 0.00 14.74
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com