Skip to content

aggregate function with a dataframe for both "x" and "by"

3 messages · Eva Powers, Dennis Murphy, David Winsemius

#
Hi:

It's a little tricky to read in a data frame 'by hand' without making
NA a default missing value; you've got to trick it a bit. I'm doing
this inefficiently, but if you have the two 'real' data sets stored in
separate files, read.table() is the way to go since it provides an
option for defining the form of the missing values. data.frame()
doesn't have that option. To that end,

mydata <- data.frame(testvar1=c(1,3,5,7,8,3,5,NA,4,5,7,9),
                    testvar2=c(11,33,55,77,88,33,55,NA,44,55,77,99))
mybys <- data.frame(mbn1=c('red','blue',1,2,NA,'big',1,2,'red',1,NA,12),
                    mbn2=c('wet','dry',99,95,NA,'damp',95,99,'red',99,NA,NA),
                    stringsAsFactors = FALSE)
mybys

# You can tell that NA is a missing value since NA is printed as <NA>
(character NA).
# This is not the method you want, but the following 'works':
lines <- "
   mbn1 mbn2
  red  wet
 blue  dry
    1   99
    2   95
   NA   NA
  big damp
    1   95
    2   99
  red  red
    1   99
   NA   NA
   12   NA"
mybys <- read.table(textConnection(lines),
                    stringsAsFactors = FALSE, na.strings = "")
closeAllConnections()

# Now mybys treats NA as a character string.
# Moral: if NA is a legitimate value, keep the data in an external file
# and read it in with read.table(), using the na.strings = argument
# to specify an alternative missing value string. You don't need to
# use textConnection() as I did if you have the data in an external file.

Next problem: mydata has 12 rows, mybys has 13. I combined the two
data frames with cbind() using only the first 12 rows of mybys and
then used the ddply() function from the plyr package to do the
groupwise summation in addition to aggregate(). [Several other
packages would also work here, including doBy and data.table.] The
results are slightly different.

# Combined data:
myd <- cbind(mydata, mybys[1:12, ])

# Summation function with na.rm = TRUE:
# Will return 0 if all values of x are NA
sfun <- function(x) sum(x, na.rm = TRUE)

# aggregate() version:
aggregate(cbind(testvar1, testvar2) ~ V1 + V2, data = myd, FUN = sfun)
    V1   V2 testvar1 testvar2
1    2   95        8       88
2    1   99       14      154
3    2   99        4       44
4  big damp        5       55
5 blue  dry        5       55
6 mbn1 mbn2        1       11
7   NA   NA       12      132
8  red  red        5       55
9  red  wet        3       33

# ddply() version:
library('plyr')
ddply(myd, .(V1, V2), colwise(sfun, c('testvar1', 'testvar2')))
     V1   V2 testvar1 testvar2
1     1   95        0        0
2     1   99       14      154
3     2   95        8       88
4     2   99        4       44
5   big damp        5       55
6  blue  dry        5       55
7  mbn1 mbn2        1       11
8    NA   NA       12      132
9   red  red        5       55
10  red  wet        3       33

Hope this is what you were after.
Dennis
On Wed, Oct 5, 2011 at 4:45 PM, Eva Powers <evap4442 at gmail.com> wrote:
#
On Oct 5, 2011, at 7:45 PM, Eva Powers wrote:

            
corresponding?
David Winsemius, MD
West Hartford, CT