A suggestion for an amendment to tapply - R-devel

Mon, Nov 5, 2007 9:10 PM

Dear R-developers,

when tapply() is invoked on factors that have empty levels, it returns
NA.  This behaviour is in accord with the tapply documentation, and is
reasonable in many cases.  However, when FUN is sum, it would also
seem reasonable to return 0 instead of NA, because "the sum of an
empty set is zero, by definition."

I'd like to raise a discussion of the possibility of an amendment to
tapply.

The attached patch changes the function so that it checks if there are
any empty levels, and if there are, replaces the corresponding NA
values with the result of applying FUN to the empty set.  E.g., in the
case of sum, it replaces the NA with 0, whereas with mean, it replaces
the NA with NA, and issues a warning.
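
The difference described above can be seen in a small sketch. The data mirror the toy example in the attached script; the variable names `current`, `empty_sum` and `empty_mean` are illustrative only. Note that the patch passes NULL (the value of c(NULL)) to FUN, which is why mean() yields NA with a warning here.

```r
# A factor with an empty level "2"
group <- factor(c(1, 1, 3, 3), levels = c("1", "2", "3"))
x <- c(1, 2, 3, 4)

# Base tapply() does not call FUN for the empty level; that cell is NA
current <- tapply(x, group, sum)

# What FUN itself returns on empty input: the empty sum is zero
empty_sum <- sum(numeric(0))

# mean() on NULL returns NA and warns that the argument is not numeric
empty_mean <- suppressWarnings(mean(NULL))
```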

This change has the following advantage: tapply and sum work better
together.  Arguably, tapply and any other function that has a non-NA
response to the empty set will also work better together.
Furthermore, tapply shows a warning if FUN would normally show a
warning upon being evaluated on an empty set.  That deviates from
current behaviour, which might be bad, but also provides information
that might be useful to the user, so that would be good.

The attached script provides the new function in full, and
demonstrates its application in some simple test cases.

Best wishes,

Andrew

Andrew Robinson  
Department of Mathematics and Statistics            Tel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia         Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/ 
-------------- next part --------------
## The new function

my.tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx <- length(X)
    one <- as.integer(1)
    group <- rep.int(one, nx)#- to contain the splitting vector
    ngroup <- one
    for (i in seq_along(INDEX)) {
	index <- as.factor(INDEX[[i]])
	if (length(index) != nx)
	    stop("arguments must have same length")
	namelist[[i]] <- levels(index)#- all of them, yes !
	extent[i] <- nlevels(index)
	group <- group + ngroup * (as.integer(index) - one)
	ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(split(X, group), FUN, ...)
    index <- as.numeric(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1)) {
	ansmat <- array(dim=extent, dimnames=namelist)
	ans <- unlist(ans, recursive = FALSE)
    }
    else  {
	ansmat <- array(vector("list", prod(extent)),
			dim=extent, dimnames=namelist)
    }
    ## old : ansmat[as.numeric(names(ans))] <- ans
    names(ans) <- NULL
    ansmat[index] <- ans
    empty <- table(INDEX) < 1    # cells with no observations
    if (any(empty))
        ansmat[empty] <- do.call(FUN, list(c(NULL), ...))
    ansmat
}

## Check its utility

group <- factor(c(1,1,3,3), levels=c("1","2","3"))
x <- c(1,2,3,4)

## Ok with mean?

tapply(x, group, mean)
my.tapply(x, group, mean)

## Ok with sum?

tapply(x, group, sum)
my.tapply(x, group, sum)

## Check that other arguments are carried through

x <- c(NA,2,3,10)

tapply(x, group, sum, na.rm=TRUE)
tapply(x, group, mean, na.rm=TRUE)

my.tapply(x, group, sum, na.rm=TRUE)
my.tapply(x, group, mean, na.rm=TRUE)

## Check that listed groups work ok also

group.2 <- factor(c(1,2,3,3), levels=c("1","2","3"))

tapply(x, list(group, group.2), sum, na.rm=TRUE)
tapply(x, list(group, group.2), mean, na.rm=TRUE)

my.tapply(x, list(group, group.2), sum, na.rm=TRUE)
my.tapply(x, list(group, group.2), mean, na.rm=TRUE)

Bill Venables

Mon, Nov 5, 2007 10:53 PM

Unfortunately I think it would break too much existing code.  tapply()
is an old function and many people have gotten used to the way it works
now.

This is not to suggest there could not be another argument added at the
end to indicate that you want the new behaviour, though.  e.g. 

tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE,
handle.empty.levels = FALSE) 

but this raises the question of what sort of time penalty the
modification might entail.  Probably not much for most situations, I
suppose.  (I know this argument name looks long, but you do need a
fairly specific argument name, or it will start to impinge on the ...
argument.)
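
One way to picture the opt-in argument is a wrapper around the existing function. This is only a sketch under stated assumptions: `tapply_empty` is a hypothetical name, not part of base R, and it assumes FUN returns a single value per group when the result is simplified.

```r
# Hypothetical wrapper (not base R) adding the suggested opt-in flag
tapply_empty <- function(X, INDEX, FUN, ..., simplify = TRUE,
                         handle.empty.levels = FALSE) {
    FUN <- match.fun(FUN)
    ans <- tapply(X, INDEX, FUN, ..., simplify = simplify)
    if (handle.empty.levels) {
        if (!is.list(INDEX)) INDEX <- list(INDEX)
        # Count observations per cell; zero marks an empty combination
        counts <- do.call(table, unname(lapply(INDEX, as.factor)))
        empty <- counts == 0
        if (any(empty)) {
            fill <- FUN(X[0], ...)   # FUN applied to a zero-length vector
            ans[empty] <- if (is.list(ans)) list(fill) else fill
        }
    }
    ans
}

group <- factor(c(1, 1, 3, 3), levels = c("1", "2", "3"))
x <- c(1, 2, 3, 4)
old_style <- tapply_empty(x, group, sum)
new_style <- tapply_empty(x, group, sum, handle.empty.levels = TRUE)
```

With the flag left at its default the result is the same as plain tapply(), so existing code would be unaffected.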

Just some thoughts.

Bill Venables.

Bill Venables
CSIRO Laboratories
PO Box 120, Cleveland, 4163
AUSTRALIA
Office Phone (email preferred): +61 7 3826 7251
Fax (if absolutely necessary):  +61 7 3826 7304
Mobile:                         +61 4 8819 4402
Home Phone:                     +61 7 3286 7700
mailto:Bill.Venables at csiro.au
http://www.cmis.csiro.au/bill.venables/ 

Brian Ripley

Mon, Nov 5, 2007 11:23 PM

On Tue, 6 Nov 2007, Bill.Venables at csiro.au wrote:

It is also not necessarily desirable: FUN(numeric(0)) might be an error.
For example:

Z <- data.frame(x=rnorm(10), f=rep(c("a", "b"), each=5))[1:5, ]
tapply(Z$x, Z$f, sd)

but sd(numeric(0)) is an error.  (Similar things involving var are 'in the 
wild' and so would be broken.)
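
The risk can be illustrated without relying on how any particular R version treats sd() on empty input. In the sketch below, `strict_sd` is a hypothetical stand-in for a FUN that refuses empty input, and `safe_empty` is a hypothetical helper showing one possible fallback; neither is part of R.

```r
# Hypothetical stand-in for a FUN that refuses empty input
strict_sd <- function(x, ...) {
    if (length(x) == 0) stop("'x' is empty")
    sd(x, ...)
}

# Hypothetical helper: the value of FUN on empty input, or NA if FUN fails
safe_empty <- function(FUN, ...) {
    tryCatch(FUN(numeric(0), ...), error = function(e) NA)
}

fill_sum <- safe_empty(sum)        # FUN succeeds on empty input
fill_sd  <- safe_empty(strict_sd)  # FUN errors, so fall back to NA
```

Without such a guard, calling FUN on the empty set would turn code that currently returns NA into code that stops with an error.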

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Andrew Robinson

Tue, Nov 6, 2007 5:43 PM

These are important concerns.  It seems to me that adding an argument
as suggested by Bill will allow the user to side-step the problem
identified by Brian.

Bill, under what kinds of circumstances would you anticipate a
significant time penalty?  I would be happy to check those out with
some simulations.

If the timing seems acceptable, I can write a patch for tapply.R and
tapply.Rd if anyone in the core is willing to consider them.  Please
contact me on or off list if so.

Best wishes to all,

Andrew

On Tue, Nov 06, 2007 at 07:23:56AM +0000, Prof Brian Ripley wrote:

On Tue, 6 Nov 2007, Bill.Venables at csiro.au wrote:

Unfortunately I think it would break too much existing code.  tapply()
is an old function and many people have gotten used to the way it works
now.

It is also not necessarily desirable: FUN(numeric(0)) might be an error.
For example:

Z <- data.frame(x=rnorm(10), f=rep(c("a", "b"), each=5))[1:5, ]
tapply(Z$x, Z$f, sd)

but sd(numeric(0)) is an error.  (Similar things involving var are 'in the 
wild' and so would be broken.)

Peter Dalgaard

Tue, Nov 6, 2007 11:15 PM

Andrew Robinson wrote:

There's another concern: tapply (et al.) has the ... args passed on to 
FUN which means that you have to be really careful with argument names.
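
A small sketch of the hazard, using a hypothetical wrapper `grouped` whose own formal argument happens to share a name with one that FUN expects. The named argument is matched to the wrapper and never reaches FUN through the ... arguments.

```r
# Hypothetical wrapper: its own 'na.rm' formal captures the argument,
# so 'na.rm' is not forwarded to FUN via '...'
grouped <- function(X, INDEX, FUN, ..., na.rm = FALSE) {
    tapply(X, INDEX, FUN, ...)
}

x <- c(NA, 2, 3, 10)
g <- factor(c("a", "a", "b", "b"))

direct  <- tapply(x, g, sum, na.rm = TRUE)    # na.rm reaches sum()
wrapped <- grouped(x, g, sum, na.rm = TRUE)   # na.rm is swallowed
```

Here `direct` has a finite sum for group "a", while `wrapped` has NA for it, which is why a new tapply argument needs a name unlikely to collide with arguments of FUN.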

Could I just interject that we already have

 > airquality$Month <- factor(airquality$Month,levels=4:9) # April not there
 > unlist(lapply(
+    split(airquality$Ozone, airquality$Month, drop=FALSE),sum, na.rm=TRUE))
   4    5    6    7    8    9
   0  614  265 1537 1559  912

(splitting on multiple factors gets a  bit involved, though)
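
For the multi-factor case, one possible approach is sketched below. It assumes FUN returns a single number per cell; the names `g1`, `g2`, `pieces` and `res` are illustrative. It relies on split() ordering the cells with the first factor varying fastest, which matches the column-major layout that array() uses.

```r
g1 <- factor(c(1, 1, 3, 3), levels = c("1", "2", "3"))
g2 <- factor(c(1, 2, 3, 3), levels = c("1", "2", "3"))
x  <- c(1, 2, 3, 4)

# drop = FALSE keeps every combination of levels, including empty ones
pieces <- split(x, list(g1, g2), drop = FALSE)

# FUN is applied to every piece, so empty cells get sum(numeric(0))
sums <- vapply(pieces, sum, numeric(1))

# Reshape into a table-like array, one dimension per factor
res <- array(unname(sums),
             dim = c(nlevels(g1), nlevels(g2)),
             dimnames = list(levels(g1), levels(g2)))
```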

   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Andrew Robinson

Wed, Nov 7, 2007 3:45 PM

On Wed, Nov 07, 2007 at 08:15:17AM +0100, Peter Dalgaard wrote:

For that matter, we have

airquality$Month <- factor(airquality$Month,levels=4:9)
air.sum <- tapply(airquality$Ozone, airquality$Month, sum, na.rm=TRUE)
air.sum[is.na(air.sum)] <- 0

which is equivalent to what I ended up using whilst fiddling with tapply.
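
One caveat with replacing every NA is that it also overwrites NA values that come from the data rather than from empty levels. A possible refinement, sketched here with illustrative toy data, is to fill only the cells that contain no observations, in the same spirit as the table(INDEX) < 1 test in the proposed patch.

```r
f <- factor(c("a", "a", "c", "c"), levels = c("a", "b", "c"))
x <- c(1, NA, 3, 4)

# "a" is NA because of the data, "b" is NA because the level is empty
s <- tapply(x, f, sum)

# Replacing every NA hides the missing value in group "a"
blanket <- s
blanket[is.na(blanket)] <- 0

# Replacing only the cells with no observations leaves that NA alone
targeted <- s
targeted[table(f) == 0] <- 0
```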

Andrew
