Thanks to Martin Maechler for his comments, advice and for pointing
out the speed problem. Thanks also to Ben Bolker for tests of speed,
which confirm that for small arrays, a slow down by a factor of about
1.2 - 1.5 may occur. Now, I would like to present a new version of sweep,
which is simpler and has an option to avoid the test. This is expected
to be used in scripts, where the programmer is quite sure that the
usage is correct and speed is required. The new version differs from
the previous one in the following:
1. The option check.margin has a different meaning. It defaults to TRUE
and it determines whether the test is performed or not.
2. Since check.margin has the meaning above, it cannot be used
to select, which test should be performed. This depends on the
type of STATS. The suggested sweep function contains two tests:
- a vector test by Heather Turner, which is used, if STATS
has no dim attribute and, hence, is a vector (STATS should
not be anything else than a vector or an array)
- an array test used if STATS has dim attribute.
The vector test allows some kinds of recycling, while the array test
does not. Hence, in the most common case, where x is a matrix
and STATS is a vector, if the user likes to be warned if the length
of the vector is not exactly the right one, the following call is
suggested: sweep(x,MARGIN,as.array(STATS)). Otherwise, a warning
will be generated only if length(STATS) does not divide the specified
dimension of x, which is nrow(x) (MARGIN=1) or ncol(x) (MARGIN=2).
3. If STATS is an array, then the test is more restrictive than in
the previous version. It is now required that after deleting
dimensions with one level, the remaining dimensions coincide.
The previous version allowed additionally the cases, when dim(STATS)
is a prefix of dim(x)[MARGIN], for example, if dim(STATS) = k1 and
dim(x)[MARGIN] = c(k1,k2).
The code of the tests in the suggested sweep is based on the previous suggestions
https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html by Robin Hankin
https://stat.ethz.ch/pipermail/r-help/2005-June/074001.html by Heather Turner
https://stat.ethz.ch/pipermail/r-devel/2007-June/046217.html by Ben Bolker
with some further modifications.
The modification of sweep.Rd was prepared by Ben Bolker and me.
I would like to encourage everybody who likes to express his opinion
on the patch to do it now. In my opinion, the suggestion of the
new code stabilized in the sense that I will not modify it unless
there is a negative feedback.
A patch against R-devel_2007-08-06 is attached. It contains tabs. If they
are corrupted by email transfer, use the link
http://www.cs.cas.cz/~savicky/R-devel/patch-sweep
which is an identical copy.
Petr Savicky.
================================================================================================
--- R-devel_2007-08-06/src/library/base/R/sweep.R 2007-07-27 17:51:13.000000000 +0200
+++ R-devel_2007-08-06-sweep/src/library/base/R/sweep.R 2007-08-07 10:30:12.383672960 +0200
@@ -14,10 +14,29 @@
# A copy of the GNU General Public License is available at
# http://www.r-project.org/Licenses/
-sweep <- function(x, MARGIN, STATS, FUN = "-", ...)
+sweep <- function(x, MARGIN, STATS, FUN = "-", check.margin=TRUE, ...)
{
FUN <- match.fun(FUN)
dims <- dim(x)
+ if (check.margin) {
+ dimmargin <- dims[MARGIN]
+ dimstats <- dim(STATS)
+ lstats <- length(STATS)
+ if (lstats > prod(dimmargin)) {
+ warning("length of STATS greater than the extent of dim(x)[MARGIN]")
+ } else if (is.null(dimstats)) { # STATS is a vector
+ cumDim <- c(1, cumprod(dimmargin))
+ upper <- min(cumDim[cumDim >= lstats])
+ lower <- max(cumDim[cumDim <= lstats])
+ if (upper %% lstats != 0 || lstats %% lower != 0)
+ warning("STATS does not recycle exactly across MARGIN")
+ } else {
+ dimmargin <- dimmargin[dimmargin > 1]
+ dimstats <- dimstats[dimstats > 1]
+ if (length(dimstats) != length(dimmargin) || any(dimstats != dimmargin))
+ warning("length(STATS) or dim(STATS) do not match dim(x)[MARGIN]")
+ }
+ }
perm <- c(MARGIN, (1:length(dims))[ - MARGIN])
FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
}
--- R-devel_2007-08-06/src/library/base/man/sweep.Rd 2007-07-27 17:51:35.000000000 +0200
+++ R-devel_2007-08-06-sweep/src/library/base/man/sweep.Rd 2007-08-07 10:29:45.517757200 +0200
@@ -11,7 +11,7 @@
statistic.
}
\usage{
-sweep(x, MARGIN, STATS, FUN="-", \dots)
+sweep(x, MARGIN, STATS, FUN="-", check.margin=TRUE, \dots)
}
\arguments{
\item{x}{an array.}
@@ -22,8 +22,18 @@
case of binary operators such as \code{"/"} etc., the function name
must backquoted or quoted. (\code{FUN} is found by a call to
\code{\link{match.fun}}.)}
+ \item{check.margin}{logical. If \code{TRUE} (the default), warn if the
+ length or dimensions of \code{STATS} do
+ not match the specified dimensions of \code{x}.}
\item{\dots}{optional arguments to \code{FUN}.}
}
+\details{
+ The consistency check among \code{STATS}, \code{MARGIN} and \code{x}
+ is stricter if \code{STATS} is an array than if it is a vector.
+ In the vector case, some kinds of recycling are allowed without a
+ warning. Use \code{sweep(x,MARGIN,as.array(STATS))} if \code{STATS}
+ is a vector and you want to be warned if any recycling occurs.
+}
\value{
An array with the same shape as \code{x}, but with the summary
statistics swept out.
sweep sanity checking?
2 messages · Petr Savicky, Heather Turner
14 days later
Petr Savicky kindly brought this thread to my attention as I'm afraid it had passed me by. As one of the contributors to the earlier discussion on adding warnings to sweep I would like to give my support to Petr's proposed patch. For the record I should say that Petr was right to point out that the use of MARGIN in my examples did not make sense https://stat.ethz.ch/pipermail/r-devel/2007-July/046487.html so I have no quibble with that. I think it is sensible too, to use the dim attribute of STATS as the basis of the test, when the dim attribute is present. This provides a way to control the strength of the test in the case of sweeping out a vector, as Petr describes in his message below. I think that the proposed patch successfully brings together the different views on what should be tested, which was the stumbling block last time around https://stat.ethz.ch/pipermail/r-help/2005-June/074037.html Even if people set check.margin = FALSE for reasons of speed, this in itself should be a useful check, since they will need to be confident that the test is unnecessary. Heather -----Original Message----- From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Petr Savicky Sent: 08 August 2007 07:54 To: r-devel at r-project.org Subject: Re: [Rd] sweep sanity checking? Thanks to Martin Maechler for his comments, advice and for pointing out the speed problem. Thanks also to Ben Bolker for tests of speed, which confirm that for small arrays, a slow down by a factor of about 1.2 - 1.5 may occur. Now, I would like to present a new version of sweep, which is simpler and has an option to avoid the test. This is expected to be used in scripts, where the programmer is quite sure that the usage is correct and speed is required. The new version differs from the previous one in the following: 1. The option check.margin has a different meaning. It defaults to TRUE and it determines whether the test is performed or not. 2. Since check.margin has the meaning above, it cannot be used to select, which test should be performed. This depends on the type of STATS. The suggested sweep function contains two tests: - a vector test by Heather Turner, which is used, if STATS has no dim attribute and, hence, is a vector (STATS should not be anything else than a vector or an array) - an array test used if STATS has dim attribute. The vector test allows some kinds of recycling, while the array test does not. Hence, in the most common case, where x is a matrix and STATS is a vector, if the user likes to be warned if the length of the vector is not exactly the right one, the following call is suggested: sweep(x,MARGIN,as.array(STATS)). Otherwise, a warning will be generated only if length(STATS) does not divide the specified dimension of x, which is nrow(x) (MARGIN=1) or ncol(x) (MARGIN=2). 3. If STATS is an array, then the test is more restrictive than in the previous version. It is now required that after deleting dimensions with one level, the remaining dimensions coincide. The previous version allowed additionally the cases, when dim(STATS) is a prefix of dim(x)[MARGIN], for example, if dim(STATS) = k1 and dim(x)[MARGIN] = c(k1,k2). The code of the tests in the suggested sweep is based on the previous suggestions https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html by Robin Hankin https://stat.ethz.ch/pipermail/r-help/2005-June/074001.html by Heather Turner https://stat.ethz.ch/pipermail/r-devel/2007-June/046217.html by Ben Bolker with some further modifications. The modification of sweep.Rd was prepared by Ben Bolker and me. I would like to encourage everybody who likes to express his opinion on the patch to do it now. In my opinion, the suggestion of the new code stabilized in the sense that I will not modify it unless there is a negative feedback. A patch against R-devel_2007-08-06 is attached. It contains tabs. If they are corrupted by email transfer, use the link http://www.cs.cas.cz/~savicky/R-devel/patch-sweep which is an identical copy. Petr Savicky. ======================================================================== ======================== --- R-devel_2007-08-06/src/library/base/R/sweep.R 2007-07-27 17:51:13.000000000 +0200 +++ R-devel_2007-08-06-sweep/src/library/base/R/sweep.R 2007-08-07 10:30:12.383672960 +0200 @@ -14,10 +14,29 @@ # A copy of the GNU General Public License is available at # http://www.r-project.org/Licenses/ -sweep <- function(x, MARGIN, STATS, FUN = "-", ...) +sweep <- function(x, MARGIN, STATS, FUN = "-", check.margin=TRUE, ...) { FUN <- match.fun(FUN) dims <- dim(x) + if (check.margin) { + dimmargin <- dims[MARGIN] + dimstats <- dim(STATS) + lstats <- length(STATS) + if (lstats > prod(dimmargin)) { + warning("length of STATS greater than the extent of dim(x)[MARGIN]") + } else if (is.null(dimstats)) { # STATS is a vector + cumDim <- c(1, cumprod(dimmargin)) + upper <- min(cumDim[cumDim >= lstats]) + lower <- max(cumDim[cumDim <= lstats]) + if (upper %% lstats != 0 || lstats %% lower != 0) + warning("STATS does not recycle exactly across MARGIN") + } else { + dimmargin <- dimmargin[dimmargin > 1] + dimstats <- dimstats[dimstats > 1] + if (length(dimstats) != length(dimmargin) || any(dimstats != dimmargin)) + warning("length(STATS) or dim(STATS) do not match dim(x)[MARGIN]") + } + } perm <- c(MARGIN, (1:length(dims))[ - MARGIN]) FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...) } --- R-devel_2007-08-06/src/library/base/man/sweep.Rd 2007-07-27 17:51:35.000000000 +0200 +++ R-devel_2007-08-06-sweep/src/library/base/man/sweep.Rd 2007-08-07 10:29:45.517757200 +0200 @@ -11,7 +11,7 @@ statistic. } \usage{ -sweep(x, MARGIN, STATS, FUN="-", \dots) +sweep(x, MARGIN, STATS, FUN="-", check.margin=TRUE, \dots) } \arguments{ \item{x}{an array.} @@ -22,8 +22,18 @@ case of binary operators such as \code{"/"} etc., the function name must backquoted or quoted. (\code{FUN} is found by a call to \code{\link{match.fun}}.)} + \item{check.margin}{logical. If \code{TRUE} (the default), warn if the + length or dimensions of \code{STATS} do + not match the specified dimensions of \code{x}.} \item{\dots}{optional arguments to \code{FUN}.} } +\details{ + The consistency check among \code{STATS}, \code{MARGIN} and \code{x} + is stricter if \code{STATS} is an array than if it is a vector. + In the vector case, some kinds of recycling are allowed without a + warning. Use \code{sweep(x,MARGIN,as.array(STATS))} if \code{STATS} + is a vector and you want to be warned if any recycling occurs. +} \value{ An array with the same shape as \code{x}, but with the summary statistics swept out. ______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel