
RFC: sapply() limitation from vector to matrix, but not further

10 messages · Marc Schwartz, Hadley Wickham, William Dunlap +3 more

#
sapply() stems from S / S+ times and hence has a long tradition.
In spite of that I think that it should be enhanced...

As the subject mentions, sapply() produces a matrix in cases
where the list components of the lapply(.) results are of the
same length (and ...).
However, it unfortunately "stops there".
E.g., if you *nest* two sapply() calls where the inner one
produces a matrix, very often the logical behavior would be for
the outer sapply() to stack these matrices into an array of 
rank 3 ["array rank"(x) := length(dim(x))].
However, it does not do that, e.g., in an artificial example

p0 <- function(...) paste(..., sep="")
myF <- function(x,y) {
    stopifnot(length(x) <= 3)
    x <- rep(x, length.out=3)
    ny <- length(y)
    r <- outer(x,y)
    dimnames(r) <- list(p0("r",1:3), p0("C", seq_len(ny)))
    r
}

and a named vector 'v', which prints as
 A  B  C  D 
50 60 70 80 

if we let sapply() not simplify, we see the list of same size
matrices it produces:
> sapply(v, myF, y = 2*(1:5), simplify = FALSE)
$A
    C1  C2  C3  C4  C5
r1 100 200 300 400 500
r2 100 200 300 400 500
r3 100 200 300 400 500

$B
    C1  C2  C3  C4  C5
r1 120 240 360 480 600
r2 120 240 360 480 600
r3 120 240 360 480 600

$C
    C1  C2  C3  C4  C5
r1 140 280 420 560 700
r2 140 280 420 560 700
r3 140 280 420 560 700

$D
    C1  C2  C3  C4  C5
r1 160 320 480 640 800
r2 160 320 480 640 800
r3 160 320 480 640 800

However, quite deceptively
> sapply(v, myF, y = 2*(1:5))
        A   B   C   D
 [1,] 100 120 140 160
 [2,] 100 120 140 160
 [3,] 100 120 140 160
 [4,] 200 240 280 320
 [5,] 200 240 280 320
 [6,] 200 240 280 320
 [7,] 300 360 420 480
 [8,] 300 360 420 480
 [9,] 300 360 420 480
[10,] 400 480 560 640
[11,] 400 480 560 640
[12,] 400 480 560 640
[13,] 500 600 700 800
[14,] 500 600 700 800
[15,] 500 600 700 800
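
The flattening is easy to reproduce. A minimal sketch, assuming v is the named vector printed above (its construction here is rebuilt from the printed values and is not the original code):

```r
p0 <- function(...) paste(..., sep = "")
myF <- function(x, y) {
    stopifnot(length(x) <= 3)
    x <- rep(x, length.out = 3)
    ny <- length(y)
    r <- outer(x, y)
    dimnames(r) <- list(p0("r", 1:3), p0("C", seq_len(ny)))
    r
}
# Assumption: v rebuilt from its printed values.
v <- c(A = 50, B = 60, C = 70, D = 80)

# Default simplification unrolls each 3 x 5 matrix into one column.
flat <- sapply(v, myF, y = 2 * (1:5))
dim(flat)        # 15 rows, 4 columns
rownames(flat)   # NULL: the r1..r3 / C1..C5 labels are gone
```

Each column is one matrix unrolled column by column, so row 4 of the result is element [r1, C2] of the corresponding matrix.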


My proposal -- implemented and "make check" tested --
is to add an optional argument  'ARRAY'
which allows
> sapply(v, myF, y = 2*(1:5), ARRAY=TRUE)
, , A

    C1  C2  C3  C4  C5
r1 100 200 300 400 500
r2 100 200 300 400 500
r3 100 200 300 400 500

, , B

    C1  C2  C3  C4  C5
r1 120 240 360 480 600
r2 120 240 360 480 600
r3 120 240 360 480 600

, , C

    C1  C2  C3  C4  C5
r1 140 280 420 560 700
r2 140 280 420 560 700
r3 140 280 420 560 700

, , D

    C1  C2  C3  C4  C5
r1 160 320 480 640 800
r2 160 320 480 640 800
r3 160 320 480 640 800
-----------
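
The stacking shown above can be sketched with plain array() on the lapply() result. This is only an illustration of the intended outcome, not the proposed patch; v is again rebuilt from its printed values, and myF is a compact equivalent of the function defined earlier.

```r
# Compact stand-in for myF() from the thread (same result and dimnames).
myF <- function(x, y) {
    r <- outer(rep(x, length.out = 3), y)
    dimnames(r) <- list(paste0("r", 1:3), paste0("C", seq_along(y)))
    r
}
v <- c(A = 50, B = 60, C = 70, D = 80)   # assumption: rebuilt from the printout

lst <- lapply(v, myF, y = 2 * (1:5))

# Append length(lst) as a new last dimension and keep all dimnames.
arr <- array(unlist(lst, use.names = FALSE),
             dim = c(dim(lst[[1L]]), length(lst)),
             dimnames = c(dimnames(lst[[1L]]), list(names(lst))))
dim(arr)   # 3 5 4
```

Because array() fills in column-major order, arr[, , "B"] is exactly the matrix lst$B.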

In the best of all worlds, the default would be 'ARRAY = TRUE',
but of course, given the long-standing different behavior,
it seems much too "risky", and my proposal includes remaining
back-compatible with default ARRAY = FALSE.

Martin Maechler,
ETH Zurich
#
On Dec 1, 2010, at 2:39 AM, Martin Maechler wrote:

            
Seems to me to be a reasonable proposal Martin, obviously with the proviso that the current default behavior is unaltered, as you note.

Regards,

Marc
#
I think an even better approach would be to extract the
"simplification" component out of sapply, so that you could write

sapply <- function(...) simplify(lapply(...))

(although obviously some arguments would go to lapply and some to simplify).

The advantage of this would be that you could use the same
simplification algorithm in other places.
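
A rough sketch of that decomposition, with hypothetical names (simplify_list, my_sapply and my_mapply are illustrations, not existing functions) and a deliberately reduced rule set: scalars become a vector, equal-length results become a matrix, anything else stays a list.

```r
# Hypothetical shared simplifier.
simplify_list <- function(x) {
    lens <- unique(vapply(x, length, integer(1)))
    if (length(lens) != 1L) return(x)                 # ragged or empty: keep list
    if (lens == 1L) return(unlist(x, recursive = FALSE))
    # equal length > 1: one column per list element
    matrix(unlist(x, recursive = FALSE), nrow = lens,
           dimnames = list(names(x[[1L]]), names(x)))
}

# Two front ends reusing the same simplification step.
my_sapply <- function(X, FUN, ...) simplify_list(lapply(X, FUN, ...))
my_mapply <- function(FUN, ...)    simplify_list(Map(FUN, ...))

my_sapply(1:3, function(i) c(i, i^2))        # 2 x 3 matrix
my_mapply(function(a, b) a + b, 1:3, 4:6)    # integer vector
```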

Hadley

On Wed, Dec 1, 2010 at 8:39 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:

  
    
#
A downside of that approach is that lapply(X,...) can
cause a lot of unneeded memory to be allocated (length(X)
SEXP's).  Those SEXP's would be tossed out by simplify() but
the peak memory usage would remain high.  sapply() can
be written to avoid the intermediate list structure.

vapply() can avoid the intermediate list structure because
it knows what the output of FUN will look like and can
put the results directly into the desired output structure.
Perhaps its processing of the FUN.VALUE argument could be
beefed up so that matrices would be stacked as you want.
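
For reference, a small illustration of how FUN.VALUE acts as a result template in vapply(), using only a plain vector template:

```r
# Every call must return a double vector of length 2.
res <- vapply(1:3, function(i) c(i, i^2), FUN.VALUE = numeric(2))
dim(res)   # 2 3

# A result that does not fit the template is an error rather than a list.
bad <- tryCatch(
    vapply(1:3, function(i) seq_len(i), FUN.VALUE = numeric(2)),
    error = function(e) "error"
)
```

Because the shape of each result is known in advance, the output can be allocated once and filled in place.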

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
But the upside is reusable code that can be used in multiple places -
what about the simplification code used by mapply and tapply? Why are
there three different implementations of simplification?

Hadley
26 days later
#
Finally finding time to come back to this.
Remember that I've started the thread by proposing a version of sapply()
which does not just "stop" with making a matrix() from the lapply() result, but
instead --- only when the new argument ARRAY = TRUE is set ---
may return an array() of any (appropriate) order, in those cases where
the lapply() result elements all return an array of the same dim().
On Wed, Dec 1, 2010 at 19:51, Hadley Wickham <hadley at rice.edu> wrote:
I have now looked into using a version of what Hadley had proposed.
Note (to Bill's point) that the current implementation of sapply()
does go via lapply() and
that we have  vapply()  as a faster version of sapply()  with less
copying (hopefully).

Very unfortunately, vapply() .. which was only created 13 months ago,
has inherited the ``illogical''  behavior of  sapply()
in that it does not make up higher rank arrays if the single element
is already a matrix (say).
...
Consequently, we also need a patch to vapply(),
and I do wonder if we should not make "ARRAY=TRUE" the default there,
since with vapply() you specify a result value, and if you specify a
matrix, the total result should stack these matrices into an array of
rank 3, etc.
Looking at it, the patch is not so much work... notably if we don't
use a new argument but really let  FUN.VALUE determine what the result
should look like.
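
In R releases that postdate this thread, the vapply() documentation describes exactly this: an array-valued FUN.VALUE gives a result with dim c(dim(FUN.VALUE), length(X)). A sketch assuming such a release (the R version under discussion here behaved differently):

```r
tmpl <- matrix(0, nrow = 2, ncol = 2)   # result prototype with a dim()
res  <- vapply(1:3,
               function(i) matrix(as.numeric(i), nrow = 2, ncol = 2),
               FUN.VALUE = tmpl)
dim(res)   # one extra dimension of length 3 is appended
```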

More comments are still welcome...
Martin
#
On Wed, Dec 1, 2010 at 3:39 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
It would reduce the proliferation of arguments if the simplify=
argument were extended to allow this, e.g. simplify = "array" or
perhaps simplify = n would allow a maximum of n dimensions.
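
For illustration, this is how the first variant reads in R releases whose sapply() accepts simplify = "array" (an assumption about the reader's R version; it was not available when this was written):

```r
f <- function(i) matrix(i, nrow = 2, ncol = 2)

flat <- sapply(1:3, f)                       # default: each matrix unrolled
arr  <- sapply(1:3, f, simplify = "array")   # matrices stacked
dim(flat)   # 4 3
dim(arr)    # 2 2 3
```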
#
> On Wed, Dec 1, 2010 at 3:39 AM, Martin Maechler
> <maechler at stat.math.ethz.ch> wrote:
>> My proposal -- implemented and "make check" tested -- is
    >> to add an optional argument 'ARRAY' which allows
    >> 
    >>> sapply(v, myF, y = 2*(1:5), ARRAY=TRUE)

    > It would reduce the proliferation of arguments if the
    > simplify= argument were extended to allow this,
    > e.g. simplify = "array" or perhaps simplify = n would
    > allow a maximum of n dimensions.

That's a good idea, though it makes the
implementation/documentation very slightly more complicated.

I'm interested to get more feedback on my other questions,
notably the one about *changing*  vapply() (on the C-level) to
behave "logically" in the sense of adding one  dim(.)ension in
those cases where the FUN.VALUE (result prototype) has a dim().


Martin
#
The abind() function from the abind package is an alternative here -- it can take a list argument, which makes it easy to use with the result of lapply().  It's also able to take direction about which dimension to join on.

 > x <- list(a=1,b=2,c=3)
 > f <- function(v) matrix(v, nrow=2, ncol=4)
 > sapply(x, f)
     a b c
[1,] 1 2 3
[2,] 1 2 3
[3,] 1 2 3
[4,] 1 2 3
[5,] 1 2 3
[6,] 1 2 3
[7,] 1 2 3
[8,] 1 2 3
 >
 > # The 'along=' argument to abind() determines on which dimension
 > # the list elements are joined.  Use a fractional value to put the new
 > # dimension between existing ones.
 >
 > dim(abind(lapply(x, f), along=0))
[1] 3 2 4
 > dim(abind(lapply(x, f), along=1.5))
[1] 2 3 4
 > dim(abind(lapply(x, f), along=3))
[1] 2 4 3
 > abind(lapply(x, f), along=3)
, , a

     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1

, , b

     [,1] [,2] [,3] [,4]
[1,]    2    2    2    2
[2,]    2    2    2    2

, , c

     [,1] [,2] [,3] [,4]
[1,]    3    3    3    3
[2,]    3    3    3    3

 >
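
For readers without the abind package, the same dimension orderings can be sketched in base R by stacking along the last dimension and then permuting with aperm(). This is an illustration of the resulting shapes only, not of how abind() works internally.

```r
x <- list(a = 1, b = 2, c = 3)
f <- function(v) matrix(v, nrow = 2, ncol = 4)
lst <- lapply(x, f)

# New dimension last (the shape abind(..., along=3) reports above).
last <- array(unlist(lst, use.names = FALSE),
              dim = c(2, 4, length(lst)),
              dimnames = list(NULL, NULL, names(lst)))

# Move the new dimension to the front or into the middle.
first  <- aperm(last, c(3, 1, 2))   # dim 3 2 4, as with along=0
middle <- aperm(last, c(1, 3, 2))   # dim 2 3 4, as with along=1.5
```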
On 12/28/2010 8:49 AM, Martin Maechler wrote:
#
On Tue, Dec 28, 2010 at 19:14, Tony Plate <tplate at acm.org> wrote:
Thank you, Tony.
Indeed, yes,  abind() is nice here (and in the good ol' APL spirit !)

Of course, I want to keep things both simple *and* fast here,
hence I currently contemplate the following code,
where the new  simplify2array()  is  considerably simpler than  abind():

##' "Simplify" a list of commonly structured components into an array.
##'
##' @title simplify list() to an array if the list elements are
##'  structurally equal
##' @param x a list, typically resulting from lapply()
##' @param higher logical indicating if an array() of "higher rank"
##'  should be returned when appropriate, namely when all elements of
##' \code{x} have the same \code{\link{dim}()}ension.
##' @return x itself, or an array if the simplification "is sensible"
simplify2array <- function(x, higher = TRUE)
{
    if(length(common.len <- unique(unlist(lapply(x, length)))) > 1L)
        return(x)
    if(common.len == 1L)
        unlist(x, recursive = FALSE)
    else if(common.len > 1L) {
        n <- length(x)
        ## make sure that array(*) will not call rep() {e.g. for 'call's}:
        r <- as.vector(unlist(x, recursive = FALSE))
        if(higher && length(c.dim <- unique(lapply(x, dim))) == 1 &&
           is.numeric(c.dim <- c.dim[[1L]]) &&
           prod(d <- c(c.dim, n)) == length(r)) {

            iN1 <- is.null(n1 <- dimnames(x[[1L]]))
            n2 <- names(x)
            dnam <-
                if(!(iN1 && is.null(n2)))
                    c(if(iN1) rep.int(list(n1), length(c.dim)) else n1,
                      list(n2)) ## else NULL
            array(r, dim = d, dimnames = dnam)

        } else if(prod(d <- c(common.len, n)) == length(r))
            array(r, dim = d,
                  dimnames= if(!(is.null(n1 <- names(x[[1L]])) &
                  is.null(n2 <- names(x)))) list(n1,n2))
        else x
    }
    else x
}

sapply <- function(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
{
    FUN <- match.fun(FUN)
    answer <- lapply(X, FUN, ...)
    if(USE.NAMES && is.character(X) && is.null(names(answer)))
	names(answer) <- X
    if(!identical(simplify, FALSE) && length(answer))
	simplify2array(answer, higher = (simplify == "array"))
    else answer
}
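
Recent R releases ship a simplify2array() in base with the same 'higher' argument. Assuming such a release, the effect of the flag can be checked directly on an lapply() result:

```r
lst <- lapply(c(a = 1, b = 2, c = 3),
              function(v) matrix(v, nrow = 2, ncol = 4))

flat <- simplify2array(lst, higher = FALSE)   # matrices unrolled into columns
arr  <- simplify2array(lst, higher = TRUE)    # matrices stacked into an array
dim(flat)   # 8 3
dim(arr)    # 2 4 3

# Elements of different lengths are returned unchanged.
ragged <- list(1:2, 1:3)
same <- identical(simplify2array(ragged), ragged)
```

The names of the list end up as the dimnames of the last dimension in both cases.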