
[Bioc-devel] rbind for ExpressionSet objects?

8 messages · Gordon K Smyth, Martin Maechler, Martin Morgan +2 more

#
Hi Martin,

I have only just noticed that the methods package now has generic 
functions rbind2() and cbind2(), which it didn't when the combine() 
function was first created for Biobase.

I think it would be clearer and more elegant to implement rbind2() and 
cbind2() methods for ExpressionSet, and to retire combine() sometime down 
the track (obviously not for the imminent Bioconductor release).  The term 
"combine" is somewhat overused, e.g., it conflicts with the c() function 
in base.

What do you think?

Cheers
Gordon
On Fri, 4 Apr 2008, Martin Morgan wrote:

            
4 days later
#
GS> Hi Martin,
    GS> I have only just noticed that the methods package now has generic 
    GS> functions rbind2() and cbind2(), which it didn't when the combine() 
    GS> function was first created for Biobase.

    GS> I think it would be clearer and more elegant to implement rbind2() and 
    GS> cbind2() methods for ExpressionSet, and to retire combine() sometime down 
    GS> the track (obviously not for the imminent Bioconductor release).  The term 
    GS> "combine" is somewhat overused, e.g., it conflicts with the c() function 
    GS> in base.

    GS> What do you think?

(I'm another 'Martin' but nevertheless .. : )

I'm strongly in favor of providing rbind2() and cbind2()
methods, basing combine() on these for now, and deprecating
combine().

rbind2() and cbind2() were introduced exactly for the
purpose of providing rbind() / cbind() - like methods for S4
objects.

Martin Maechler, ETH Zurich


    GS> Cheers
    GS> Gordon
GS> On Fri, 4 Apr 2008, Martin Morgan wrote:
>> Thanks for the suggestion and examples.
    >> 
    >> I implemented this in Biobase 1.99.5. It is slightly different from the 
    >> version in the beadarraySNP package, in that the content of overlapping 
    >> regions of the exprs arrays have to be identical (beadarraySNP allows NAs in 
    >> the second matrix).
    >> 
    >> The functionality I implemented is consistent with the following tests 
    >> (hopefully self-explanatory).
    >> 
    >> data(sample.ExpressionSet)
    >> obj <- sample.ExpressionSet
    >> 
    >> checkEquals(obj, combine(obj[1:250,], obj[251:500,]))
    >> checkEquals(obj, combine(obj[,1:13], obj[,14:26]))
    >> ## overlapping
    >> checkEquals(obj, combine(obj[1:300,], obj[250:500,]))
    >> checkEquals(obj, combine(obj[,1:20], obj[,15:26]))
    >> 
    >> 
    >> The implementation introduces a combine method for matrices, which is 
    >> consistent with these tests:
    >> 
    >> ## dimnames
    >> m <- matrix(1:20, nrow=5, dimnames=list(LETTERS[1:5], letters[1:4]))
    >> checkEquals(m, combine(m, m))
    >> checkEquals(m, combine(m[1:3,], m[4:5,]))
    >> checkEquals(m, combine(m[,1:3], m[,4, drop=FALSE]))
    >> ## overlap
    >> checkEquals(m, combine(m[1:3,], m[3:5,]))
    >> checkEquals(m, combine(m[,1:3], m[,3:4]))
    >> checkEquals(matrix(c(1:3, NA, NA, 6:8, NA, NA,
    >>                      11:15, NA, NA, 18:20),
    >>                    nrow=5,
    >>                    dimnames=list(LETTERS[1:5], letters[1:4])),
    >>             combine(m[1:3,1:3], m[3:5, 3:4]))
    >> ## row reordering
    >> checkEquals(m[c(1,3,5,2,4),], combine(m[c(1,3,5),], m[c(2,4),]))
    >> ## Exceptions
    >> checkException(combine(m, matrix(0, nrow=5, ncol=4)),
    >> silent=TRUE)         # types differ
    >> checkException(combine(m, matrix(0L, nrow=5, ncol=4)),
    >> silent=TRUE)         # attributes differ
    >> m1 <- matrix(1:20, nrow=5)
    >> checkException(combine(m, m1), silent=TRUE) # dimnames required
    >> 
    >> Please let me know if you had something else in mind, or if there are 
    >> problems with this.
    >> 
    >> Martin
    >>
>> Laurent Gautier wrote:
>>> That would be useful.
    >>> 
    >>> I have been in a situation where it would have been useful, and spent
    >>> some time with combine as well before writing my own ad-hoc solution.
    >>> 
    >>> 
    >>> 
    >>> Laurent
    >>> 
    >>> 
    >>> 2008/4/4, Gordon K Smyth <smyth at wehi.edu.au>:
    >>>> An rbind() method or an rbind-like function for ExpressionSet objects
    >>>> would be useful.  Any plans for such a function?
    >>>> 
    >>>> At the moment, an ExpressionSet object can be subsetted by rows or
    >>>> columns.  Column subsets can be put back together using combine(), but
    >>>> there's no way I think to put row subsets back together.
    >>>> 
    >>>> BTW, the help page for the generic function combine() includes the idea
    >>>> of combining by rows, but this concept is not honoured by the combine
    >>>> method for the eSet class.
    >>>> 
    >>>> Cheers
    >>>> Gordon
    >>>> 
    >>>> _______________________________________________
    >>>> Bioc-devel at stat.math.ethz.ch mailing list
    >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
    >>>> 
    >>> 
    >>> 
    >> 
    >> 
    >> -- 
    >> Martin Morgan
    >> Computational Biology / Fred Hutchinson Cancer Research Center
    >> 1100 Fairview Ave. N.
    >> PO Box 19024 Seattle, WA 98109
    >> 
    >> Location: Arnold Building M2 B169
    >> Phone: (206) 667-2793
    >> 

#
Martin Maechler <maechler at stat.math.ethz.ch> writes:
(sorry Gordon for not getting back to you on this, I've actually been
mulling it over a bit...)

combine does more (or is supposed to, anyway) than rbind or cbind,
appending non-overlapping rows and columns simultaneously (and
introducing NAs in the implied missing values where features present
in only one eSet have to be aligned with samples present only in the
other, and vice versa).
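
The merge-like behaviour described above can be sketched in base R for plain matrices. This is only an illustration under stated assumptions, not the Biobase implementation: the helper name combine_by_names is hypothetical, and it assumes both inputs carry unique row and column names.

```r
## Hypothetical helper (not Biobase code): merge two matrices by dimnames.
## Assumes row and column names are present and unique in both inputs.
combine_by_names <- function(x, y) {
  stopifnot(is.matrix(x), is.matrix(y),
            !is.null(rownames(x)), !is.null(colnames(x)),
            !is.null(rownames(y)), !is.null(colnames(y)))
  rows <- union(rownames(x), rownames(y))
  cols <- union(colnames(x), colnames(y))
  ## Start from an all-NA matrix covering every row and column name seen.
  out <- matrix(NA, nrow = length(rows), ncol = length(cols),
                dimnames = list(rows, cols))
  ## Cells present in both inputs must agree exactly.
  shared_r <- intersect(rownames(x), rownames(y))
  shared_c <- intersect(colnames(x), colnames(y))
  if (!identical(x[shared_r, shared_c, drop = FALSE],
                 y[shared_r, shared_c, drop = FALSE]))
    stop("overlapping cells differ")
  ## Fill in each input by name; cells covered by neither stay NA.
  out[rownames(x), colnames(x)] <- x
  out[rownames(y), colnames(y)] <- y
  out
}

m <- matrix(1:20, nrow = 5, dimnames = list(LETTERS[1:5], letters[1:4]))
## Rows A-C / columns a-c combined with rows C-E / columns c-d:
## cells such as ("A", "d") appear in neither input and come out as NA.
res <- combine_by_names(m[1:3, 1:3], m[3:5, 3:4])
```

Unlike rbind() or cbind(), the result grows in both directions at once, and the alignment is decided entirely by names rather than by position.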

In mulling this over I found a bug in the current combine (fixed in
devel), and noticed the somewhat over-restrictive description of
combine,eSet,eSet-method.

I'm kind of wondering whether Gordon's original question was prompted
by a malfunctioning / misleading combine, or by a desire to have a
more consistent interface to rbind / cbind?
I know 'why', but it's a little too bad that, for this design goal,
rbind2 is not named, well, rbind.

I have implemented rbind2 / cbind2 in my local copy of Biobase, and
will likely commit over the next day or so. The code is basically

setMethod("cbind2",
          signature=signature(x="eSet", y="eSet"),
          function(x, y) {
            ## check that featureNames the same, sampleNames differ,
            ## and then...
            combine(x, y)
          })

Gordon, is this the effect you're looking for?
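
For readers without Biobase at hand, the checks hinted at in the comment can be sketched on a toy S4 class. Everything here is an assumption for illustration: the class ToySet and its exprs slot are hypothetical stand-ins for eSet, and the method body uses cbind() directly instead of combine(). The method signature includes `...` because the cbind2() generic in current versions of R has it.

```r
library(methods)

## Hypothetical toy class standing in for eSet; it holds only a matrix.
setClass("ToySet", representation(exprs = "matrix"))

setMethod("cbind2", signature(x = "ToySet", y = "ToySet"),
          function(x, y, ...) {
            ex <- x@exprs
            ey <- y@exprs
            ## Features (rows) must match exactly, in the same order.
            if (!identical(rownames(ex), rownames(ey)))
              stop("feature names must be identical")
            ## Samples (columns) must be new, so nothing is merged by name.
            if (length(intersect(colnames(ex), colnames(ey))) > 0)
              stop("sample names must not overlap")
            new("ToySet", exprs = cbind(ex, ey))
          })

a <- new("ToySet", exprs = matrix(1:4, nrow = 2,
                                  dimnames = list(c("f1", "f2"), c("s1", "s2"))))
b <- new("ToySet", exprs = matrix(5:6, nrow = 2,
                                  dimnames = list(c("f1", "f2"), "s3")))
ab <- cbind2(a, b)   # three samples, same two features
```

With the checks in place the method can only append columns; any input that would require merging by name is rejected instead.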

Martin Morgan

  
    
#
I guess another difference between rbind / cbind / combine and rbind2
/ cbind2 is that the latter only allow for two arguments (perhaps that
is what the '2' is for?) whereas the former will do their work on any
number of arguments. Martin

Martin Maechler <maechler at stat.math.ethz.ch> writes:

  
    
#
The idea is for rbind2 and cbind2 to be able to deal with multiple 
arguments by combining them pairwise, where there is some chance of 
dealing with the ambiguity.  There are some notes in the man page, and 
use of these in a non-trivial way requires buying into a number of other 
things that one may or may not want to do (and some care may be needed 
to turn this on and off in appropriate situations).
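
As a rough sketch of the pairwise idea, a two-argument cbind2() can be folded over any number of inputs with Reduce(). The wrapper name cbind_all is hypothetical, and plain matrices are used so that the default cbind2() method from the methods package applies; this does not show how R itself wires cbind() to cbind2().

```r
library(methods)

## Hypothetical wrapper: apply cbind2() pairwise, left to right,
## i.e. cbind2(cbind2(a, b), c) for three inputs.
cbind_all <- function(...) Reduce(cbind2, list(...))

a <- matrix(1:2, nrow = 2)
b <- matrix(3:4, nrow = 2)
c3 <- matrix(5:6, nrow = 2)

wide <- cbind_all(a, b, c3)
```

Because each step sees only two objects, any ambiguity has to be resolved one pair at a time, which is the point made above.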

ETH-Martin, while I understand your desire to make rbind2 etc some sort 
of standard for S4, as FHCRC-Martin said, that is not the operation that 
is being performed (at least not very often).  We typically have much 
more complex arrangements (and in a sense this would be more like merge 
and friends), but still different enough that I think we need to keep our 
notion of combine.
Martin Morgan wrote:

  
    
#
Thanks to both Martins for the replies.

I hadn't realised before that combine() is actually a merge-like function, 
although admittedly more careful reading of the help page would have warned 
me.  The name did confuse me: combine() is unlike c() in the base package 
but instead very similar to merge().

I really did want genuine rbind() and cbind() functions.  I now see that 
combine() does more than I want, and the possibility of unwanted effects 
gives me less trust in it for my work.

There is some difference in philosophy here I think.  I think of 
microarray data objects as analogous to matrices, whereas combine() is 
viewing them as analogous to data.frames.  It makes sense to "merge" 
data.frames, but not matrices, because row and column names might not be 
unique.  I am quite happy to entertain microarray objects with repeated 
row or column names.  Even if I weren't, I would find it hard to ensure 
that sample names are unique across different experimental runs, 
especially considering that the names may be set by data files and 
software which are not under my control.

All the best
Gordon
On Mon, 5 May 2008, Martin Morgan wrote:

            
#
2008/5/6 Gordon K Smyth <smyth at wehi.edu.au>:
There are always several ways to skin a cat, but the data structure
proposed for microarray data is becoming rather handy and saves one the
trouble of reinventing the wheel (and I can tell you that I am of the
picky kind).
It can probably do a lot of what you need, and take care of the
bookkeeping for you.
For example, the featureData slot can accommodate repeated names in
one of its columns if you have any need for that.
As for not having unique sample names, I can tell you that you *do*
implicitly have them: the position of each column in a matrix is a way
to identify your data. Making whatever you have unique is only a matter
of using a sequence of integers, for example.

Hoping this helps,


L.
#
On Tue, 6 May 2008, Laurent Gautier wrote:

            
Dear Laurent,

You've interpreted my post to say almost the opposite of what I intended, 
no doubt my fault for making such an obscure comment late at night.

I can only agree with you that data objects can be useful, that 
featureData columns can contain anything, and that matrices have column 
numbers, and be sobered by the fact that you believe these things to be 
new to me.

If you play with merge() or combine(), you'll find that column number has 
no significance in these functions, and instead column names take 
precedence in determining sample identity.  This can have spectacular 
consequences.  This is not to say that they are bad functions, not at all, 
just that they make a rather strong set of assumptions, which are different 
than those made by cbind() and rbind().  For work done at the FHCRC, the 
combine() assumptions are useful and productive (eg, row names might be 
Affy probe IDs and col names might be patient IDs, both of which should be 
unambiguous through an entire study), whereas they're not quite so useful 
for the type of data I see most often.  In making this observation, I am 
not backing away from the whole concept of a data class, just the use of 
combine() over rbind() or cbind().  Hope this is a little clearer.
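
A small base-R sketch of the difference, using made-up data: two hypothetical runs in which the software assigned the same sample name, "s1". Only merge() and cbind() are shown; combine() is not called here.

```r
## Two hypothetical runs whose files both name their sample "s1".
run1 <- matrix(c(1, 2), nrow = 2, dimnames = list(c("p1", "p2"), "s1"))
run2 <- matrix(c(10, 20), nrow = 2, dimnames = list(c("p1", "p2"), "s1"))

## cbind() identifies columns by position, so both samples are kept,
## each still labelled "s1".
by_position <- cbind(run1, run2)

## merge() keys on every column name the inputs share. Here that is both
## "probe" and "s1", so the measurements themselves become part of the key.
df1 <- data.frame(probe = rownames(run1), run1)
df2 <- data.frame(probe = rownames(run2), run2)
by_name <- merge(df1, df2)

ncol(by_position)   # both samples survive
nrow(by_name)       # no row matches on probe and s1 together
```

Neither result is wrong given its assumptions; they simply answer different questions about what makes two columns "the same sample".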

Best wishes
Gordon