Skip to content

[Bioc-devel] avoiding clashes of different S4 methods with the same generic

14 messages · Aaron Lun, Michael Lawrence, Martin Morgan +2 more

#
Dear List,

When a S4 method for the same class is defined in two separate packages 
(i.e., under the same generic), and both packages are loaded into a R 
session, it seems that the method from the package loaded later clobbers 
the method from the package loaded first. Is it possible to specifically 
call the method in the first package when both packages are loaded? If 
not, how should we protect against this?

To give some context; the csaw package currently defines a normalize() 
method for SummarizedExperiment objects, using the generic from 
BiocGenerics. However, if some other hypothetical package (I'll call it 
"swings", for argument's sake) were to define a normalize() method with 
a SE signature, and if the swings package were to be loaded after csaw, 
then it seems that all calls to normalize() would use the method defined 
by swings, rather than that defined by csaw.

Now, for usual functions, disambiguation would be easy with "::", but I 
don't know whether this can be done in the S4 system, given that the 
details of dispatch are generally hidden away. The only solution I can 
see is for csaw (and/or swings) to define a SE subclass; define the 
normalize() method using the subclass as the signature, such that S4 
dispatch will now go to the correct method; and hope that no other 
package redefines normalize() for the subclass.

Is this what I should be doing routinely, i.e., define subclasses and 
methods for those subclasses in all my packages? Or am I missing 
something obvious? I would have expected such clashes to be more of a 
problem, given how many new packages are being added to BioC at every 
release.

Cheers,

Aaron
#
On Tue, Apr 26, 2016 at 11:00 AM, Aaron Lun <alun at wehi.edu.au> wrote:
I would recommend against defining subclasses of basic data structures
that differ only in their behavior. The purpose of
SummarizedExperiment is to store data. One might use inheritance to
modify how the data are stored, or to store new types of data,
although the latter may be best addressed through composition.

To extend behavior, define methods. The generic represents the verb
and thus the semantics of the operation. In general, method conflicts
indicate that the design is broken. In this case, the normalize()
generic has a very general name. There is no one way to "normalize" a
SummarizedExperiment. It would be difficult for the reader to
understand such ambiguous code. To indicate a specific normalization
algorithm, we either need a more specific generic or we need to
parameterize it further.

One way to make more specific generics would be to give them the same
name, "normalize", but define them in different namespaces and require
:: qualification. That would mean abandoning the BiocGenerics generic
and it would only work if each package provides only one way to
normalize. Or, one could give them different names, but it would be
difficult to select a natural name, and it's not clear whether the
abstract notion of normalization should be always coupled with the
method.

A more flexible/modular approach would be to augment the signature of
BiocGenerics::normalize to indicate a normalization method and rely on
dual-dispatch:

normalize(se, WithSwings())
normalize(se, WithCSaw())

Roughly, one example of this approach is
VariantAnnotation::locateVariants() and its variant type argument.

The affy package (or something around it) auto-qualifies the generic
via a method argument; something like S3 around S4. For example
normalize(se, "swings") would call normalize.swings(se), where
normalize.swings itself could be generic. Another way to effect
cascading dispatch is through composition, where the method object
either is a function or can provide one to implement the normalization
(emulating message passing OOP), which would allow normalize() to
implemented simply as:

normalize <- function(x, method, ...) normalizer(method)(x, ...)

One issue is that the syntax is a bit unconventional and users might
end up preferring the affy approach, with a normalize_csaw() and
normalize_swings(). But I like the modular, dynamic approach outlined
above.

Thoughts?

Michael
#
On 04/26/2016 04:47 PM, Michael Lawrence wrote:
I like the dual dispatch method quite a bit (but wonder why we get 
several swings but only one csaw? Maybe a csaw implies two participants 
[though I think I once in a while csaw-ed alone], so a singular csaw and 
a pair of swings balance out?), partly because it's very easy to extend 
(write another method) and the second argument can be either lightweight 
or parameterized.

 From a user perspective normalizeCsaw / normalizeSwings makes the 
available options only a tab key away; maybe that's why Michael 
suggested With*?

Martin
This email message may contain legally privileged and/or confidential information.  If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited.  If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
#
On Tue, Apr 26, 2016 at 2:16 PM, Martin Morgan
<martin.morgan at roswellpark.org> wrote:
I could go along with the dual dispatch. "Swings" is short for "Set of
swings". Usually, there are several swings in a row, but only one
see-saw.
That's a good point.
#
On 04/26/2016 05:28 PM, Michael Lawrence wrote:
> On Tue, Apr 26, 2016 at 2:16 PM, Martin Morgan
> <martin.morgan at roswellpark.org> wrote:
>> >
>> >On 04/26/2016 04:47 PM, Michael Lawrence wrote:
>>> >>
>>> >>On Tue, Apr 26, 2016 at 11:00 AM, Aaron Lun<alun at wehi.edu.au> wrote:
...
 >>>> >>>BiocGenerics. However, if some other hypothetical package (I'll 
call it
 >>>> >>>"swings", for argument's sake) were to define a normalize() 
method with a
...
 >> >I like the dual dispatch method quite a bit (but wonder why we get 
several
Googling for "how many swings per see-saw" took me to

   https://www.cpsc.gov//PageFiles/108601/playgrnd.pdf

where it is apparent that swings are much more dangerous than see-saws 
(e.g., 51 matches for "swing" versus 4 for "see-saw"; "Swings ... were 
involved in about 19 ... percent of injuries ... See-saws accounted for 
about three percent"; "Homemade rope, tire, or tree swings were also 
involved in a number of hanging deaths" [no mention of death by see-saw]).

I think for the sake of our users, especially our younger users, we do 
not want to consider swings, or even methods on swings, further.

Martin


This email message may contain legally privileged and/or confidential information.  If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited.  If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
#
Yes, but "monkeyBars" doesn't have quite the same pithiness for a
package name.

Anyway, the dual dispatch mechanism sounds most interesting. I assume
that means we'd have to define some sort of base "normalizeParam" class,
and then derive "csawNormParam" and "swingsNormParam" subclasses, so
that specific methods can be defined for each signature.

- Aaron
Martin Morgan wrote:
#
Hi,

I would not discard defining a SummarizedExperiment subclass so quickly.
SummarizedExperiment is very generic and can contain any kind of data.
IIUC the csaw package uses SummarizedExperiment to store a particular
kind of data (ChIP-seq data) and I believe specialization is a
legitimate situation for defining a subclass, even if the subclass is
a "straight" subclass i.e. a subclass that doesn't add new slots or
doesn't touch the existing slots.

OTOH introducing a "straight" subclass only to define one specialized
method on it (the "normalize" method in this case) might not be worth
it since there is a cost for such class, even if that cost is minimal:
a cost for the user (one new container/constructor to deal with) and a
cost for the developer (e.g. multiplication of coerce methods).

Changing the signature of the normalize() generic in BiocGenerics and
introducing dual dispath is of course doable but that means the
maintainers of the packages that define methods on this generic are
ok with the dual dispatch game and are willing to make the required
modifications to their packages. It's an important change and I don't
see an easy way to make it happen smoothly (i.e. thru a
deprecated/defunct cycle).

Here is the list of packages that currently define methods for
BiocGenerics::normalize():

   affyPLM
   Cardinal
   codelink
   CopyNumber450k
   csaw
   diffHic
   EBImage
   epigenomix
   MSnbase
   oligo
   qpcrNorm
   scran

[Interestingly the scran package defines a default "normalize" method
(i.e. a normalize,ANY method)].

Whether we make the second argument lightweight or parameterized (which
is something that would need to be decided at the level of the generic)
these packages will break as soon as we change the signature of the
generic. So we'll need to wait after the release before this happens.

Personally I find the lightweight second argument not particularly
intuitive, elegant, or user-friendly. I'd rather type
normalizeSwing(se, ...) or normalize(se, SwingParam(...)) than
normalize(se, WithSwing(), ...).

Last thing: In case of a parameterized second argument, do we really
need a virtual normalizeParam class as parent of all the concrete
normalizeParam* classes? If so then I guess we would need to have it
defined in BiocGenerics but I think we should try hard to not start
defining classes in this package (that could take us too far...)

H.
On 04/26/2016 03:03 PM, Aaron Lun wrote:

  
    
#
On the plus side, a straight subclass would make it easier to solve the 
same problem for other generics. For example, let's say that csaw and 
swings both defined the estimateDispersions() method for the SE class. 
If we decided to use dual dispatch, then the estimateDispersions() 
generic would also have to be changed to have an additional argument. 
This would not be necessary if the two packages set up their own 
straight subclasses and defined S4 methods for each subclass.

It is slightly annoying, though, to have to define something like a 
csawDataSet class and its constructor, even if the constructor is just a 
trivial wrapper around the SE constructor. Ah well.
Yes - uh - I got too excited with S4 methods, and things got out of 
hand. I've been pruning out some of the sillier definitions.

- Aaron
#
On 04/27/2016 02:12 AM, Herv? Pag?s wrote:
By 'parameterized' I meant that the second argument could have slots or 
not; I'm not sure whether the use was clear. I don't think the generic 
needs to know about parameterization (also, I agree below that there is 
no need for a base class).
I agree, I don't think a base class or decisions about parameterization 
would be necessary -- create a MerryGoRoundParam() (parameterized with 
horses or not; much safer than swings!) and the 
normalize,SummarizedExperiment,MerryGoRoundParam-method and you're done.

Also wouldn't dispatch with second arg 'missing' or ANY recover current 
behavior? Maybe it would need to be named and at the end to be 
implemented without others buying in.

setGeneric("foo", function(x, y, ...) standardGeneric("foo"))
## setGeneric("foo", function(x, ..., y) standardGeneric("foo"))

.A <- setClass("A", representation(x="numeric"))

setMethod("foo", "A", function(x, y, ...) "ImanA")

 > foo(.A())
[1] "ImanA"

For the normalizeSlides etc approach... I agree that it is probably 
easier for users, though maybe many users are following a vignette so 
the benefit is relatively minimal; there is definitely value in using a 
common prefix or suffix (normalize* or *Param()).

Here's what limma does

 > normalize<tab>
normalizeBetweenArrays     normalizeQuantiles
normalizeCyclicLoess       normalizeRobustSpline
normalizeForPrintorder     normalizeVSN
normalizeForPrintorder.rg  normalizeVSN.default
normalizeMedianAbsValues   normalizeVSN.EListRaw
normalizeMedianValues      normalizeVSN.RGList
normalizePath              normalizeWithinArrays

and note the complexity in normalizeVSN, where we're back at desiring a 
generic! If one wanted to implement 
normalizeQuantiles,SummarizedExperiment-method then it would be back to 
negotiating with individual package authors to introduce a new BiocGenerics.

Martin
This email message may contain legally privileged and/or confidential information.  If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited.  If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
#
On Tue, Apr 26, 2016 at 11:12 PM, Herv? Pag?s <hpages at fredhutch.org> wrote:
If the data are more specialized, specialize the data structure, but
the fact that the specialization solves the normalize() ambiguity is a
mere coincidence. There are two different concerns.
In conjunction with what Martin said, you could define a
"ANY","missing" method that emits a deprecation warning, and then
recall the generic using NULL or something for the second argument so
that it falls through. Packages would only need to fix the formals of
their method definition.
Sure, WithSwing() could hold arguments as well, but I agree that the
Param suffix is more consistent. The Param naming is not great for
autocompletion. Though I guess the interface could provide hints based
on the defined methods.
I would say no, no real need for a base class.
#
I agree with Herve that making a subclass (is that the technical term?)
might be a good idea.  This is what I would do.  And this is what has been
done historically for all the extensions of ExpressionSet and eSet from
Biobase.  The issue with a general class like SummarizedExperiment is that
methods defined on that class really ought to make sense for any instance
of the class, and since the class can contain basically any form of data, I
don't see how one can define any kind of informed data processing. For
example, SummarizedExperiment can contain any number of different assays
and what happens with all of those when you call normalize().  By making a
subclass you're can say that this is an object which contains a very
specific form of data (which can cue checked by extending the validity
check), and you can now start to think about data processing, while still
benefitting from the general infrastructure.

But I would also say that normalize() is a terrible choice for a method
since it is basically maximally uninformative about what happens.

Best,
Kasper

On Wed, Apr 27, 2016 at 7:24 AM, Michael Lawrence <lawrence.michael at gene.com

  
  
#
On 04/27/2016 03:18 AM, Martin Morgan wrote:
The name of the second argument would depend on whether it's lightweight
or parameterized. Could be something like 'with' or 'method' for the
former and 'param' for the latter. Also the man page for the generic
would need to be able to briefly describe that argument and provide some
guidelines to developers. The more consistent the normalize() interface
is across the dozens of methods, the better for the user.

H.

  
    
#
On 04/27/2016 04:24 AM, Michael Lawrence wrote:
Isn't it what I'm doing when I define a "straight" subclass? The fact 
that I don't need to alter the internal representation is an
implementation detail (and it could change at some point) but what's
important is that from a user point of view my container is now
tagged/specialized. I might only have one specialized method for it
at the moment but I might have more in the future, and/or other package
developers might build on top of my specialized container and add
specialized methods in the future (and I cross my fingers that since
I own the specialized container and already implemented a "normalize"
method for it, nobody will redefine that method in their package).

H.

  
    
#
On Wed, Apr 27, 2016 at 10:52 AM, Herv? Pag?s <hpages at fredhutch.org> wrote:
In general, yes, there are behaviors that clearly pertain to specific
types of data. But the normalize() generic is just too generic. There
are multiple ways to normalize ChIP-seq data, and we can't count on
one method to implement all of them. We want to encourage innovation
in statistical methods, so we should make it easy to extend
Bioconductor in that way.