[Bioc-devel] Compatibility of Bioconductor with tidyverse S3 classes/methods

Fri, Feb 7, 2020 3:38 PM

Thanks Guys for the discussion (I am learning a lot),

*To Martin:*

Thanks for the tips. I will start implementing those S4-style methods:
https://github.com/stemangiola/ttBulk/issues/7

I would *really* like to be part of the Bioconductor community with this
package, if just this

"One would expect the vignette and examples to primarily emphasize the
use of the interoperable (SummarizedExperiment) version."

could become this

"[..] interoperable (SummarizedExperiment) version."

I agree with the integration priority of Bioconductor, but this repository
(and this philosophy) is more than its data structures. There should be
room for more than one way of doing things, provided that the principles
are respected.

If this is true, I could really invest energy in using methods as you
suggested and in implementing the SummarizedExperiment stream. With tips
from the community, the link will grow stronger and stronger with time and
versions.


*To Vincent*

Thanks a lot for the interest.

*> One thing I feel is missing is an approach to the following question:
[..] How do I make one that works the way ttBulk's operators work?*

I'm afraid I don't really understand the question. Are you wondering about
extension of the framework? Or creating a similar framework for other
applications? Could you please reformulate, maybe giving a concrete
example?

*> Are there patterns there that are preserved across different operators? *

What the operators share is the code that integrates the newly calculated
information back into the tibble (via dplyr), the validation functions,
and so on.

*> Can they be factored out to improve maintainability?*

Almost surely yes. This is the first version; I hope to see enough
interest, to improve the API based on feedback, and to receive
(intellectual and practical) contributions from people with more
software-engineering expertise.

*> validObject *

It seems a good method and, as far as I tested, it works for S3 objects as
well. I will try to implement it. In fact I have already added it as an
issue on GitHub:
https://github.com/stemangiola/ttBulk/issues/6

At the moment I have a custom validation function.

Best wishes.

*Stefano *



Stefano Mangiola | Postdoctoral fellow

Papenfuss Laboratory

The Walter Eliza Hall Institute of Medical Research

+61 (0)466452544


On Sat, Feb 8, 2020 at 01:54 Vincent Carey
<stvjc at channing.harvard.edu> wrote:

This is an interesting discussion and I hope it is ok to continue it a
bit.  I found the readme for the ttBulk repo extremely enticing and I am
sure many people will want to explore this way of working with genomic
data.  I have only a few moments to explore it and did not read the
vignette, but it looks to me as if it is mostly recapitulated in the
README, which is an excellent overview.

One thing I feel is missing is an approach to the following question: I
like the idea of a pipe-oriented operator for programming steps in genomic
workflows.  How do I make one that works the way ttBulk's operators work?
Well, I can have a look at ttBulk:::reduce_dimensions.ttBulk ...

It's involved.  Are there patterns there that are preserved across
different operators?  Can they be factored out to improve maintainability?

One other point before I run

It seems to me the operators "require" that certain
fields be defined in their tibble operands.

names(attributes(counts))
[1] "names"      "class"      "row.names"  "parameters"

attributes(counts)$names
[1] "sample"             "transcript"         "Cell type"
[4] "count"              "time"               "condition"
[7] "batch"              "factor_of_interest"

validObject(counts)
Error in .classEnv(classDef) :
  trying to get slot "package" from an object of a basic class ("NULL") with no slots

Enter a frame number, or 0 to exit

1: validObject(counts)
2: .classEnv(classDef)


I think you mentioned validity checking in a previous email.  This is a
feature of S4 that is not frequently invoked.  Of course validObject will
not work on counts, but do you have something similar?  (Not all working
S4 objects from Bioc will pass validObject tests, but they should....)
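
For readers unfamiliar with the S4 validity checking mentioned above, here
is a minimal sketch. The class "Counts" and its slots are hypothetical and
are not part of ttBulk or of any Bioconductor package. It illustrates why a
working S4 object can still fail validObject(): the validity function runs
when the object is created with new(), but not when a slot is assigned
directly.

```r
library(methods)

# Hypothetical class (an assumption for illustration): counts with one id each
setClass("Counts",
    slots = c(values = "numeric", ids = "character"),
    validity = function(object) {
        if (length(object@values) != length(object@ids))
            "'values' and 'ids' must have the same length"
        else if (any(object@values < 0))
            "'values' must be non-negative"
        else
            TRUE
    }
)

# new() runs the validity function
ok <- new("Counts", values = c(5, 0, 12), ids = c("a", "b", "c"))

# Direct slot assignment checks only the class of the new value,
# so the object can silently become invalid
broken <- ok
broken@values <- c(-1, 0, 12)

# Helper: TRUE if validObject() succeeds, FALSE if it signals an error
is_valid <- function(x) {
    tryCatch({ validObject(x); TRUE }, error = function(e) FALSE)
}

is_valid(ok)
is_valid(broken)
```

Calling validObject() explicitly at the end of functions that modify slots
is one way to keep such objects consistent.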



On Fri, Feb 7, 2020 at 5:26 AM Martin Morgan <mtmorgan.bioc at gmail.com>
wrote:

yes, absolutely. A common pattern might be to implement a generic

    setGeneric("foo", function(x, ...) standardGeneric("foo"))

an 'internal' function that implements the method on base R data types

    .foo <- function(x) {
        stopifnot("'x' must be a matrix" = is.matrix(x))
        t(x)
    }

and methods that act as a facade to the implementation

    setMethod("foo", "tbl_df", function(x) {
        x <- as.matrix(x)
        result <- .foo(x)
        as_tibble(result)
    })

    setMethod("foo", "SummarizedExperiment", function(x) {
        result <- .foo(assay(x))
        assays(x)[["foo"]] <- result
        x
    })

One would expect the vignette and examples to primarily emphasize the use
of the interoperable (SummarizedExperiment) version.

Martin Morgan

From: stefano <mangiolastefano at gmail.com>
Date: Friday, February 7, 2020 at 12:31 AM
To: Michael Lawrence <lawrence.michael at gene.com>
Cc: Martin Morgan <mtmorgan.bioc at gmail.com>, "bioc-devel at r-project.org" <
bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] Compatibility of Bioconductor with tidyverse S3
classes/methods

Would this scenario satisfy "make the package _directly_ compatible with
standard Bioconductor data structures"?

If the input is a SummarizedExperiment, return a SummarizedExperiment; if
the input is a tbl_df or ttBulk, return a ttBulk (?)


Best wishes.
Stefano

Stefano Mangiola | Postdoctoral fellow
Papenfuss Laboratory
The Walter Eliza Hall Institute of Medical Research
+61 (0)466452544


On Fri, Feb 7, 2020 at 16:15 Michael Lawrence
<lawrence.michael at gene.com> wrote:
I would urge you to make the package _directly_ compatible with
standard Bioconductor data structures; no explicit conversion. But you
can create wrapper methods (even on an S3 generic) that perform the
conversion automatically. You'll probably want two separate APIs
though (in different styles); for one thing, automatic conversion is
obviously not possible for return values.
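
A rough sketch of what such wrapper methods on an S3 generic could look
like, using only base R. Everything here is an assumption made for
illustration: the generic scale_counts(), the column names 'sample',
'transcript' and 'count', and the use of a plain matrix as a stand-in for a
richer Bioconductor container. The matrix method does the work; the
data.frame method converts on the way in and returns the type it was given.

```r
# Hypothetical S3 generic (not part of ttBulk or Bioconductor)
scale_counts <- function(x, ...) UseMethod("scale_counts")

# Core method on a transcripts-by-samples matrix:
# divide every column by its column total
scale_counts.matrix <- function(x, ...) {
    sweep(x, 2, colSums(x), "/")
}

# Wrapper for a long-format data frame with (assumed) columns
# 'sample', 'transcript' and 'count'
scale_counts.data.frame <- function(x, ...) {
    # long -> wide: one row per transcript, one column per sample
    wide <- tapply(x$count, list(x$transcript, x$sample), sum)
    scaled <- scale_counts(wide)
    # look up the scaled value for each original row
    idx <- cbind(as.character(x$transcript), as.character(x$sample))
    x$scaled <- scaled[idx]
    x
}

m <- matrix(c(10, 30, 5, 15), nrow = 2,
            dimnames = list(c("t1", "t2"), c("s1", "s2")))
long <- data.frame(sample = c("s1", "s1", "s2", "s2"),
                   transcript = c("t1", "t2", "t1", "t2"),
                   count = c(10, 30, 5, 15))

scaled_m <- scale_counts(m)        # matrix in, matrix out
scaled_long <- scale_counts(long)  # data frame in, data frame out
```

The conversion on input is automatic, but each method still has to decide
what to return, which is why two API styles may be needed.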

Michael

On Thu, Feb 6, 2020 at 5:34 PM stefano <mailto:mangiolastefano at gmail.com>
wrote:

Thanks Michael,

Yes, in a sense ttBulk and SummarizedExperiment can be considered two
interfaces. Would it be fair enough to create a function that converts
from one to the other, although the default would be ttBulk?

> I'm not sure the tidyverse is a great answer to the user interface,
> because it lacks domain semantics

Would it be fair to say that the ttBulk class could be considered a tibble
with specific semantics? In the sense that it holds information about key
column names (.sample, .transcript, .abundance, .normalised_abundance,
etc.), and has a validator (that is triggered by every ttBulk function).
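
The idea of a table that carries its own semantics can be sketched in base
R as follows. This is not ttBulk's implementation: the class name
"tidy_counts", the constructor, the validator and the layout of the
"parameters" attribute are all assumptions for illustration, and a plain
data.frame stands in for a tibble.

```r
# Hypothetical constructor: record which columns play the sample,
# transcript and abundance roles, then validate
new_tidy_counts <- function(df, sample, transcript, abundance) {
    attr(df, "parameters") <- list(sample = sample,
                                   transcript = transcript,
                                   abundance = abundance)
    class(df) <- c("tidy_counts", class(df))
    validate_tidy_counts(df)
}

# Hypothetical validator: returns its input if all checks pass
validate_tidy_counts <- function(x) {
    roles <- attr(x, "parameters")
    missing_cols <- setdiff(unlist(roles), names(x))
    if (length(missing_cols) > 0)
        stop("missing column(s): ", paste(missing_cols, collapse = ", "))
    if (!is.numeric(x[[roles$abundance]]))
        stop("abundance column must be numeric")
    keys <- paste(x[[roles$sample]], x[[roles$transcript]], sep = "|")
    if (anyDuplicated(keys) > 0)
        stop("each sample/transcript pair must appear only once")
    x
}

counts <- data.frame(sample = c("s1", "s1", "s2"),
                     transcript = c("t1", "t2", "t1"),
                     count = c(10, 30, 5))

tc <- new_tidy_counts(counts, "sample", "transcript", "count")

# A table without the transcript column is rejected
bad <- tryCatch(
    new_tidy_counts(counts[, c("sample", "count")],
                    "sample", "transcript", "count"),
    error = function(e) conditionMessage(e)
)
```

Each operator can call the validator on its input before doing any work, so
that a table missing a key column fails early with a clear message.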

I think at the moment, given (i) the S3 problem, and (ii) the lack of a
formal foundation on the SummarizedExperiment interface (which might
itself require S4, where a SummarizedExperiment could be a slot?), my
package would belong more on CRAN until those two issues have been
resolved.

I imagine there are not many cases where a CRAN package migrated to
Bioconductor after complying with the ecosystem policies.

Thanks a lot.

Best wishes.

Stefano



Stefano Mangiola | Postdoctoral fellow

Papenfuss Laboratory

The Walter Eliza Hall Institute of Medical Research

+61 (0)466452544



On Fri, Feb 7, 2020 at 12:12 Michael Lawrence
<lawrence.michael at gene.com> wrote:

There's a difference between implementing software, where one wants
formal data structures, and providing a convenient user interface.
Software needs to interface with other software, so a package could
provide both types of interfaces, one based on rich (S4) data
structures, another on simpler structures with an API more amenable to
analysis. I'm not sure the tidyverse is a great answer to the user
interface, because it lacks domain semantics. This is still an active
area of research (see Stuart Lee's plyranges, for example). I hope you
can find a reasonable compromise that enables you to integrate ttBulk
into Bioconductor, so that it can take advantage of the synergies the
ecosystem provides.

PS: There is no simple fix for your example.

Michael

On Thu, Feb 6, 2020 at 4:12 PM stefano <mangiolastefano at gmail.com>
wrote:

Thanks a lot for your comments, Martin and Michael.

Here I reply to Martin's comment. Michael, I will try to implement your
solution!

I think a key point from
https://github.com/Bioconductor/Contributions/issues/1355#issuecomment-580977106
(that I was overlooking) is

*>> "So to sum up: if you submit a package to Bioconductor, there is an
expectation that your package can work seamlessly with other Bioconductor
packages, and your implementation should support that. The safest and
easiest way to do that is to use Bioconductor data structures"*

In this case my package would not be suited, as I do not use pre-existing
Bioconductor data structures; instead I see value in using a simple
tibble, for the reasons explained in part in the README
https://github.com/stemangiola/ttBulk (harvesting the power of tidyverse
and friends for bulk transcriptomic analyses).

*>> "with the minimum standard of being able to accept such objects even
if you do not rely on them internally (though you should)"*

With this I can comply, in the sense that I can build converters to and
from SummarizedExperiment (for example).

* >> "If you don't want to do that, then that's a shame, but it would
suggest that Bioconductor would not be the right place to host this
package."*

Well said.

In summary, I do not rely on Bioconductor data structures, as I am
proposing another paradigm, but my back end is largely made of
Bioconductor analysis packages that I would like to interface with the
tidyverse. So

1) Should I build converters to Bioc. data structures, and force the use
of S3 objects (needed for the tidyverse to work), or
2) Submit to CRAN

I don't have strong feelings for either, although I think Bioconductor
would be a good fit. Community, please give me your honest opinions; I
will take them seriously and proceed.



Best wishes.

*Stefano *



Stefano Mangiola | Postdoctoral fellow

Papenfuss Laboratory

The Walter Eliza Hall Institute of Medical Research

+61 (0)466452544


On Fri, Feb 7, 2020 at 10:46 Martin Morgan
<mtmorgan.bioc at gmail.com> wrote:

The idea isn't to use S4 at any cost, but to 'play well' with the
Bioconductor ecosystem, including writing robust and maintainable code.

This comment
https://github.com/Bioconductor/Contributions/issues/1355#issuecomment-580977106
provides some motivation; there was also an interesting exchange on the
Bioconductor community slack about this (join at
https://bioc-community.herokuapp.com/; discussion starting with
https://community-bioc.slack.com/archives/C35G93GJH/p1580144746014800).
The plyranges package http://bioconductor.org/packages/plyranges and the
recently accepted fluentGenomics workflow
https://github.com/Bioconductor/Contributions/issues/1350 provide
illustrations.

In your domain it's really surprising that your package does not use
(Import or Depend on) the SummarizedExperiment or GenomicRanges packages.
From a superficial look at your package, it seems like something like
`reduce_dimensions()` could be defined to take & return a
SummarizedExperiment and hence benefit from some of the points in the
github issue comment mentioned above.

Certainly there is a useful transition, both 'on the way in' to a
SummarizedExperiment, and after leaving the more specialized bioinformatic
computations to, e.g., display a pairs plot of the reduced dimensions,
where one might re-shape the data to a tidy format and use 'plain old'
tibbles; the fluentGenomics workflow might provide some guidance.

At the end of the day it would not be surprising for Bioconductor packages
to make use of tidy concepts and data structures, particularly in the
vignette, and it would be a mistake for Bioconductor to exclude
well-motivated 'tidy' representations.

Martin Morgan

On 2/6/20, 5:46 PM, "Bioc-devel on behalf of stefano"
<bioc-devel-bounces at r-project.org on behalf of
mangiolastefano at gmail.com> wrote:

    Hello,

    I have a package (ttBulk) under review. I have been told to replace
    the S3 system with S4. My package is based on the class tbl_df and
    must be fully compatible with tidyverse methods (inheritance). After
    some tests and research I understood that the tidyverse ecosystem is
    not compatible with S4 classes.

    For example, several methods apparently do not handle S4 objects
    based on the S3 class tbl_df:

    ```
    library(tidyverse)
    # register the S3 class so that an S4 class can extend it
    setOldClass("tbl_df")
    setClass("test2", contains = "tbl_df")
    my <- new("test2",  tibble(a = 1))
    my %>%  mutate(b = 3)

      a b
    1 1 3
    ```

    ```
    my <- new("test2",  tibble(a = rnorm(100), b = 1))
    my %>% nest(data = -b)
    Error: `x` must be a vector, not a `test2` object
    Run `rlang::last_error()` to see where the error occurred.
    ```

    Could you please advise whether a tidyverse-based package can be
    hosted on Bioconductor, and whether S4 classes are really mandatory?
    I need to understand if I am forced to submit to CRAN instead
    (although Bioconductor would be a good fit, since I try to interface
    transcriptional analysis tools with the tidy universe).


    Thanks a lot.
    Stefano

        [[alternative HTML version deleted]]

    _______________________________________________
    mailto:Bioc-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/bioc-devel
