[Bioc-devel] Why should Bioconductor developers re-use core classes?
Thanks for those additional comments, Levi. I don't think you were being unfair to phyloseq, and it sounds like some of the issues involved are still relevant (e.g. new domain, new contributor). I wanted to riff on Martin's comment about "incremental gain rather than perfection". I imagine there are many reasons to use this as a guiding principle, but one of them could be that it helps foster new-to-BioC contributors by not overwhelming them with sub-critical requirements. Interoperability is a feature that developers are already incentivized to leverage, when they realize it, if they haven't buried themselves too deep O:-). Similarly, after taking a peek at those developer docs, I think buildingPackagesForBioC <http://bioconductor.org/developers/how-to/buildingPackagesForBioc/> looks pretty great, and I wish it had existed (or I had known about it) when I unwittingly blundered these early decisions in phyloseq. Thanks again for the discussion/feedback! --- --- "Joey" Paul J. McMurdie II Sent from Gmail On Thu, Oct 19, 2017 at 9:53 AM, Martin Morgan <
martin.morgan at roswellpark.org> wrote:
On 10/19/2017 10:12 AM, Levi Waldron wrote:
Thanks for all your thoughts Joey, and I hope I didn't come across as critical of phyloseq in particular. In fact, the couple packages I created as a post-doc (doppelgangR and ffpe) did exactly the same thing, but unlike phyloseq have never been used enough for it to make much difference :). This comparison with phyloseq and metagenomeSeq is just one that I have personal experience with because I use both packages and it was something of a revelation to me when I realized that MRExperiment and all other eSet-derived classes would "just work" as elements of a MultiAssayExperiment. My motivation in making these slides was mainly to share my own very recent revelations and try to make life better for new package developers and their users. A couple comments below: On Wed, Oct 18, 2017 at 11:36 PM, Paul Joseph McMurdie <joey711 at gmail.com
wrote:
- Hopefully it is obvious from my description, and also what I imagine to be Levi's motivation for making the slide deck, but somehow new eager developers are missing out on this great infrastructure and it isn't because they want to re-implement core stuff. I sure didn't! I simply didn't know what was there or best practices for BioC. *A "beginner's guide to BioC package development"* would have been at the top of my list of things to read back then.
I think this is still not abundantly clear from http://bioconductor.org/developers/ and I'm not sure how much it factors into the package review process. Would it be helpful to have a short questionnaire for creators of new packages that includes some things not part of R CMD BiocCheck, or CHECK, like - what new classes do you define, and what is the purpose of defining them?
[ should have mentioned that the slack channel is #tree-like-se ] A couple of quick comments on new packages / reviews. New package submissions have a checklist that people submit with their package. Since it is not uncommon to find checked items that have not been satisfied, I'm quite cynical about this as a way of enforcing good behavior, though maybe it does something to elevate awareness. One of the check-list items is acknowledge reading the package guidelines https://bioconductor.org/developers/package-guidelines/ which include an S4 section https://bioconductor.org/developers/package-guidelines/#classes and links to general guidance https://bioconductor.org/developers/how-to/efficient-code/# re-use-existing-functionality as well as emphasis on core containers https://bioconductor.org/developers/how-to/commonMethodsAndClasses/ A relatively humorous pattern is for newer team members to eventually lament that new packages don't re-use existing classes and methods, suggesting that more documentation is the solution; usually the litany above is cited, and the newer team member either tweaks or adds to the existing documentation to the point that they think it is surely satisfactory, or sees that more documentation is not an effective answer. This pattern started again yesterday... One thing that Joey said, and that I think is true of a lot of new package contributors, is that they are 'new' developers. This is somewhat ironic, since they are writing packages that are meant (and sometimes are, as with phyloseq) to be widely used and influential; it seems somehow appropriate to keep that channel open, with the domain and other expertise that 'new' developers bring as a trade-off against the sophistication of their current development skills. Of course being new developers mean that they make both 'mistakes' and choices that are not entirely consistent with the overall project, e.g., use of S3 or R6 rather than S4 classes, reimplementation of existing functionality, adopting unfamiliar API conventions (a minor favorite of mine is the presentation of a novel method for some aspect of differential expression, with the primary input a matrix of samples x 'genes' rather than the bioinformatics convention of 'genes' x samples). The diversity of programming and analysis approaches available in R makes it not at all surprising that the insights that can come even after many years of working with Bioconductor are not accessible to many new package contributors. I'll also mention my own philosophy (which I sometimes emphasize with other members of the core team of reviewers) in reviewing packages, which is to seek incremental gain rather than perfection; sometimes there are issues other than S4 class use (!) where greater strides can be made. Martin
- It isn't that I didn't read other established packages. I did. However,
a lot of core BioC tools had gene-expression specific names even for data classes that were not intrinsically gene expression (e.g. it's actually a matrix, or related tables) -- and I'm happy many of these now use more general names like "experiment" or "row". The old names signaled to me "this isn't for you". And I naively, ignorantly, accepted that at mostly face value. - Conversely, sometimes not-inheriting methods is a feature, because it protects users from doing something that is great and appropriate for one domain (gene expression) but totally irrational in another (microbiome). I'm not saying my original implementation made great nuanced decisions about this -- it has many trappings of a new developer -- but I did have some pretty naive users in mind with phyloseq, for whom navigating legacy methods and method names from other domain(s) was expected to be a hurdle. Curious to hear thoughts on this.
Something you did well in phyloseq was define methods for all common user operations, and if you had inherited from eSet, you probably still would've wanted to do this - except that more of your methods would've been just wrappers for inappropriately named eSet methods, and your average user wouldn't have known the difference. Because of your use of methods and not direct slot access by users, I think you could still change the underlying data model without it being noticeable to basic users, even if it expanded the number of methods actually available. Thankfully, SummarizedExperiment I think now avoids these "this isn't for you" signals. - There actually *still isn't core support for evolutionary trees in
BioC* (as mentioned by Joe Paulson and Ben Callahan in other threads). One of phyloseq's key contributions was to leverage the fantastic representation of trees implemented in the CRAN package "ape" in order to support analysis techniques popular among microbiome researchers that require a phylogenetic tree. The integration in the phyloseq-class and ape is necessarily pretty deep, including certain row operations. Users also needed a familiar and simple R interface to manipulate that composite object despite the complex hierarchical relationship among rows. Correct me if I'm wrong, but I think there is still no core BioC support for representing tree-like or bio-taxonomy-like hierarchy among rows in a SummarizedExperiment, or equivalent; and consequently certain row operations may have to be modified more deeply than usual if we were to re-implement phyloseq "the right way". I'd love to hear thoughts on this.
AFAIK you're right, and I don't know the solution, although I hope it can be built on SummarizedExperiment. Looking forward to talking more about this. Even though phyloseq is at the receiving end, I think the criticism is
fair, and I want current and future new BioC contributors to not re-make my mistakes circa 2011-12. I'm happy to help if I can. Cheers, and thanks for the interesting, collegial thread. Joey
Thanks Joey, and I do want to say also that I think phyloseq is
responsible
for making Bioconductor a viable and already superior choice for
statistical analysis of microbiome data!
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.