On 10/19/2017 10:12 AM, Levi Waldron wrote:
Thanks for all your thoughts Joey, and I hope I didn't come across as
critical of phyloseq in particular. In fact, the couple packages I created
as a post-doc (doppelgangR and ffpe) did exactly the same thing, but
unlike
phyloseq have never been used enough for it to make much difference :).
This comparison with phyloseq and metagenomeSeq is just one that I have
personal experience with because I use both packages and it was something
of a revelation to me when I realized that MRExperiment and all other
eSet-derived classes would "just work" as elements of a
MultiAssayExperiment. My motivation in making these slides was mainly to
share my own very recent revelations and try to make life better for new
package developers and their users. A couple comments below:
On Wed, Oct 18, 2017 at 11:36 PM, Paul Joseph McMurdie <joey711 at gmail.com
- Hopefully it is obvious from my description, and also what I imagine to
be Levi's motivation for making the slide deck, but somehow new eager
developers are missing out on this great infrastructure and it isn't
because they want to re-implement core stuff. I sure didn't! I simply
didn't know what was there or best practices for BioC. *A "beginner's
guide to BioC package development"* would have been at the top of my list
of things to read back then.
I think this is still not abundantly clear from
http://bioconductor.org/developers/ and I'm not sure how much it factors
into the package review process. Would it be helpful to have a short
questionnaire for creators of new packages that includes some things not
part of R CMD BiocCheck, or CHECK, like - what new classes do you define,
and what is the purpose of defining them?
[ should have mentioned that the slack channel is #tree-like-se ]
A couple of quick comments on new packages / reviews.
New package submissions have a checklist that people submit with their
package. Since it is not uncommon to find checked items that have not been
satisfied, I'm quite cynical about this as a way of enforcing good
behavior, though maybe it does something to elevate awareness.
One of the check-list items is acknowledge reading the package guidelines
https://bioconductor.org/developers/package-guidelines/
which include an S4 section
https://bioconductor.org/developers/package-guidelines/#classes
and links to general guidance
https://bioconductor.org/developers/how-to/efficient-code/#
re-use-existing-functionality
as well as emphasis on core containers
https://bioconductor.org/developers/how-to/commonMethodsAndClasses/
A relatively humorous pattern is for newer team members to eventually
lament that new packages don't re-use existing classes and methods,
suggesting that more documentation is the solution; usually the litany
above is cited, and the newer team member either tweaks or adds to the
existing documentation to the point that they think it is surely
satisfactory, or sees that more documentation is not an effective answer.
This pattern started again yesterday...
One thing that Joey said, and that I think is true of a lot of new package
contributors, is that they are 'new' developers. This is somewhat ironic,
since they are writing packages that are meant (and sometimes are, as with
phyloseq) to be widely used and influential; it seems somehow appropriate
to keep that channel open, with the domain and other expertise that 'new'
developers bring as a trade-off against the sophistication of their current
development skills. Of course being new developers mean that they make both
'mistakes' and choices that are not entirely consistent with the overall
project, e.g., use of S3 or R6 rather than S4 classes, reimplementation of
existing functionality, adopting unfamiliar API conventions (a minor
favorite of mine is the presentation of a novel method for some aspect of
differential expression, with the primary input a matrix of samples x
'genes' rather than the bioinformatics convention of 'genes' x samples).
The diversity of programming and analysis approaches available in R makes
it not at all surprising that the insights that can come even after many
years of working with Bioconductor are not accessible to many new package
contributors.
I'll also mention my own philosophy (which I sometimes emphasize with
other members of the core team of reviewers) in reviewing packages, which
is to seek incremental gain rather than perfection; sometimes there are
issues other than S4 class use (!) where greater strides can be made.
Martin
- It isn't that I didn't read other established packages. I did. However,
a lot of core BioC tools had gene-expression specific names even for data
classes that were not intrinsically gene expression (e.g. it's actually a
matrix, or related tables) -- and I'm happy many of these now use more
general names like "experiment" or "row". The old names signaled to me
"this isn't for you". And I naively, ignorantly, accepted that at mostly
face value.
- Conversely, sometimes not-inheriting methods is a feature, because it
protects users from doing something that is great and appropriate for one
domain (gene expression) but totally irrational in another (microbiome).
I'm not saying my original implementation made great nuanced decisions
about this -- it has many trappings of a new developer -- but I did have
some pretty naive users in mind with phyloseq, for whom navigating legacy
methods and method names from other domain(s) was expected to be a
hurdle.
Curious to hear thoughts on this.
Something you did well in phyloseq was define methods for all common user
operations, and if you had inherited from eSet, you probably still
would've
wanted to do this - except that more of your methods would've been just
wrappers for inappropriately named eSet methods, and your average user
wouldn't have known the difference. Because of your use of methods and not
direct slot access by users, I think you could still change the underlying
data model without it being noticeable to basic users, even if it expanded
the number of methods actually available. Thankfully, SummarizedExperiment
I think now avoids these "this isn't for you" signals.
- There actually *still isn't core support for evolutionary trees in
BioC* (as
mentioned by Joe Paulson and Ben Callahan in other threads). One of
phyloseq's key contributions was to leverage the fantastic representation
of trees implemented in the CRAN package "ape" in order to support
analysis
techniques popular among microbiome researchers that require a
phylogenetic
tree. The integration in the phyloseq-class and ape is necessarily pretty
deep, including certain row operations. Users also needed a familiar and
simple R interface to manipulate that composite object despite the
complex
hierarchical relationship among rows. Correct me if I'm wrong, but I
think
there is still no core BioC support for representing tree-like or
bio-taxonomy-like hierarchy among rows in a SummarizedExperiment, or
equivalent; and consequently certain row operations may have to be
modified
more deeply than usual if we were to re-implement phyloseq "the right
way".
I'd love to hear thoughts on this.
AFAIK you're right, and I don't know the solution, although I hope it can
be built on SummarizedExperiment. Looking forward to talking more about
this.
Even though phyloseq is at the receiving end, I think the criticism is
fair, and I want current and future new BioC contributors to not re-make
my
mistakes circa 2011-12. I'm happy to help if I can.
Cheers, and thanks for the interesting, collegial thread.
Joey
Thanks Joey, and I do want to say also that I think phyloseq is
responsible
for making Bioconductor a viable and already superior choice for
statistical analysis of microbiome data!
[[alternative HTML version deleted]]