Hi *all* package developers,
for the upcoming Bioc release, could you please review which packages
you list under 'Depends' in your DESCRIPTION files?
In many cases, packages listed there are only used occasionally, in a
few rarely called functions. In such cases it is recommended to list
them under 'Suggests' instead and call require("<pkg>") wherever
they are needed. This will decrease the download/installation
footprint.
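As a minimal sketch of the 'Suggests' plus require() pattern (the helper name useSuggested() and the function rareSummary() are made up for illustration, and the base package 'stats' only stands in for a heavy optional package), the guard could look something like this:

```r
# Hypothetical helper: attach a package listed under 'Suggests' only when needed.
useSuggested <- function(pkg) {
  # require() returns FALSE (with a warning) if the package cannot be loaded
  ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
  if (!ok) {
    stop("Package '", pkg, "' is needed for this function; please install it.",
         call. = FALSE)
  }
  invisible(TRUE)
}

# A rarely called function that needs the optional package only when it runs.
# 'stats' is a stand-in here for a large optional dependency.
rareSummary <- function(x) {
  useSuggested("stats")
  median(x)
}

rareSummary(c(3, 1, 2))
```

Users who never call the rarely used function never need the optional package installed; those who do get a clear error message if it is missing.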
Without picking on a particular package (I've used a different package
before), here is an illustrative example involving several packages
with large package footprints:
In order to use the runBioHMM() segmentation method in the snapCGH package:
Package: snapCGH
Depends: limma, tilingArray, DNAcopy, GLAD, cluster, methods, aCGH
Suggests:
Imports:
this is what you need to download and install (illustrated package by package):
Hi Henrik,
Good points. I would also like people to consider Imports for things
that are infrastructure and hence should not end up on the search path.
And to answer your last point - the number of dependencies and where
they are is something we do check (as several recent submitters can
attest). We also review packages that are in the repository for a
number of things (excessive dependencies being one of them).
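To illustrate the point about the search path, here is a small sketch. It uses the base package 'tools' purely as a stand-in for an infrastructure package, and it calls the function through the namespace with '::', which is only an approximation of what code in a package that imports it would do:

```r
# Record what is attached before calling into another package's namespace.
attachedBefore <- search()

# '::' loads the 'tools' namespace (if needed) without attaching it.
ext <- tools::file_ext("report.txt")

# The search path is unchanged, yet the function was usable.
unchanged <- identical(attachedBefore, search())
print(unchanged)
```

With 'Depends', by contrast, the package is attached and so becomes visible on the user's search path.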
best wishes
Robert
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
Henrik,
I agree that the installation footprint is (and has been) an issue, and I would
think that it could be addressed more efficiently than by going the "less
dependencies" route.
Having code and data mixed into one package, and the absence of a common
recommended data source for data types, are probably responsible for a
much larger fraction of the footprint than the dependencies are.
Regarding the example you picked this time, it can be seen as
disputable. "snapCGH" is clearly advertised as an "umbrella" package,
unifying several other aCGH-related packages under one common framework.
Should one still be interested in the package, one should also expect
that it is going to require other packages (and their respective
dependencies in turn), shouldn't one?
As a side note, it can seem difficult to objectively judge what
functions are "rarely called" in a package without actual usage data.
L.
On Wed, 2008-10-15 at 13:11 -0700, Henrik Bengtsson wrote:
[original message snipped; see the top of the thread]
On Thu, Oct 16, 2008 at 12:42 AM, laurent <lgautier at gmail.com> wrote:
> Henrik,
> I agree that the installation footprint is (and has been) an issue, and I would
> think that it could be addressed more efficiently than by going the "less
> dependencies" route.
I wouldn't say less "dependencies", but rather "differently
classified" dependencies ("Suggests" could read "Optional" and
"Depends" could read "Critical"; my view).
> Having code and data mixed into one package, and the absence of a common
> recommended data source for data types, are probably responsible for a
> much larger fraction of the footprint than the dependencies are.
I definitely agree. My personal take on this is that for most
algorithms the developer could provide:
(i) a high-level function taking objects of more complex classes (e.g.
eSet, ExpressionSet, AffyBatch, RGList, ...), which then utilizes:
(ii) a low-level function operating on basic data types (e.g. vectors
and matrices) and that implements the actual algorithm.
The (i) functions target end users and higher-level calls from other
packages, and the (ii) functions mainly target other developers (and some users).
The API of the low-level functions is less likely to change, whereas
the high-level API tends to change more often (as new classes are
introduced). Several packages already have an internal low-level
interface, but it is not always explicit which functions these are or
that they're part of a supported API (namespaces sometimes clarify
this).
If all low-level functions are put in a separate package, then we
achieve part of what you suggest. It will also be easier to maintain the
low-level as well as the high-level functions, and it will be easier for
others to contribute, say, optimized code for the actual algorithm
(which is in the low-level API).
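A rough sketch of the (i)/(ii) split described above. The function names are made up for illustration, median centering is just a placeholder algorithm, and a plain list with an 'exprs' matrix stands in for a richer class such as ExpressionSet:

```r
# (ii) Low-level: works on a plain numeric matrix and implements the algorithm.
medianCenter <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  colMedians <- apply(x, 2, median, na.rm = TRUE)
  sweep(x, 2, colMedians)  # subtract each column's median from that column
}

# (i) High-level: accepts a richer object and delegates to the low-level function.
# A plain list with an 'exprs' matrix is a stand-in for e.g. an ExpressionSet.
medianCenterSet <- function(obj) {
  obj$exprs <- medianCenter(obj$exprs)
  obj
}

toy <- list(exprs = matrix(c(1, 2, 3, 10, 20, 30), ncol = 2), name = "toy")
medianCenterSet(toy)$exprs
```

Only the high-level wrapper needs to know about the container class, so the low-level function could live in a small package with few or no dependencies.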
> Regarding the example you picked this time, it can be seen as
> disputable. "snapCGH" is clearly advertised as an "umbrella" package,
> unifying several other aCGH-related packages under one common framework.
> Should one still be interested in the package, one should also expect
> that it is going to require other packages (and their respective
> dependencies in turn), shouldn't one?
Yes, and the life of Bioconductor packages is a dynamic process.
Maybe I picked a bad example this time, and it might be that all the
packages are heavily used most of the time by snapCGH. My entry point
to snapCGH was the Bioinformatics Application Note (Marioni et al
2006) on the BioHMM model, and the umbrella features of snapCGH are not
the main point in that package. This might still illustrate a use
case as well as the usefulness of a high-/low-level package split (and that the
BioHMM model would benefit from being available in a package of its own).
See also my comments on the frequently used function smoothScatter()
in "heavy-weight" geneplotter, cf.
https://stat.ethz.ch/pipermail/bioc-devel/2008-July/001640.html
> As a side note, it can seem difficult to objectively judge what
> functions are "rarely called" in a package without actual usage data.
I should have used "optionally" instead. It is often the developer
who knows when and where certain packages are used, and in several
cases only some of the loaded packages are used in any given R
session. For example, in aroma.affymetrix, we only utilize the
'EBImage' package for generating PNGs showing data spatially. Not all
people do this, and those who do, do not necessarily do it in every session. Since
this is "rarely"/"optionally" done, the EBImage package is under
'Suggests' and is loaded with require() whenever needed.
Our dependency of EBImage illustrates another argument, which is
software robustness. EBImage was for several months broken on Windows
(I'm glad to see that it is now fixed), and with a hard dependency
aroma.affymetrix would have been impossible to install and use on
Windows during that time. In this case I guess the reason was that
the original developer handed over the maintenance of EBImage, which
caused some startup delays for the new maintainer; these are things we
are always going to face with most packages at some stage or
other. As developers we can somewhat protect ourselves *and
downstream developers/users* against this by using the
'Suggests/Imports' fields.
Finally, if one wants to install all packages including those in
'Suggests', one can do:
biocLite("<pkg/group>", dependencies=c("Depends", "Suggests", "Imports"))
Cheers
Henrik
> Finally, if one wants to install all packages including those in
> 'Suggests', one can do:
> biocLite("<pkg/group>", dependencies=c("Depends", "Suggests", "Imports"))
Note that currently, when I do:
biocLite("<somepackage>", dependencies=TRUE) # equivalent to the above
I end up with tons of packages that I'm not necessarily interested in.
This is because the behaviour of install.packages() (the backend of biocLite),
when "Suggests" is specified in the 'dependencies' argument, is to follow
the Suggests field recursively. This is obviously the right thing to do for
the Depends and Imports fields, but IMO the Suggests field should be treated
differently, i.e. only packages *directly* suggested by <somepackage> should be
installed (+ the packages that those suggested packages depend on or import,
recursively, of course), but not the packages that the suggested packages
suggest :-P
It could be that this behaviour of install.packages() is partly responsible
for the excessive number of installed packages that users sometimes experience...
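To make the proposed rule concrete, here is a toy sketch. It does not show how install.packages() is implemented; the dependency table, the package names A to F and the function names are all hypothetical. Hard dependencies (Depends/Imports) are followed recursively, while 'Suggests' is followed only one level from the requested package:

```r
# Toy dependency table with hypothetical packages (not real ones).
# 'hard' = Depends/Imports, 'suggests' = Suggests.
db <- list(
  A = list(hard = c("B"),        suggests = c("C")),
  B = list(hard = character(0),  suggests = c("D")),
  C = list(hard = c("E"),        suggests = c("F")),
  D = list(hard = character(0),  suggests = character(0)),
  E = list(hard = character(0),  suggests = character(0)),
  F = list(hard = character(0),  suggests = character(0))
)

# Follow hard dependencies recursively, starting from 'pkgs'.
hardClosure <- function(pkgs, db) {
  seen <- character(0)
  queue <- pkgs
  while (length(queue) > 0) {
    p <- queue[1]
    queue <- queue[-1]
    if (p %in% seen) next
    seen <- c(seen, p)
    queue <- c(queue, db[[p]]$hard)
  }
  seen
}

# Proposed rule: the package itself, its *direct* Suggests, and the
# recursive hard dependencies of all of those; no Suggests of Suggests.
toInstall <- function(pkg, db) {
  direct <- db[[pkg]]$suggests
  sort(unique(hardClosure(c(pkg, direct), db)))
}

toInstall("A", db)
```

In this toy table, D (suggested by B) and F (suggested by C) are left out, whereas following 'Suggests' recursively would pull in all six packages.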
Cheers,
H.