[Bioc-devel] Package size limitation

4 messages · Patrick Aboyoun, Henrik Bengtsson, Hervé Pagès +1 more

#
Tobias,
To elaborate on Kasper's well-stated points, the Bioconductor project 
has separate repositories for software, experiment data, and annotation 
metadata.

BioC software:  http://bioconductor.org/packages/release/bioc/
BioC experiment data:  
http://bioconductor.org/packages/release/data/experiment/
BioC annotation metadata:  
http://bioconductor.org/packages/release/data/annotation/

The main criterion for an experiment data package is that it should be 
novel enough that other software developers will find it useful for 
illustrating concepts in their own software packages. More 
often than not, including a subset of your data in your software package 
will suffice. One common misconception by new package developers is that 
examples in the man pages and the vignettes need to be 100% "real". The 
main goal of vignettes and man pages is to illustrate concepts rather 
than reveal scientific findings. You are encouraged to provide 
references to scientific papers that demonstrate the latter within your 
software's documentation, but typically end-users want your package to 
have a small storage footprint on their machine and have the examples 
run in a short time frame.
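
As a minimal sketch of the "subset of your data" idea, the R code below cuts a large matrix down to a small example object and saves it compressed. The data are simulated, and the names `full`, `example_data` and the chosen dimensions are assumptions made only for illustration; in a real package the saved file would live in the package's data/ directory rather than a temporary directory.

```r
# Hypothetical full data set: 20,000 probes x 50 samples (simulated, not real data).
set.seed(1)
full <- matrix(rnorm(20000 * 50), nrow = 20000,
               dimnames = list(paste0("probe", 1:20000),
                               paste0("sample", 1:50)))

# Keep a small random subset of probes and the first few samples.
keep_rows <- sort(sample(nrow(full), 500))
example_data <- full[keep_rows, 1:6]

# Fraction of the original in-memory size that the subset occupies.
size_fraction <- as.numeric(object.size(example_data)) /
  as.numeric(object.size(full))
size_fraction

# Save with xz compression; a temporary directory stands in for data/.
out_file <- file.path(tempdir(), "example_data.rda")
save(example_data, file = out_file, compress = "xz")
```

A subset like this is usually enough for man page examples and vignettes, since they only need to illustrate how the functions are called.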

If you are not sure how to handle your particular situation, when the 
Bioconductor team previews and reviews your package, we will help you 
through any tricky decisions. Good luck with your package submission and 
thanks for your interest in the Bioconductor project!


Patrick
Kasper Daniel Hansen wrote:
#
Chime!

Avoid putting large data sets in otherwise small packages (how big is
your package without data?).  Put large example data in a separate
experiment data package which is optional to load.  Try to minimize
the amount of download and the number of dependent packages that other
users/developers need to actually use your new method.  That
increases the chances that your method is used elsewhere as well.
Updates will also be faster to install.
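
One way to keep a data package optional, sketched here under assumptions: list it under Suggests in the DESCRIPTION file and check for it at run time. The package name "myMethodData" is hypothetical, so on most machines this code takes the fallback branch.

```r
# "myMethodData" is a hypothetical experiment data package name.
data_pkg <- "myMethodData"

if (requireNamespace(data_pkg, quietly = TRUE)) {
  # The data package is installed: examples can use the larger data set.
  message("Using example data from ", data_pkg)
  have_full_data <- TRUE
} else {
  # Not installed: fall back to a small example shipped with the software.
  message(data_pkg, " is not installed; using the small built-in example")
  have_full_data <- FALSE
}
```

With this pattern the software package installs and runs without the data package, and only users who want the larger examples need to download it.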

If you put together an experiment data package, please consider using
the CEL files and not AffyBatch objects.  The AffyBatch structure
might be obsolete one day and your experiment package with it.  This
is less likely to happen if you use CEL files - the common
denominator for all data structures/classes.

Now to a trick: If you do want to distribute an AffyBatch object, have
a look at your intensities.  If your chip type is a 3x3-pixel-per-probe
array, and the Affymetrix image analysis (typically) took the
75% quantile (7th ordered pixel), you will actually see only integer
probe signals.  Note that this is not a rounding error; it just happens
"by chance".  If this is the case with your data, you could create an
object holding the signals as integers and not doubles without losing
anything.  That object would be roughly half the size.  I don't think
compression algorithms can pick this up.
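
A minimal sketch of that check and conversion, using a plain matrix of simulated intensities rather than a real AffyBatch (the names `intensities` and `compact` are made up for this example):

```r
# Simulated probe intensities stored as doubles but holding whole numbers only.
set.seed(1)
intensities <- matrix(as.double(sample(65535L, 10000, replace = TRUE)),
                      nrow = 1000)

# Convert only if every value is finite, whole, and within integer range.
is_whole <- all(is.finite(intensities)) &&
  all(intensities == round(intensities)) &&
  all(abs(intensities) <= .Machine$integer.max)

compact <- intensities
if (is_whole) storage.mode(compact) <- "integer"

typeof(compact)

# No information is lost, and the in-memory size drops to roughly half.
lossless <- all(compact == intensities)
size_ratio <- as.numeric(object.size(compact)) /
  as.numeric(object.size(intensities))
c(lossless = lossless, size_ratio = size_ratio)
```

The guard matters: if any signal has a fractional part or is missing, the matrix is left as doubles instead of being silently truncated.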

Cheers

Henrik
On Fri, Aug 1, 2008 at 9:27 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
#
Hi Henrik,

Quoting Henrik Bengtsson <hb at stat.berkeley.edu>:
[...]
Yes they do:

   > xx <- sample(1000000L, 5000000, replace=TRUE)
   > typeof(xx)
   [1] "integer"
   > object.size(xx)
   [1] 20000040
   > yy <- as.double(xx)
   > typeof(yy)
   [1] "double"
   > object.size(yy)
   [1] 40000040
   > save(xx, file="xx.rda")
   > save(yy, file="yy.rda")

Then from the shell:

   hpages at lamprey:~> ls -lh *.rda
   -rw-r--r-- 1 hpages compbio 16M 2008-08-01 10:19 xx.rda
   -rw-r--r-- 1 hpages compbio 18M 2008-08-01 10:19 yy.rda
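
The same comparison can be made without leaving R. This is a sketch only: it uses a smaller vector and temporary files (both assumptions, not part of the session above) and reports ratios rather than absolute sizes.

```r
# Smaller version of the comparison above, written to temporary files.
set.seed(1)
xx <- sample(1000000L, 500000, replace = TRUE)
yy <- as.double(xx)

f_int <- tempfile(fileext = ".rda")
f_dbl <- tempfile(fileext = ".rda")
save(xx, file = f_int)   # save() compresses by default
save(yy, file = f_dbl)

# Ratio of double to integer size, in memory and on disk.
mem_ratio  <- as.numeric(object.size(yy)) / as.numeric(object.size(xx))
disk_ratio <- file.size(f_dbl) / file.size(f_int)
c(memory = mem_ratio, disk = disk_ratio)
```

In the listing above the saved files differ far less (16M versus 18M) than the in-memory objects do, which is the point of the comparison.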

H.
#
Thanks everyone, I found what I needed!

Cheers,

------------------------------
Tobias Guennel
Research Assistant
Department of Biostatistics
Virginia Commonwealth University
Theater Row 3035F
804-828-2527

