[Bioc-devel] package size

9 messages · Bazeley, Peter, Henrik Bengtsson, Hervé Pagès +2 more

#
Dear List,

I am creating a package, the purpose of which is to combine data from different microarray platforms. I have found an NCBI GEO data series with 3 different platforms (1 Affymetrix and 2 Illumina) that works well for illustrating my package functions. It would be nice to keep this data series as a data object for use in the function examples in the documentation (currently, 4 of 5 functions use this data object in their example code), but the xz-compressed .rda file (consisting of 3 data frames, one for each data set) is about 5MB (total package size is 6MB). Is this too big?

There are 2 alternatives:

1) The package includes a function to download datasets using the GEOquery package, which could be used to easily re-create the data frames included in my .rda file. The only downside is that it takes several minutes to download all the data, so it may be inconvenient, since this data object is used in example code for the 4 functions.

1a) I could have each function example contain code to either a) download the data and save it in an .RData image file, or b) load the image file saved in a). This way the investigator would only have to endure the download once, unless they chose not to save the data.

2) I could take, say, the first 1000 genes from each platform. I did this, and the combined data only has 19 probes/probesets (they are mapped by Accession/UniGene IDs, and the common transcripts are extracted). It would be nice to have a larger example, although not necessary. Alternatively, I could find a better set of 1000 (or however many), so that more than 19 are present.
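
For illustration, alternative 1a (download once, then reuse a saved image file) might look roughly like the sketch below. The names `load_or_fetch`, `fetch_fun`, `fake_fetch`, and `geo_data`, as well as the cache file location, are hypothetical and not part of any package discussed here. The stand-in fetch function only exists so that the sketch runs without a network connection; in real use it would be replaced by the slow GEOquery-based download.

```r
# Sketch: load a cached copy if one exists, otherwise fetch the data and cache it.
# All names here are hypothetical placeholders.
load_or_fetch <- function(cache_file, fetch_fun) {
  if (file.exists(cache_file)) {
    env <- new.env()
    load(cache_file, envir = env)      # restores the object saved below
    return(get("geo_data", envir = env))
  }
  geo_data <- fetch_fun()              # slow step, e.g. a GEOquery download
  save(geo_data, file = cache_file, compress = "xz")
  geo_data
}

# Stand-in for the real download so the sketch runs offline.
fake_fetch <- function() {
  list(affy = data.frame(id = c("NM_1", "NM_2"), value = c(1.5, 2.5)))
}

cache_file <- file.path(tempdir(), "geo_example.RData")
first  <- load_or_fetch(cache_file, fake_fetch)  # fetches and saves (if no cache yet)
second <- load_or_fetch(cache_file, fake_fetch)  # loads from the cache file
```

With this pattern the user only pays the download cost on the first call, provided the cache file is kept between sessions (a file under `tempdir()` is not; a real example would need a persistent path).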


Thank you for any assistance,
Peter Bazeley
#
Hi Peter,
On 07/19/2010 09:10 PM, Bazeley, Peter wrote:
Hmm, but if they are expression data, then an ExpressionSet would more
fully represent the data? See library(GEOquery); ?getGEO with the
GSEMatrix option set to TRUE, and

  http://bioconductor.org/packages/2.6/bioc/html/Biobase.html

and the 'An Introduction to Biobase and ExpressionSets' vignette.
A third is to create an experiment data package like those at

  http://bioconductor.org/packages/release/ExperimentData.html

that contains the entire data. This way you get a rich and reproducible
example to illustrate your tools. These are really just packages with
data objects in the inst/extdata/ (for CEL and other non-R formats) or
data/ (for R data objects) directories, and man pages describing the data.
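
As a rough sketch of the layout described above, the following base R code creates the directories of a minimal data package in a temporary location and saves one xz-compressed object into data/. The package name `myExampleData` and the object `exampleExprs` are hypothetical, and the random matrix is only a placeholder for real expression values. A real package would also need a DESCRIPTION file and man pages, which are not written here.

```r
# Sketch: directory layout of a minimal experiment data package.
# "myExampleData" and `exampleExprs` are hypothetical names.
pkg_dir <- file.path(tempdir(), "myExampleData")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_dir, "man"), showWarnings = FALSE)
dir.create(file.path(pkg_dir, "inst", "extdata"), recursive = TRUE,
           showWarnings = FALSE)

# Placeholder data standing in for real expression values.
set.seed(1)
exampleExprs <- matrix(rnorm(2000), nrow = 500,
                       dimnames = list(paste0("probe", 1:500),
                                       paste0("sample", 1:4)))

# R data objects go in data/, saved here with xz compression.
rda_file <- file.path(pkg_dir, "data", "exampleExprs.rda")
save(exampleExprs, file = rda_file, compress = "xz")

# tools::checkRdaFiles() reports the on-disk size and compression of the file.
rda_info <- tools::checkRdaFiles(rda_file)
print(rda_info)
```

Checking the reported size is a quick way to see how much a data object would add to a package before deciding where it should live.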

Perhaps there is already an experiment data package that meets your needs?

Martin
#
Consider also package updates; even if you just do a tiny bug fix,
users then have to download all that data again.

Martin's suggestion to keep a separate experiment data package is a
good option.  It will also make the data available to others to use
in their examples (without having to install your main package's
dependencies), e.g. for "competing" methods.

/Henrik
On Tue, Jul 20, 2010 at 6:31 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
#
Going with Martin's first suggestion, is 37 seconds to download the data too long/inconvenient for an example in the function documentation? This is for the package's main function, and the second of 2 examples, with the first using a smaller/faster to load dataset. The remaining code in this 2nd example takes under 8 seconds, including the code to access the data in the GEOquery object.

Of course, the times will vary. My computer has an Intel Core 2 Duo 2.8 GHz, 4GB of RAM, Windows 7.
R version 2.11.1 (2010-05-31) 
i386-pc-mingw32 

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] hgu95av2cdf_2.6.0   affydata_1.11.10    affy_1.26.1         QuantCombine_0.99.0 GEOquery_2.12.0    
[6] RCurl_1.4-2         bitops_1.0-4.1      Biobase_2.8.0      

loaded via a namespace (and not attached):
[1] affyio_1.16.0         preprocessCore_1.10.0 tools_2.11.1
#
Hi Peter,
On 07/20/2010 10:57 PM, Bazeley, Peter wrote:
Download times depend more on the quality of your network connection
than anything else. So for people with a slow internet access, those
times could be multiplied by 5, or 10, or more...

Cheers,
H.
#
On 07/21/2010 11:16 AM, Hervé Pagès wrote:
My first suggestion was more along the lines of 'use ExpressionSet
rather than a data.frame or matrix'; sorry to have clouded the water.
I agree; there are lots of issues that show up on the mailing list that
trace to inability to reliably connect to sites, due to errors on the
server end, poor connectivity, local firewalls, ... And in our build
system reports, packages regularly show transient internet-related
failures. This makes it difficult for the maintainer (and us!) to know
whether there's a 'real' problem or not.

An experiment data package is additional work, but in exchange you get
reproducible, reliable, and documented research.

Martin
#
Martin, a question regarding this issue just went through my mind,
On Wed, 2010-07-21 at 19:20 -0700, Martin Morgan wrote:
[...]
i haven't found any specific reference to experiment data package
submission in

http://wiki.fhcrc.org/bioc/HowTo/Package_Contribution

and links thereafter.


should an experiment data package be submitted following the same
procedure as a software package? does it go also through a peer-review
process? are there requirements or guidelines specific for experiment
data packages?

thanks!
robert.
#
Hi Robert --
On 07/21/2010 11:57 PM, Robert Castelo wrote:
Often experiment data packages are produced much like in Peter's case --
to support analysis in a particular package or group of packages. They
(the analysis and data packages) are then subject jointly to the preview
process.

Sometimes experiment data packages emerge after the fact, when it
becomes apparent that work flows or packages would benefit from a common
data set. These generally get introduced ad hoc, without formal preview,
but of course the packages are being built and passing R CMD check.

We don't usually start with a 'pure' experiment data package as a
submission -- Bioconductor is not a data repository in that sense -- but
if one were to come in we'd treat it as a new package and submit it to
the same formal acceptance procedure.

Hope that helps,

Martin
#
Sorry, I should have mentioned that that download time corresponded to a rewritten version of the function's example that used GEOquery to download the 3 GSE SuperSeries and extract the expression data from this. I think what I'm going to do is include this example in a vignette, and in the function's documentation examples, use the Dilution data instead. This is what I should have done in the first place to better integrate with existing packages.

Thanks for all your input,
Pete