Dear List,

I am creating a package, the purpose of which is to combine data from different microarray platforms. I have found an NCBI GEO data series with 3 different platforms (1 Affymetrix and 2 Illumina) that works well for illustrating my package functions. It would be nice to keep this data series as a data object for use in the function examples in the documentation (currently, 4 of 5 functions use this data object in their example code), but the xz-compressed .rda file (consisting of 3 data frames, one for each data set) is about 5MB (total package size is 6MB). Is this too big?

There are 2 alternatives:

1) The package includes a function to download datasets using the GEOquery package, which could be used to easily re-create the data frames included in my .rda file. The only downside is that it takes several minutes to download all the data, so it may be inconvenient, since this data object is used in example code for the 4 functions.

1a) I could have each function example contain code to either a) download the data and save it in an .RData image file, or b) load the image file saved in a). This way the investigator would only have to endure the download once, unless they chose not to save the data.

2) I could take, say, the first 1000 genes from each platform. I did this, and the combined data only has 19 probes/probesets (they are mapped by Accession/UniGene IDs, and the common transcripts are extracted). It would be nice to have a larger example, although not necessary. Alternatively, I could find a better set of 1000 (or however many), so that more than 19 are present.

Thank you for any assistance,
Peter Bazeley
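For the "better set of 1000" idea in alternative 2, one possible approach is to compute the IDs shared by all platforms first and only then pick the subset, so that every retained ID is guaranteed to be present on every platform. The sketch below uses base R only; the ID vectors, platform names, and sizes are made-up stand-ins for the mapped Accession/UniGene IDs, not data from the actual GEO series.

```r
# Hypothetical mapped IDs for three platforms (illustrative values only).
set.seed(1)
all_ids <- sprintf("Hs.%05d", 1:5000)
platform_ids <- list(
  affy       = sample(all_ids, 3000),
  illumina_a = sample(all_ids, 3000),
  illumina_b = sample(all_ids, 3000)
)

# IDs present on every platform.
common <- Reduce(intersect, platform_ids)

# Keep at most 1000 of the shared IDs, chosen at random.
n_keep <- min(1000, length(common))
keep <- sample(common, n_keep)

# Restrict each platform to the retained IDs.
subset_ids <- lapply(platform_ids, function(ids) ids[ids %in% keep])
```

Taking the first 1000 rows of each platform independently gives no such guarantee, which is consistent with the small overlap of 19 reported above.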
[Bioc-devel] package size
9 messages · Bazeley, Peter, Henrik Bengtsson, Hervé Pagès +2 more
Hi Peter,
On 07/19/2010 09:10 PM, Bazeley, Peter wrote:
Dear List, I am creating a package, the purpose of which is to combine data from different microarray platforms. I have found a NCBI GEO data series with 3 different platforms (1 Affymetrix and 2 Illumina) that works well for illustrating my package functions. It would be nice to keep this data series as a data object for use in the function examples (currently, 4 of 5 functions use this data object in their example code) in the documentation, but the xz compressed .rda file (consisting of 3 data frames, one for each data set) is about 5MB
Hmm, but if they are expression data, then an ExpressionSet would more fully represent the data? See library(GEOquery); ?getGEO with the GSEMatrix option set to TRUE, and http://bioconductor.org/packages/2.6/bioc/html/Biobase.html and the 'An Introduction to Biobase and ExpressionSets' vignette.
(total package size is 6MB). Is this too big? There are 2 alternatives: 1) The package includes a function to download datasets using the GEOquery package, which could be used to easily re-create the data frames included in my .rda file. The only downside is that it takes several minutes to download all the data, so it may be inconvenient, since this data object is used in example code for the 4 functions. 1a) I could have each function example contain code to either a) download the data and save it in an .RData image file, or b) load the image file saved in a). This way the investigator would only have to endure the download once, unless they chose not to save the data. 2) I could take, say, the first 1000 genes from each platform. I did this, and the combined data only has 19 probes/probesets (they are mapped by Accession/UniGene IDs, and the common transcripts are extracted) . It would be nice to have a larger example, although not necessary. Alternatively, I could find a better set of 1000 (or however many), so that more than 19 are present.
A third is to create an experiment data package like those at http://bioconductor.org/packages/release/ExperimentData.html that contains the entire data. This way you get a rich and reproducible example to illustrate your tools. These are really just packages with data objects in the inst/extdata/ (for CEL and other non-R formats) or data/ (for R data objects) directories, and man pages describing the data. Perhaps there is already an experiment data package that meets your needs? Martin
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
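As a rough illustration of the ExpressionSet suggestion above, the sketch below wraps a small expression table and its sample annotation in an ExpressionSet. The toy values, probe names, and sample names are invented for illustration, and the Biobase part only runs if Biobase is installed. The commented-out getGEO() call shows the route mentioned in the message; "GSExxxx" is a placeholder, not the actual series.

```r
# Toy expression values standing in for one platform's data frame
# (rows = probes, columns = samples); all names are hypothetical.
expr_df <- data.frame(
  GSM1 = c(7.1, 8.3, 5.9),
  GSM2 = c(7.4, 8.0, 6.2),
  row.names = c("probe_a", "probe_b", "probe_c")
)

# Sample annotation; row names must match the expression column names.
pheno_df <- data.frame(
  platform = c("Affymetrix", "Affymetrix"),
  row.names = c("GSM1", "GSM2")
)

if (requireNamespace("Biobase", quietly = TRUE)) {
  eset <- Biobase::ExpressionSet(
    assayData = as.matrix(expr_df),
    phenoData = Biobase::AnnotatedDataFrame(pheno_df)
  )
  # Expression matrix and sample annotation now travel together.
  print(dim(Biobase::exprs(eset)))
}

# Directly from GEO (needs a network connection; accession is a placeholder):
# library(GEOquery)
# esets <- getGEO("GSExxxx", GSEMatrix = TRUE)
```

Compared with three bare data frames, this keeps the sample annotation attached to the expression values, which is the sense in which an ExpressionSet "more fully represents" the data.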
Consider also package updates; even if you just do a tiny bug fix, one has to download all that data again. Martin's suggestion to keep a separate experiment data package is a good option. It also makes the data available to others to use in their examples (without having to install your main package's dependencies), e.g. for "competing" methods. /Henrik
Going with Martin's first suggestion, is 37 seconds to download the data too long/inconvenient for an example in the function documentation? This is for the package's main function, and the second of 2 examples, with the first using a smaller/faster to load dataset. The remaining code in this 2nd example takes under 8 seconds, including the code to access the data in the GEOquery object. Of course, the times will vary. My computer has an Intel Core 2 Duo 2.8 GHz, 4GB of RAM, Windows 7.
> sessionInfo()
R version 2.11.1 (2010-05-31)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu95av2cdf_2.6.0   affydata_1.11.10    affy_1.26.1         QuantCombine_0.99.0 GEOquery_2.12.0
[6] RCurl_1.4-2         bitops_1.0-4.1      Biobase_2.8.0

loaded via a namespace (and not attached):
[1] affyio_1.16.0         preprocessCore_1.10.0 tools_2.11.1
Hi Peter,
Download times depend more on the quality of your network connection than on anything else. So for people with slow internet access, those times could be multiplied by 5, or 10, or more... Cheers, H.
Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
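Since download time varies this much between users, the download-once idea from alternative 1a in the original post can be sketched as a small caching helper. This is only an illustration: `cached_fetch` and `slow_download` are hypothetical names, the download is replaced by a stand-in function so the code runs offline, and it uses saveRDS()/readRDS() rather than the .RData image file mentioned in the post.

```r
# Return the cached object if it exists; otherwise fetch it once and cache it.
cached_fetch <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  obj <- fetch()
  saveRDS(obj, cache_file)
  obj
}

# Stand-in for a slow GEO download; counts how often it is actually called.
n_calls <- 0
slow_download <- function() {
  n_calls <<- n_calls + 1
  data.frame(id = c("a", "b"), value = c(1.5, 2.5))
}

cache_file <- file.path(tempdir(), "geo_series_cache.rds")
unlink(cache_file)  # start from a clean state for this demonstration

first  <- cached_fetch(cache_file, slow_download)  # runs the "download"
second <- cached_fetch(cache_file, slow_download)  # served from the cache file
```

This limits the cost to one download per user, but the first run still depends on the network, so it does not remove the reliability concern raised above.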
On 07/21/2010 11:16 AM, Hervé Pagès wrote:
Hi Peter, On 07/20/2010 10:57 PM, Bazeley, Peter wrote:
Going with Martin's first suggestion, is 37 seconds to download the
My first suggestion was more along the lines of 'use ExpressionSet rather than a data.frame or matrix'; sorry to have clouded the water.
data too long/inconvenient for an example in the function documentation? This is for the package's main function, and the second of 2 examples, with the first using a smaller/faster to load dataset. The remaining code in this 2nd example takes under 8 seconds, including the code to access the data in the GEOquery object. Of course, the times will vary. My computer has an Intel Core 2 Duo 2.8 GHz, 4GB of RAM, Windows 7.
Download times depend more on the quality of your network connection than anything else. So for people with a slow internet access, those times could be multiplied by 5, or 10, or more...
I agree; many issues that show up on the mailing list trace to an inability to reliably connect to sites, due to errors on the server end, poor connectivity, local firewalls, ... And in our build system reports, packages regularly show transient internet-related failures. This makes it difficult for the maintainer (and us!) to know whether there's a 'real' problem or not. An experiment data package is additional work, but in exchange you get reproducible, reliable, and documented research. Martin
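To give a feel for the "additional work", here is a minimal sketch of the layout described earlier in the thread: a data/ directory for R objects, inst/extdata/ for non-R files, and a man page describing the data. The package name `myExampleData`, the object `exampleSeries`, and all field values are placeholders, and the DESCRIPTION shown is deliberately incomplete, so this is a starting skeleton rather than something expected to pass R CMD check as is.

```r
# Hypothetical package name and location (a temp dir for this demonstration).
pkg_dir <- file.path(tempdir(), "myExampleData")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_dir, "inst", "extdata"), recursive = TRUE,
           showWarnings = FALSE)
dir.create(file.path(pkg_dir, "man"), showWarnings = FALSE)

# Toy object standing in for the real data series.
exampleSeries <- data.frame(id = c("a", "b"), value = c(1.5, 2.5))
rda_file <- file.path(pkg_dir, "data", "exampleSeries.rda")
save(exampleSeries, file = rda_file, compress = "xz")

# Placeholder DESCRIPTION; a real one needs more fields (e.g. Author, Maintainer).
writeLines(c(
  "Package: myExampleData",
  "Title: Placeholder Title for an Example Data Package",
  "Version: 0.99.0",
  "Description: Placeholder description of the example data.",
  "License: Artistic-2.0"
), file.path(pkg_dir, "DESCRIPTION"))

# Man page describing the data object.
writeLines(c(
  "\\name{exampleSeries}",
  "\\alias{exampleSeries}",
  "\\docType{data}",
  "\\title{Placeholder title for the example data}",
  "\\description{Placeholder description of where the data came from.}",
  "\\usage{data(exampleSeries)}",
  "\\keyword{datasets}"
), file.path(pkg_dir, "man", "exampleSeries.Rd"))
```

With the data in its own package, the software package's examples can load it locally, and a bug-fix release of the software does not force users to fetch the data again.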
Martin, a question regarding this issue just went through my mind,
On Wed, 2010-07-21 at 19:20 -0700, Martin Morgan wrote:
[...]
An experiment data package is additional work, but in exchange you get reproducible, reliable, and documented research.
i haven't found any specific reference to experiment data package submission in http://wiki.fhcrc.org/bioc/HowTo/Package_Contribution and the links from there. should an experiment data package be submitted following the same procedure as a software package? does it also go through a peer-review process? are there requirements or guidelines specific to experiment data packages? thanks! robert.
Hi Robert --
Often experiment data packages are produced much like in Peter's case -- to support analysis in a particular package or group of packages. They (the analysis and data packages) are then subject jointly to the preview process. Sometimes experiment data packages emerge after the fact, when it becomes apparent that work flows or packages would benefit from a common data set. These generally get introduced ad hoc, without formal preview, but of course the packages are being built and passing R CMD check. We don't usually start with a 'pure' experiment data package as a submission -- Bioconductor is not a data repository in that sense -- but if one were to come in we'd treat it as a new package and submit it to the same formal acceptance procedure. Hope that helps, Martin
Sorry, I should have mentioned that that download time corresponded to a rewritten version of the function's example that used GEOquery to download the 3 GSE SuperSeries and extract the expression data from this. I think what I'm going to do is include this example in a vignette, and in the function's documentation examples, use the Dilution data instead. This is what I should have done in the first place to better integrate with existing packages. Thanks for all your input, Pete