Dear all, I am trying to run
BiocCheck("/home/package_name", list(`no-check-vignettes`=TRUE))
There is no error but I am getting the following warning.
$warning
[1] "The following files are over 5MB in size: 'dataset/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa'....."
This file, as well as other data such as the .gff files used for the reference-based assembly, is much larger than 5 MB.
But the total package size is less than 500 MB.
Is it essential that each file within the package is less than 5 MB? If so, it would be very kind if anyone could suggest how to reduce the size of the genomic data files.
Waiting for your valuable suggestions.
Thanking you.
With regards, Claris Baby
[Bioc-devel] BiocCheck - warning: files are over 5MB
5 messages · Claris Baby, Pariksheet Nanda, Martin Morgan +1 more
Hi Claris,
On Sat, Mar 10, 2018 at 2:49 AM, Claris Baby via Bioc-devel <bioc-devel at r-project.org> wrote:
[1] "The following files are over 5MB in size: 'dataset/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa'....." This file, as well as other data such as the .gff files used for the reference-based assembly, is much larger than 5 MB. But the total package size is less than 500 MB.
Assuming that's not a typo, 500 MB is very large and inappropriate for a package. It's generally good practice to separate code and data where possible, not least because bundled data bloats the code's version control history. If your package size is close to 500 MB, you should think about stashing the data elsewhere and accessing it using something like AnnotationHub or BiocFileCache. (Others on the mailing list might have better and more specific suggestions, as I've not yet had to deal with this particular problem, if you confirm that the package is indeed that big.)
Is it essential that each file within the package is less than 5 MB? If so, it would be very kind if anyone could suggest how to reduce the size of the genomic data files.
Can you gzip-compress those data files? Text-based files usually compress quite well, and many functions, like import() from rtracklayer, will automagically decompress them, so you might not even need to change much in your code. .gz isn't the most disk efficient compression algorithm out there: .bz2 usually compresses better, and .xz typically yields even better lossless compression (R's save() uses gzip by default and supports both alternatives through its compress argument). But for the cross-platform compatibility that Bioconductor strives for, .gz might be best to try first.
Claris Baby
Pariksheet
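As a rough illustration of the gzip suggestion above, here is a minimal base R sketch. The file contents and the names fasta_lines, plain_path and gz_path are made up for the demo (a real genome file will compress by a different ratio); it only shows that a gzip-compressed text file can be written with gzfile() and read back through an ordinary file path.

```r
# Stand-in for a real FASTA file: a header plus two repetitive sequence lines.
fasta_lines <- c(">chrI_demo", strrep("ACGT", 250), strrep("TTGA", 250))

# Write the uncompressed version to a temporary file.
plain_path <- tempfile(fileext = ".fa")
writeLines(fasta_lines, plain_path)

# Write the same content through a gzfile() connection (gzip format).
gz_path <- paste0(plain_path, ".gz")
con <- gzfile(gz_path, open = "wt")
writeLines(fasta_lines, con)
close(con)

# Compare the sizes on disk (in bytes): plain first, gzip second.
sizes <- file.size(c(plain_path, gz_path))
print(sizes)

# readLines() on a path opens a file() connection, which detects gzip
# compression when reading, so no explicit decompression step is needed.
round_trip <- readLines(gz_path)
print(identical(round_trip, fasta_lines))
```

Whether a particular reader function accepts .gz input directly should still be checked in that function's documentation.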
disk efficient compression algorithm
Whoops, meant to say compression format. Pariksheet
On 03/10/2018 09:03 AM, Pariksheet Nanda wrote:
Assuming that's not a typo, 500 mb is very large and inappropriate for a package. It's generally good practice to separate code and data where possible, not least because it bloats code version control. If your package size is close to 500 mb, you should think about stashing the data and accessing it using something like the AnnotationHub or BiocFileCache
Yes, large files should be made available by a package that uses AnnotationHub or ExperimentHub for the resources. Also, it's often possible to re-use existing resources and, in a vignette, to _illustrate_ package functionality rather than redo a complete 'real' analysis. See
http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/CreateAnAnnotationPackage.html
Martin
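The general pattern behind these suggestions (keep large files out of the package, fetch them on first use, and reuse the local copy afterwards) can be sketched in base R. This is not the AnnotationHub, ExperimentHub or BiocFileCache API; fetch_cached() is a hypothetical helper written only for this illustration, and the file:// URL below stands in for a real remote location.

```r
# Hypothetical helper (not part of any Bioconductor package):
# download `url` into `cache_dir` once, then reuse the local copy.
fetch_cached <- function(url, cache_dir = file.path(tempdir(), "demo_cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))
  if (!file.exists(dest)) {
    # Only reached on the first request for this file.
    download.file(url, dest, mode = "wb", quiet = TRUE)
  }
  dest
}

# Demo: a small local file served through a file:// URL plays the role
# of a large remote resource (assumes a Unix-like absolute path).
src <- tempfile(fileext = ".gff")
writeLines("##gff-version 3", src)
src_url <- paste0("file://", src)

first_path  <- fetch_cached(src_url)  # copies the file into the cache
second_path <- fetch_cached(src_url)  # finds the cached copy, no download
print(readLines(first_path))
```

The hub and cache packages named above add what this sketch lacks, such as persistent cache locations and metadata about each resource, so they remain the better choice for a real package.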
Good day,
You could make use of the package named BSgenome.Celegans.UCSC.ce11. It contains the DNA sequences of all of the chromosomes of the roundworm and doesn't add any size to your package.
--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
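A minimal sketch of that approach, assuming BSgenome.Celegans.UCSC.ce11 is installed (the code is guarded so it does nothing harmful otherwise). The variable names are illustrative, and note that the UCSC package names the sequence "chrI", whereas the Ensembl file in the original question uses "I".

```r
# Check for the package without failing when it is absent.
has_pkg <- requireNamespace("BSgenome.Celegans.UCSC.ce11", quietly = TRUE)

if (has_pkg) {
  library(BSgenome.Celegans.UCSC.ce11)
  genome <- BSgenome.Celegans.UCSC.ce11

  # List the available sequences (UCSC-style names such as "chrI").
  print(seqnames(genome))

  # Load chromosome I as a DNAString; nothing is stored in your own package.
  chrI <- genome[["chrI"]]
  print(length(chrI))
} else {
  message("BSgenome.Celegans.UCSC.ce11 is not installed; skipping the demo.")
}
```

In a package, the genome package would then be listed as a dependency instead of shipping the FASTA file.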