[Bioc-devel] BiocCheck - warning: files are over 5MB

5 messages · Claris Baby, Pariksheet Nanda, Martin Morgan +1 more

#
Dear all,

I am trying to run
BiocCheck("/home/package_name", list(`no-check-vignettes`=TRUE))
There is no error, but I am getting the following warning:
$warning
[1] "The following files are over 5MB in size: 'dataset/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa'....."
This file, as well as other data such as the .gff files used for the reference-based assembly, is much larger than 5 MB.
However, the total package size is less than 500 MB.
Is it essential that each file within the package be smaller than 5 MB? If so, I would be grateful if anyone could suggest how to reduce the size of the genomic data files.

Waiting for your valuable suggestions.
Thanking you.
With regards,
Claris Baby
#
Hi Claris,

On Sat, Mar 10, 2018 at 2:49 AM, Claris Baby via Bioc-devel <
bioc-devel at r-project.org> wrote:
Assuming that's not a typo, 500 MB is very large and inappropriate for a
package.  It's generally good practice to separate code and data where
possible, not least because bundled data bloats the version control
history of the code.  If your package size is close to 500 MB, you should
think about stashing the data elsewhere and accessing it using something
like AnnotationHub or BiocFileCache (some others on the mailing list
might have better and more specific suggestions as I've not yet had to
deal with this particular problem, if you confirm that the package is
indeed that big).
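
A minimal sketch of the caching approach, assuming the BiocFileCache package is installed and the data file is hosted somewhere public. The helper name get_reference_file, the cache location and the URL are all hypothetical; tools::R_user_dir() needs R >= 4.0.

```r
# Hypothetical helper: return a local path for a remote file,
# downloading it into a per-user cache on first use only.
# Calls are namespace-qualified so BiocFileCache is only needed at call time.
get_reference_file <- function(url,
                               cache_dir = tools::R_user_dir("package_name",
                                                             which = "cache")) {
  bfc <- BiocFileCache::BiocFileCache(cache_dir, ask = FALSE)
  # bfcrpath() returns the cached path, adding the resource if it is missing
  BiocFileCache::bfcrpath(bfc, url)
}

# Usage (the URL is a placeholder, not a real location):
# fa_path <- get_reference_file("https://example.org/path/to/chromosome.I.fa.gz")
```

With this pattern the package itself ships only code, and the large file is fetched once per user rather than stored in the repository.
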
Can you gzip-compress those data files?  Text-based files usually compress
quite well and many functions like import() from rtracklayer will
automagically decompress them so you might not even need to change much in
your code.
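
As a rough illustration of the gzip suggestion, here is a sketch in base R only. It uses a small toy FASTA file written to a temporary directory (not the real chromosome file) and shows that readLines() reads the compressed copy without any explicit decompression step. For multi-megabyte files, a command-line gzip or R.utils::gzip() would be more practical than reading everything into memory as done here.

```r
# Toy FASTA file standing in for the real data (assumption: the real
# files are plain text, like .fa and .gff)
fa <- tempfile(fileext = ".fa")
writeLines(c(">chrI_toy", rep(strrep("ACGT", 15), 500)), fa)

# Write a gzip-compressed copy next to it
gz <- paste0(fa, ".gz")
con <- gzfile(gz, "w")
writeLines(readLines(fa), con)
close(con)

# Compare sizes on disk, in bytes
c(original = file.size(fa), compressed = file.size(gz))

# Reading the .gz file needs no code changes: file connections opened
# for reading detect gzip compression automatically
same <- identical(readLines(gz), readLines(fa))
```
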

.gz isn't the most disk efficient compression algorithm out there; .bz2
usually compresses better and is one of the formats R natively supports
for save() and load() of .RData files (the save() default is gzip), and
.xz typically yields even better lossless compression but, for the
cross-platform compatibility that Bioconductor strives for, using .gz
might be best to try first.

Pariksheet
#
Whoops, meant to say compression format.

Pariksheet
#
On 03/10/2018 09:03 AM, Pariksheet Nanda wrote:
Yes, large files should be made available by a package that uses
AnnotationHub or ExperimentHub for the resources. Also, it's often
possible to re-use existing resources and, in a vignette, to
_illustrate_ package functionality rather than redo a complete 'real'
analysis.

See

 
http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/CreateAnAnnotationPackage.html

Martin
#
Good day,

You could make use of the package named BSgenome.Celegans.UCSC.ce11. It contains the DNA sequences of all of the chromosomes of the roundworm and doesn't add any size to your package.
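
A short sketch of how that might look, assuming BSgenome.Celegans.UCSC.ce11 is installed (the code is guarded so it does nothing otherwise). Note that this package uses UCSC-style sequence names such as "chrI", whereas the Ensembl FASTA file names its chromosomes differently, so some renaming may be needed in your code.

```r
pkg <- "BSgenome.Celegans.UCSC.ce11"

if (requireNamespace(pkg, quietly = TRUE)) {
  suppressPackageStartupMessages(library(BSgenome.Celegans.UCSC.ce11))

  genome <- BSgenome.Celegans.UCSC.ce11

  # List the available sequences
  print(seqnames(genome))

  # Load chromosome I as a DNAString; nothing is stored in your own package
  chrI <- genome[["chrI"]]
  print(length(chrI))
}
```

In a package, BSgenome.Celegans.UCSC.ce11 would then be listed as a dependency (for example under Suggests or Imports in DESCRIPTION) instead of shipping the FASTA file.
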

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia