[Bioc-devel] Fwd: Re: [devteam-bioc] suggestion about bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38

Hi Michael,
Looks like a fairly serious bug with BSgenome:
Yes this is a serious one. I also received this report off list a couple
of days ago. FWIW getSeq() uses Rsamtools::scanFa() which is based on
samtools C code (it's here Rsamtools/src/samtools/faidx.c).

I was not able to reproduce this so far (seems like you need to be on
Windows and use a specific LC_COLLATE to observe this).

Anyway, given the seriousness of the problem, my plan is to roll back
all the BSgenome packages to the old on disk storage in the next couple
of days, in release and devel... sigh...

I'll announce here and on the Bioconductor list when the new packages
are out.

Sorry for the inconvenience,

H.
---------- Forwarded message ----------
From: liyangbjmu <liyangbjmu at bjmu.edu.cn>
Date: Sun, Apr 20, 2014 at 6:31 AM
Subject: Re: Re: [Bioc-devel] [devteam-bioc] suggestion about
bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
To: Michael Lawrence <lawrence.michael at gene.com>

  Hi Michael,

There is another problem with BSgeome.Hsapiens.NCBI.GRCh38. It can not
always get the same sequence with getSeq command.

  >library(BSgenome.Hsapiens.NCBI.GRCh38)
genome<-BSgenome.Hsapiens.NCBI.GRCh38
  > getSeq(genome,*'6',start=30000,width=50*)
   50-letter "DNAString" instance
seq: *GATGGCATCCAAGAAAGGGATGAGAATGTGAGATCCAGAAGGAAAAGCAG*
getSeq(genome,*'6',start=30000,width=50*)
   50-letter "DNAString" instance
seq: *TTGGAAGAAACAGGAAAACAGACCCTCAGAGACACAAAGGATGCTGAGAG*
getSeq(genome,*'6',start=30000,width=50*)
   50-letter "DNAString" instance
seq: *AGTGGCAGAGAGAAGAGTTGAAGGGGAGAAGTTGCTAGAACCTTGCTGCC*
getSeq(genome,'6',start=30000,width=50)
   50-letter "DNAString" instance
seq: CTCCTGCCCCCCTACCCTCACCTGGGTACCAGCCCAGGGGCCTCGGTCTG
getSeq(genome,'6',start=30000,width=50)
   50-letter "DNAString" instance
seq: CAAGGTCTGTAGTCCCTGCTGGATCTGCAGCAATGCCTGCATGGCTCGGG
getSeq(genome,'6',start=30000,width=50)
   50-letter "DNAString" instance
seq: GATTGGTAAGGATGGAGAGTGACTCTGGGTTCTGCATCTGGTGGGAAATA

  > sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)
  locale:
[1] LC_COLLATE=Chinese_People's Republic of China.936
[2] LC_CTYPE=Chinese_People's Republic of China.936
[3] LC_MONETARY=Chinese_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese_People's Republic of China.936
  attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Best,

Sean
2014-04-20
------------------------------

*Yang Li, PhD*
Department of Occupational and Environmental Health Science School of
Public Health, Peking University 38 Xueyuan Road, Beijing 100191, China
------------------------------
  *????????* Michael Lawrence
*??????????* 2014-04-17  05:15:10
*????????* Herv?_Pag?s
*??????* liyangbjmu; maintainer; bioc-devel at r-project.org
*??????* Re: [Bioc-devel] [devteam-bioc] suggestion about
bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
   Another interesting aspect of 2bit is that it supports simple masking.
I'm all for the 2bit direction. But then BSgenome would depend on
rtracklayer? Or have you reimplemented it?

On Wed, Apr 16, 2014 at 12:59 PM, Herv?? Pag??s <hpages at fhcrc.org> wrote:

Hi Sean

[hope you don't mind if I cc Bioc-devel]

On 04/15/2014 11:47 PM, Maintainer wrote:

Hi The Bioconductor Dev Team,
A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available
for one week. But the single-sequence.fa.gz file in this package is too
big. The sequences of all the chromosomes are put together. It is
difficult to load.  Why do you remake this
package as BSgenome.Hsapiens.UCSC.hg19 do? In
BSgenome.Hsapiens.UCSC.hg19, sequence files for each chromosome are
separated.

Yeah I know... sigh!!  ;-)

All the BSgenome data packages use this new on disk storage format, not
just GRCh38. All the chromosomes are now put together in a single
FASTA file compressed in the RAZip format (extension is rz, not gz).
The file is indexed so it makes direct random access faster. So for
example, extracting only some small portions of the genome with
getSeq() is much faster than with the previous on disk storage format
where the chromosomes were serialized into individual files.
Some people have manifested interest in having getSeq() do fast
random access to the genome sequences for a while. See for example
this post from Michael in 2011:

   https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html

I'm aware that the new on disk storage format slows down operations
where you actually need to compute on the entire genome, which is
typically done in a loop where one chromosome is loaded at a time.
It's hard to beat the speed (and simplicity) of just loading the
individual serialized DNAString objects in that case. This is
unfortunate and I was a little bit worried about this when I pushed
the new BSgenome packages to the public repos because I think this
is a common use case.

Note that the switch to this new format happened in devel in January
and was announced on the Bioc-devel mailing list:

   https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html

I was hoping people would test this and maybe start a discussion about
whether this change was worth it or not.

The good news is that I think we can have the best of both worlds i.e.
fast direct random access and fast loading of the full sequences.
I did some testing with the 2bit format and it looks very promising:

   - Random access is much faster than with current RAZip'ed FASTA
     format,

   - Loading a full chromosome into memory is also very fast because
     there isn't the overhead of decompressing the data. It could
     even be that it's faster than loading the chromosome stored in
     a serialized DNAString object, or maybe not, but at least it
     should be very close.

   - The sequences take less space on disk and the resulting BSgenome
     package will also be slightly smaller.

It's something that I was planning to work on early in this new devel
cycle. Seems like I should start even earlier and maybe backport to
BioC release?

Thanks,
H.

Best,
Sean

2014-04-16
------------------------------------------------------------------------

________________________________________________________________________
devteam-bioc mailing list
To unsubscribe from this mailing list send a blank email to
devteam-bioc-leave at lists.fhcrc.org
You can also unsubscribe or change your personal options at
https://lists.fhcrc.org/mailman/listinfo/devteam-bioc

--
Herv?? Pag??s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319 <%28206%29%20667-1319>

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

	[[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

[Bioc-devel] Fwd: Re: [devteam-bioc] suggestion about bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38

Thread (7 messages)