Skip to content

[Bioc-devel] PhredQuality from Biostrings

4 messages · Christian Ruckert, Martin Morgan

#
Hi,

I have written a function to read-in Roche SFF(Standard Flowgram Format) 
files into R. Now I want to store the contents in standard Bioconductor 
structures (e.q. sequences as DNAStringSet object). I have the quality 
scores as a list of integer vectors. One list entry for each sequence. 
The vector lengths correspond to the sequence lengths. The vectors 
contain entries between 0 and 40 corresponding to the base quality at 
this position.

Here is an example for one list entry, a sequence of length 82:
[1] 40 40 40 40 40 40 40 40 40 40 40 40 36 24 16 16 16 27 27 36 20 20 
27 27 31
[26] 27 36 38 39 40 40 40 40 40 40 40 40 40 40 40 40 40 39 34 34 38 39 
40 40 40
[51] 40 40 40 40 40 40 40 40 40 40 40 40 30 20 20 20 36 40 40 40 40 30 
30 30 30
[76] 39 40 40 40 40 40 40

Now I'm looking for an elegant way to convert my list of integer vectors 
to an PhredQuality object, but the solution I found is very slow for a 
list with 90000 sequences and a mean sequence length of around 400.
toString(PhredQuality(x))))

Is there a faster way creating a PhredQuality object out of a list like 
mine.

Regards,
Christian
R version 2.14.0 Under development (unstable) (2011-05-17 r55946)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] R453Plus1Toolbox_1.3.1
[2] BSgenome.Scerevisiae.UCSC.sacCer2_1.3.17
[3] BSgenome_1.21.0
[4] GenomicRanges_1.5.7
[5] Biostrings_2.21.3
[6] IRanges_1.11.5
[7] Biobase_2.13.2

loaded via a namespace (and not attached):
[1] biomaRt_2.9.1    hwriter_1.3      R2HTML_2.2       RCurl_1.6-1
[5] Rsamtools_1.5.17 ShortRead_1.11.6 tools_2.14.0     XML_3.4-0
#
On 06/10/2011 08:01 AM, Christian Ruckert wrote:
Hi Christian

Maybe along the lines of

PhredQuality(sapply(qualitylist, function(x) rawToChar(as.raw(x + 33))))

or via ShortRead::readQual / readFastaQual (can use a character vector 
for the path; no need to create a RochePath). Probably you'll find 
ShortReadQ useful for coordinating the sequences and qualities

Martin

  
    
4 days later
#
Am 10.06.2011 19:54, schrieb Martin Morgan:
This really speeds up things, thanks.
I have successfully created a ShortReadQ object out of my sequences

 > srq
class: ShortReadQ
length: 95551 reads; width: 77..1201 cycles

Is it reasonable to use ShortReadQ for sequences from Roche with their 
differing lengths? It seems to work but the manual always states 
"uniform-length short reads".

Christian
#
On 06/15/2011 10:28 AM, Christian Ruckert wrote:
yes it is fine; some operations (e.g., as(quality(srq), "matrix")) fail, 
as the qualities are not rectangular (I have sometimes found it 
convenient to look at trailing quality, then something like narrow(srq, 
start=width(rfq)-20) can be used).

Martin