[Bioc-devel] PhredQuality from Biostrings
On 06/10/2011 08:01 AM, Christian Ruckert wrote:
Hi, I have written a function to read Roche SFF (Standard Flowgram Format) files into R. Now I want to store the contents in standard Bioconductor structures (e.g. sequences as a DNAStringSet object). I have the quality scores as a list of integer vectors, one list entry per sequence. The vector lengths correspond to the sequence lengths, and each entry is between 0 and 40, giving the base quality at that position. Here is an example for one list entry, a sequence of length 82:
qualitylist[[1]]
 [1] 40 40 40 40 40 40 40 40 40 40 40 40 36 24 16 16 16 27 27 36 20 20 27 27 31
[26] 27 36 38 39 40 40 40 40 40 40 40 40 40 40 40 40 40 39 34 34 38 39 40 40 40
[51] 40 40 40 40 40 40 40 40 40 40 40 40 30 20 20 20 36 40 40 40 40 30 30 30 30
[76] 39 40 40 40 40 40 40

Now I'm looking for an elegant way to convert my list of integer vectors to a PhredQuality object, but the solution I found is very slow for a list with 90000 sequences and a mean sequence length of around 400.
pq = PhredQuality(sapply(qualitylist, function(x)
toString(PhredQuality(x))))
Hi Christian,

Maybe along the lines of

  PhredQuality(sapply(qualitylist, function(x) rawToChar(as.raw(x + 33))))

or via ShortRead::readQual / readFastaQual (these can take a character vector for the path; no need to create a RochePath). You'll probably find ShortReadQ useful for coordinating the sequences and qualities.

Martin
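To illustrate why the rawToChar(as.raw(x + 33)) approach works, here is a minimal sketch. It assumes the usual Phred+33 convention, in which a score q is stored as the ASCII character with code q + 33. The example data, the helper name encode_phred33, and the round-trip check are made up for illustration; the last step is skipped if Biostrings is not installed.

```r
# Hypothetical example data: two short quality vectors in the 0..40 range.
qualitylist <- list(
  c(40L, 40L, 36L, 24L, 16L),
  c(0L, 27L, 31L, 40L)
)

# Encode one integer vector as a Phred+33 string: add the ASCII offset 33,
# convert to raw bytes, and collapse the bytes into a single string.
# (Illustrative helper name, not part of any package.)
encode_phred33 <- function(x) rawToChar(as.raw(x + 33L))

# vapply() guarantees a character vector with exactly one string per sequence.
qualstrings <- vapply(qualitylist, encode_phred33, character(1))
print(qualstrings)

# Round trip back to integers as a sanity check on the encoding.
decoded <- lapply(qualstrings, function(s) as.integer(charToRaw(s)) - 33L)
stopifnot(identical(decoded, qualitylist))

# If Biostrings is available, wrap the encoded strings in a PhredQuality
# object in one call instead of building one object per sequence.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  pq <- Biostrings::PhredQuality(qualstrings)
  print(pq)
}
```

The speed-up presumably comes from doing only cheap base R conversions inside the loop and calling the PhredQuality constructor once on the whole character vector, rather than once per sequence.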
Is there a faster way to create a PhredQuality object from a list like mine?

Regards,
Christian
sessionInfo()
R version 2.14.0 Under development (unstable) (2011-05-17 r55946)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] R453Plus1Toolbox_1.3.1
[2] BSgenome.Scerevisiae.UCSC.sacCer2_1.3.17
[3] BSgenome_1.21.0
[4] GenomicRanges_1.5.7
[5] Biostrings_2.21.3
[6] IRanges_1.11.5
[7] Biobase_2.13.2

loaded via a namespace (and not attached):
[1] biomaRt_2.9.1    hwriter_1.3      R2HTML_2.2       RCurl_1.6-1
[5] Rsamtools_1.5.17 ShortRead_1.11.6 tools_2.14.0     XML_3.4-0
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793