Loop avoidance and logical subscripts - R-help

retama

Thu, May 21, 2009 6:12 AM #

Hello!

I'm writing a script with a lot of loops and it executes really slowly over
huge amounts of data. I assume it's because I don't know how to avoid using
loops. Logical subscripts are more desirable, but I don't know how to
implement them. One example of that issue:

library(seqinr)
GCsequence <- vector()
	for( i in 1:(length(data$sequence))) {
		c(GCsequence,GC(s2c(data$sequence[i])))->data$GCsequence[i]
	}
	rm(GCsequence)

How should I speed up that? 

Thank you, 

Retama

View this message in context: http://www.nabble.com/Loop-avoidance-and-logical-subscripts-tp23652935p23652935.html
Sent from the R help mailing list archive at Nabble.com.

retama

Thu, May 21, 2009 9:56 AM #

Patrick Burns kindly provided an article about this issue called 'The R
Inferno'. However, I will expand a little bit my question because I think it
is not clear and, if I coud improve the code it will be more understandable
to other users reading this messages when I will paste it :)

In my example, I have a dataframe with several hundreds of DNA sequences in
the column data$sequences (each value is a long string written in an
alphabet of four characters, which are A, C, T and G). I'm trying to know
parameter number of Gs plus Cs over the total  [G+C/(A+T+C+G)] in each
sequence. In example, data$sequence [1] is something like AATTCCCGGGGGG but
a little bit longer, and, its G+C content is 0.69 . I need to compute a
vector with all G+C contents (in my example, in data$GCsequence, in which
data$GCsequence[1] is 0.69).

So the question was if making a loop and a combination of values with c() or
cbind() or with logical subscripts is ok or not. And which approach should
produce better results in terms of efficiency (my script goes really slow).

Thank you,

Retama

View this message in context: http://www.nabble.com/Loop-avoidance-and-logical-subscripts-tp23652935p23656703.html
Sent from the R help mailing list archive at Nabble.com.

(Ted Harding)

Thu, May 21, 2009 10:18 AM #

On 21-May-09 16:56:23, retama wrote:

Perhaps the following could be the basis of your code for the bigger
problem:

  S <- unlist(strsplit("AATTCCCGGGGGG",""))
  S
#  [1] "A" "A" "T" "T" "C" "C" "C" "G" "G" "G" "G" "G" "G"
  (sum((S=="C")|(S=="G")))
# [1] 9
  (sum((S=="C")|(S=="G")))/length(S)
# [1] 0.6923077

You could build a function on those lines, to evaluate what you
want for any given string; and then apply() it to the elements
(which are the separate character strings) of data$sequences
(which is presumably a vector of character strings).

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 21-May-09                                       Time: 18:18:24
------------------------------ XFMail ------------------------------

Martin Morgan

Thu, May 21, 2009 10:18 AM #

retama wrote:

A very efficient way to do this is

   library(Biostrings)
   dna = DNAStringSet(data$sequence)
   alf = alphabetFrequency(dna, baseOnly=TRUE)
   gc = rowSums(alf[,c("G", "C")]) / rowSums(alf)

this takes about .8 second for 3 million 36mers, for instance. 
Biostrings is installed with

   source('http://bioconductor.org/biocLite.R')
   biocLite('Biostrings')

Martin

Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

retama

Wed, May 27, 2009 9:16 AM #

Thank you! The script is now adapted to Biostrings and it is really fast! For
example, it does:

   alph_sequence <- alphabetFrequency(data$sequence, baseOnly=TRUE)
   data$GCsequence <- rowSums(alph_sequence[,c("G", "C")]) /
rowSums(alph_sequence)

in the G+C computation. It also works amazingly fast in substring extraction
(substring), reverse complement (reverseComplement sequences), palindromes
search (findComplementedPalindromes) and so on.

Now, my bottleneck is conventional string handling, because I have not found
yet how to convert DNAStringSets to vector of chars. Now, I'm doing it by:

   dna <- vector()
	for (i in 1:length(dnaset)) {
		c(dna, toString(data$dnaset[[i]])) -> dna
	}

Regards,

Retama

View this message in context: http://www.nabble.com/Loop-avoidance-and-logical-subscripts-tp23652935p23745814.html
Sent from the R help mailing list archive at Nabble.com.