Loop avoidance and logical subscripts
On 21-May-09 16:56:23, retama wrote:
Patrick Burns kindly provided an article about this issue called 'The R Inferno'. However, I will expand my question a little, because I think it was not clear, and if I can improve the code it will be more understandable to other users reading these messages when I paste it :)

In my example, I have a data frame with several hundred DNA sequences in the column data$sequences (each value is a long string written in an alphabet of four characters: A, C, T and G). For each sequence I'm trying to compute the number of Gs plus Cs over the total length, (G+C)/(A+T+C+G). For example, data$sequences[1] is something like AATTCCCGGGGGG but a little longer, and its G+C content is 0.69. I need to compute a vector with all the G+C contents (in my example, data$GCsequence, in which data$GCsequence[1] is 0.69).

So the question was whether building the result in a loop, combining values with c() or cbind(), or using logical subscripts, is OK or not, and which approach should be more efficient (my script runs really slowly).

Thank you,
Retama
Perhaps the following could be the basis of your code for the bigger
problem:
S <- unlist(strsplit("AATTCCCGGGGGG",""))
S
# [1] "A" "A" "T" "T" "C" "C" "C" "G" "G" "G" "G" "G" "G"
(sum((S=="C")|(S=="G")))
# [1] 9
(sum((S=="C")|(S=="G")))/length(S)
# [1] 0.6923077
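You could build a function on those lines, to evaluate what you
want for any given string; and then sapply() it over the elements
(which are the separate character strings) of data$sequences
(which is presumably a vector of character strings). Note that
apply() itself is meant for matrices and arrays; for a plain
vector, sapply() or vapply() is the tool to use.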
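
As a rough sketch of that idea (the function name gc_content and the
toy data frame are made up for illustration, and I am assuming
data$sequences is a character vector, not a factor):

```r
# Hypothetical helper: fraction of G and C characters in one sequence string
gc_content <- function(s) {
  chars <- unlist(strsplit(s, ""))
  sum(chars == "C" | chars == "G") / length(chars)
}

# Toy data frame standing in for the real one (assumed layout)
data <- data.frame(sequences = c("AATTCCCGGGGGG", "ATAT", "GGCC"),
                   stringsAsFactors = FALSE)

# vapply() fills the whole result vector in one call, so nothing is
# grown step by step with c() or cbind() inside a loop
data$GCsequence <- vapply(data$sequences, gc_content, numeric(1),
                          USE.NAMES = FALSE)
data$GCsequence
# [1] 0.6923077 0.0000000 1.0000000

# A different, fully vectorised route: delete everything that is not
# G or C, then compare string lengths (no splitting into characters)
gc_fast <- nchar(gsub("[^GC]", "", data$sequences)) / nchar(data$sequences)
```

The second form avoids calling a function once per sequence, which may
help if speed is the concern; I have not timed it on data of your size.
Both assume upper-case letters only.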
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Date: 21-May-09 Time: 18:18:24