Performing basic Multiple Sequence Alignment in R?
Tal; I'm trimming the BioC posting. In the R lists it is considered spamming to cross post. (Please re-read the Posting Guide.)
On Dec 21, 2010, at 4:21 AM, Tal Galili wrote:
Hello everyone,
I am not sure if this should go on the general R mailing list (for
example,
if there is a text mining solution that might work here) or the
bioconductor
mailing list (since I wasn't able to find a solution to my question on
searching their lists) - so this time I tried both, and in the
future I'll
know better (in case it should go to only one of the two).
The task I'm trying to achieve is to align several sequences together.
I don't have a basic pattern to match to. All that I know is that the
"True" pattern should be of length "30" and that the sequences I'm
looking
at, have had missing values introduced to them at random points.
Here is an example of such sequences, were on the left we see what
is the
real location of the missing values, and on the right we see the
sequence
that we will be able to observe. My goal is to reconstruct the left
column
using only the sequences I've got on the right column (based on the
fact
that many of the letters in each position are the same)
Real_sequence The_sequence_we_see
1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
The agrep function allows one to specify which sort of differences to consider in calculating a Levenshtein edit distance. Insertions are one possible distance component. You could take a look at its code (in C in hte sources) and perhaps rejigger it to spit out the location of the deletions. > agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence, max.distance=list(deletions=0, substitutions=0, insertions=0)) integer(0) > agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence, max.distance=list(deletions=0, substitutions=0, insertions=1)) [1] 1
David.
>
> Here is an example code to reproduce the above example:
>
> ATCG <- c("A","T","C","G")
> set.seed(40)
>
> original.seq <- sample(ATCG, 30, T)
>
> seqS <- matrix(original.seq,200,30, T)
>
> change.letters <- function(x, number.of.changes = 15,
> letters.to.change.with = ATCG)
> {
>
> number.of.changes <- sample(seq_len(number.of.changes), 1)
>
> new.letters <- sample(letters.to.change.with , number.of.changes,
> T)
>
> where.to.change.the.letters <- sample(seq_along(x) ,
> number.of.changes, F)
>
> x[where.to.change.the.letters] <- new.letters
>
> return(x)
> }
>
> change.letters(original.seq)
>
> insert.missing.values <- function(x) change.letters(x, 3, "-")
>
> insert.missing.values(original.seq)
>
> seqS2 <- t(apply(seqS, 1, change.letters))
>
> seqS3 <- t(apply(seqS2, 1, insert.missing.values))
>
> seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
> require(stringr)
> # library(help=stringr)
>
> all.seqS <- str_replace(seqS4,"-" , "")
>
> # how do we allign this?
>
> data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
>
>
> I understand that if all I had was a string and a pattern I would be
> able to
> use
>
> library(Biostrings)
>
> pairwiseAlignment(...)
>
>
>
> But in the case I present we are dealing with many sequences to
> align to one
> another (instead of aligning them to one pattern).
>
> Is there a known method for doing this in R?
>
>
> Thanks,
>
> Tal
>
>
>
> ----------------Contact
> Details:-------------------------------------------------------
> Contact me: Tal.Galili at gmail.com | 972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il
> (Hebrew) |
> www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT