Performing basic Multiple Sequence Alignment in R?

Sun, Dec 26, 2010 6:42 AM

From: marchywka at hotmail.com
To: tal.galili at gmail.com; r-help at r-project.org
Subject: RE: [R] Performing basic Multiple Sequence Alignment in R?
Date: Tue, 21 Dec 2010 17:03:17 -0500

From: tal.galili at gmail.com
Date: Tue, 21 Dec 2010 20:17:18 +0200
Subject: Re: [R] Performing basic Multiple Sequence Alignment in R?
To: r-help at r-project.org

Dear Mike and Thomas,

From what I gathered here (Thanks to Joris Meys):
http://stackoverflow.com/questions/4497747/how-to-perform-basic-multiple-sequence-alignments-in-r/4498434#4498434
There is an R interface to the MUSCLE algorithm in the bio3d package
(function seqaln()).
But not one for clustal.

I will probably end up using pairwiseAlignment on pairs of alignments
with some sort of stopping rules (I'll have to play with it to see how
it works).


http://scholar.google.com/scholar?hl=en&q=%22exact+string+matching%22+alignment

http://citeseerx.ist.psu.edu/search?q=exact+string+matching+alignment+dna&submit=Search&sort=rel

Certainly, if you are flexible and can use whatever may be close in R, that
is fine, but I seem to recall that exact string matching was a fast and
interesting way to go, and maybe some of the authors above, in the interest
of promoting their work, would help implement an R version if there is demand.

I seem to recall I did something like building indexes of the strings to be aligned
first, then finding substrings that appeared exactly once in each of the sequences
to be aligned (this was the most restrictive criterion, but you can imagine how to
make it more accommodating). Now that you got me started: up-front tokenizing or
compiling of the input sequences (usually no more than indexing them in some way)
made many later operations, like alignment, go faster. This may have ended up being
similar to BLAST, but now I can't really recall. Anyway, my point here is that
somewhere in R there may be packages that generate intermediate forms useful across
disciplines: mining data from text, linguistics, or macromolecule analysis. In fact,
the indexing process helps find things that have migrated a long way from their
original place, and there are probably other non-alignment-related things you could
get out of the approach.
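
To make the indexing idea above concrete, here is a minimal sketch in Python
(not R, and not the original code, which is not shown in this thread). It
indexes every fixed-length substring (k-mer) of each sequence and keeps those
that occur exactly once in every sequence. The function names and the
fixed-length restriction are assumptions made for illustration only.

```python
from collections import defaultdict

def kmer_index(seq, k):
    """Map each length-k substring of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def shared_unique_kmers(seqs, k):
    """Return {kmer: [position in seqs[0], position in seqs[1], ...]} for
    every k-mer that occurs exactly once in each input sequence.
    (Hypothetical helper; this is the 'most restrictive' criterion.)"""
    indexes = [kmer_index(s, k) for s in seqs]
    anchors = {}
    for kmer, positions in indexes[0].items():
        if len(positions) != 1:
            continue  # repeated in the first sequence, so not usable
        # .get() avoids adding empty entries to the defaultdicts
        if all(len(idx.get(kmer, ())) == 1 for idx in indexes[1:]):
            anchors[kmer] = [idx[kmer][0] for idx in indexes]
    return anchors

seqs = ["ACGTTGCA", "TTACGTGC", "GGACGTAA"]
print(shared_unique_kmers(seqs, 4))
```

Relaxing the criterion (for example, allowing a k-mer to be missing from some
sequences) only changes the test inside the loop; the index itself is reused.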

If you pursue this or make some decision, would you please get back to
us, or at least to me off list? I just went back through my old code and hit the
search links I posted above; this still seems like quite an interesting
area, and the issues do not appear to be confined to bio. Judging by the
method names in my code, I had a way to supply fixed patterns,
probably from places like PROSITE or CDD, for use as the string you
probably meant to suggest, although I tend to think it would make more sense
to discover these from the strings the code finds in the sequences.

I seem to recall I could do 2 sequences reasonably well, with some quirks and limitations,
but gave up when I tried to do multiple alignments (actually there was no point
at the time). Recent literature still seems to talk about sub-quadratic time,
although in practice, for large sequences, the real execution time could be dominated
by VM rather than algorithm order LOL. The indexing also makes it possible to find related
but distant strings, something that may be of interest but is not normally
thought of as alignment, which concerns strings perturbed in limited ways ("edit distance"
being rather restricted to a few operations).
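
For the two-sequence case, one common way to use such matches (an assumption
here, not necessarily what my old code did) is to treat each shared unique
substring as an anchor pair (position in A, position in B) and keep the
longest chain of anchors that appear in the same order in both sequences.
Anchors left out of the chain are candidates for segments that have moved.
The sketch below is in Python, takes the anchor pairs as input, and assumes
all positions are distinct; the function name is hypothetical.

```python
from bisect import bisect_left

def chain_anchors(pairs):
    """Return the longest list of (pos_a, pos_b) pairs that is strictly
    increasing in both coordinates (longest increasing subsequence on
    pos_b after sorting by pos_a). Assumes positions are distinct."""
    pairs = sorted(pairs)
    tails = []       # tails[n]: index of the last pair of the best chain of length n+1
    tail_vals = []   # pos_b of those pairs, kept sorted for bisect
    prev = [None] * len(pairs)
    for i, (_, b) in enumerate(pairs):
        n = bisect_left(tail_vals, b)
        if n == len(tails):
            tails.append(i)
            tail_vals.append(b)
        else:
            tails[n] = i
            tail_vals[n] = b
        prev[i] = tails[n - 1] if n > 0 else None
    # Walk back from the end of the longest chain
    chain = []
    i = tails[-1] if tails else None
    while i is not None:
        chain.append(pairs[i])
        i = prev[i]
    return chain[::-1]

anchors = [(0, 5), (2, 1), (4, 3), (6, 7), (8, 2)]
print(chain_anchors(anchors))
```

The regions between consecutive anchors in the chain are then short enough to
hand to an ordinary pairwise aligner.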

If you find a specific paper or approach that seems to work, that may be
of interest to many here, and indeed it may already be implemented under some other name.

Thanks.
