Skip to content

[Bioc-devel] Biostrings: ambiguous matching for 'compareStrings'

2 messages · Julian Gehring, Hervé Pagès

#
Hi,

When I want to compare two aligned sequences of the same length and find 
the mismatching positions, 'compareStrings' from the 'Biostrings' 
package seems to be the best choice.  However, it does take into account 
any ambiguous matches (as matching any base to 'N'), as it can be found 
with the 'fixed' for 'matchPattern' and related functions.  Would it be 
reasonable to include this, to ensure a consistent behavior of methods 
for 'XString' and related objects?

Best wishes
Julian
#
Hi Julian,
On 03/15/2013 08:27 AM, Julian Gehring wrote:
Adding the 'fixed' arg to compareStrings() sounds like a good
suggestion to me. Note that it would only make sense to use
compareStrings() with 'fixed=FALSE' if the 2 sequences were
actually aligned with 'fixed=FALSE'. So from now I'm assuming
that this is how your sequences were aligned i.e. with a tool
that supports 'fixed=FALSE'.

However, since no alignment tools in Biostrings that supports
'fixed=FALSE' also supports indels (but see [1]), I would also
assume that your aligned sequences don't contain indels. Unless
your sequences were aligned with an external tool?

So if that's the case (i.e. no dashes in your aligned sequences),
and if you're only interested in the number of mismatches, you can
use neditAt():

   > s1 <- DNAString("AANTTTWCCW")
   > s2 <- DNAString("ACATTGSCCV")
   > neditAt(s1, s2)
   [1] 5
   > neditAt(s1, s2, fixed=FALSE)
   [1] 3

If you really want the position of the mismatches, here is an ugly
and inefficient trick:

   > mapply(function(x, y) neditAt(x, y, fixed=FALSE),
            DNAStringSet(s1, seq_along(s1), width=1),
            DNAStringSet(s2, seq_along(s2), width=1))
  [1] 0 1 0 0 0 1 1 0 0 0

then coerce to logical with as.logical() and apply which() on the
result to get the positions of the mismatches.

The fact that there is not easy/efficient workaround (and that
those workarounds only work correctly if the aligned sequences
don't contain indels), would make the addition of the 'fixed'
arg to compareStrings() even more valuable. Just added to my
TODO list.

Cheers,
H.

[1] At least not at the moment, but I still have to check-in a
     patch from H. Jaffee that would allow matchPattern() and
     family to support with.indels=TRUE and fixed=FALSE at the
     same time. Hopefully I manage to get to this before the
     upcoming release. BTW thanks Harris for your infinite patience
     on this matter :-)