Skip to content

[Bioc-devel] vmatchPattern Returns Out of Bounds Indices

4 messages · Dario Strbenac, Hervé Pagès

#
Hello,

If using vmatchPattern to find a sequence in another sequence, the resulting end index can be beyond the length of the subject XStringSet. For example:

forwardPrimer <- "TCTTGTGGAAAGGACGAAACACCG"
[1] 75 75
primerEnds <- vmatchPattern(forwardPrimer, reads, max.mismatch = 1)
[1] 23 76

This causes problems if using extractAt to obtain the sequences within each read. For example:

sequences = extractAt(reads, locations)
Error in .normarg_at2(at, x) : 
  some ranges in 'at' are off-limits with respect to their corresponding sequence
  in 'x'

It's rare, but still a problem, nonetheless.
FALSE   TRUE 
366225      2

This happens with Biostrings 2.42.0.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
#
Hi,

These questions really belong to the support site.
On 11/16/2016 04:00 PM, Dario Strbenac wrote:
That is the expected behavior for matchPattern() and family when
max.mismatch != 0. This was a conscious choice. If you don't
allow indels, the matches will always have the length of the pattern.
This allows straightforward one-to-one mapping between the letters
in the pattern and their positions on the subject, which is a nice
property. For example, if the pattern is ACGT but the reported match
is range 1-3, you don't have an easy way to know how the 3 nucleotides
in the pattern are actually mapped to the subject.

Truncating the match when it goes out of bounds actually means
clipping the pattern. You'll get that behavior by allowing indels:

   > matchPattern("AAACCTT", "AAAAAAAAAAACCT")
     Views on a 14-letter BString subject
   subject: AAAAAAAAAAACCT
   views: NONE

   > matchPattern("AAACCTT", "AAAAAAAAAAACCT", max.mismatch=1)
     Views on a 14-letter BString subject
   subject: AAAAAAAAAAACCT
   views:
       start end width
   [1]     9  15     7 [AAACCT ]

   > matchPattern("AAACCTT", "AAAAAAAAAAACCT", max.mismatch=1, 
with.indels=TRUE)
     Views on a 14-letter BString subject
   subject: AAAAAAAAAAACCT
   views:
       start end width
   [1]     9  14     6 [AAACCT]

But then of course you loose the one-to-one mapping between the
letters in the pattern and their positions on the subject.

The 2 other alternatives are (1) to get rid of the out of bounds
matches or (2) to trim them. For example, to trim them:

   subjects <- DNAStringSet(c("AACCTTTTTT", "AAAAAAAAAAACCT"))
   m <- vmatchPattern("AAACCTT", subjects, max.mismatch=1)
   m
   # MIndex object of length 2
   # [[1]]
   # IRanges object with 1 range and 0 metadata columns:
   #           start       end     width
   #       <integer> <integer> <integer>
   #  [1]         0         6         7
   #
   # [[2]]
   # IRanges object with 1 range and 0 metadata columns:
   #           start       end     width
   #       <integer> <integer> <integer>
   #   [1]         9        15         7

   m2 <- IRangesList(start=pmax(start(m), 1),
                     end=pmin(end(m), width(subjects)))
   m2
   # IRangesList of length 2
   # [[1]]
   # IRanges object with 1 range and 0 metadata columns:
   #           start       end     width
   #       <integer> <integer> <integer>
   #   [1]         1         6         6
   #
   # [[2]]
   # IRanges object with 1 range and 0 metadata columns:
   #           start       end     width
   #       <integer> <integer> <integer>
   #   [1]         9        14         6

Then:

   extractAt(subjects, m2)
   # DNAStringSetList of length 2
   # [[1]] AACCTT
   # [[2]] AAACCT

H.

  
    
1 day later
#
Good day,
I suppose, although it seemed like an unexpected issue at first because it's not documented within ?lowlevel-matching so users don't know what to expect.
This reveals a discrepancy between the documentation and the way the function operates. In the documentation, the function definition of vmatchPattern has with.indels = FALSE in it. However, changing it to TRUE results in

Error in .XStringSet.vmatchPattern(pattern, subject, max.mismatch, min.mismatch,  : 
  vmatchPattern() does not support indels yet

This is utilising Biostrings 2.42.0 in R 3.3.1.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
#
On 11/18/2016 02:00 AM, Dario Strbenac wrote:
Ah ok, right. I tested with matchPattern() only (which does support
indels) and forgot that vmatchPattern() doesn't support them yet.

So I guess a solution for you now is to use one of the 2 other
alternatives i.e. (1) to get rid of the out of bounds matches
or (2) to trim the matches.

H.