[Bioc-devel] Feasibility of Parallel Extraction of Matches with extractAllMatches

Hi Dario,
On 11/16/2016 02:00 AM, Dario Strbenac wrote:
Not possible because extractAllMatches() returns a Views object and
a Views object can only represent views defined on a *single* subject.

extractAllMatches() is old and predates extractAt() which can be used
for this. See man page for extractAt/replaceAt for more information.
In particular the "(C) ADVANCED EXAMPLES" section in the man page
shows how to use extractAt() to extract the matches returned by
vmatchPattern().
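
As a rough illustration of the approach described above (a minimal sketch, not the man page's own example): the toy sequences and the pattern below are made up, and the explicit coercion of the vmatchPattern() result to a CompressedIRangesList is an assumption made here to spell out the "one set of ranges per sequence" structure that extractAt() works with.

```r
library(Biostrings)

# Toy subject with three named sequences (hypothetical data).
subject <- DNAStringSet(c(seq1 = "ACGTACGT", seq2 = "TTTT", seq3 = "GGACGTCC"))

# vmatchPattern() returns an MIndex object holding the match ranges
# for each sequence in 'subject'.
midx <- vmatchPattern("ACGT", subject)

# Coerce to a list of ranges, one list element per sequence.
at <- as(midx, "CompressedIRangesList")

# extractAt() returns a DNAStringSetList parallel to 'subject':
# element i holds the matches found in subject[[i]].
matches <- extractAt(subject, at)

lengths(matches)   # number of matches per sequence
unlist(matches)    # all matches flattened into one DNAStringSet
```

Because the result is parallel to the subject, the mapping between each match and the sequence it came from is kept, including for sequences with no match.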
Two issues with substr():

   (1) It will be quite inefficient if there are millions of matches
       to extract since it actually generates a copy of the matches.
       extractAllMatches() and extractAt() don't have this problem
       because they don't generate copies of the original sequence
       data. This holds even for extractAt(): the DNAStringSetList
       object it returns actually contains views on the original
       DNAStringSet subject. These views are Biostrings internal
       business and not something that can easily be seen unless you
       look at the internals of the DNAStringSet and DNAStringSetList
       objects.

   (2) substr() returns a "flat" vector so in general the mapping
       between the matches and the individual sequences in the
       DNAStringSet subject is lost.
This should be addressed in Biostrings 2.43.1. Thanks!

H.