[Bioc-devel] Feasibility of Parallel Extraction of Matches with extractAllMatches
Hi Dario,
On 11/16/2016 02:00 AM, Dario Strbenac wrote:
Good day,
I'd like to request that extractAllMatches works when subject is an XStringSet. The function could check that subject and mindex have the same length and then process them in parallel. Currently, the following example isn't immediately possible.
words <- BStringSet(c("xxGOATzz", "xxMOATzz", "xxNOTEzz"))
matches <- vmatchPattern("GOAT", words, max.mismatch = 1)
similarWords <- extractAllMatches(words, matches) # Not possible.
This isn't possible because extractAllMatches() returns a Views object, and a Views object can only represent views defined on a *single* subject. extractAllMatches() is old and predates extractAt(), which can be used for this. See the man page for extractAt/replaceAt for more information; in particular, the "(C) ADVANCED EXAMPLES" section shows how to use extractAt() to extract the matches returned by vmatchPattern().
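Applied to the example from this thread, the extractAt() route might look like the sketch below. It assumes, in line with the man page's advanced example, that extractAt() accepts the MIndex returned by vmatchPattern() directly as its 'at' argument; the variable names are the ones from the original post.

```r
library(Biostrings)

words <- BStringSet(c("xxGOATzz", "xxMOATzz", "xxNOTEzz"))
matches <- vmatchPattern("GOAT", words, max.mismatch = 1)

# Assumption: extractAt() takes the MIndex as 'at' and returns a
# list-like object with one element per sequence in 'words'.
similarWords <- extractAt(words, matches)

# Number of matches extracted from each sequence.
elementNROWS(similarWords)
```

The result keeps one list element per input sequence, so a sequence with no match is represented by an empty element rather than being dropped.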
Could that be implemented for the next release of Biostrings? Or, perhaps it can be deprecated since it duplicates the functionality of substr?
substr(words, start(matches), end(matches))
[1] "GOAT" "MOAT" NA
Two issues with substr():
(1) It will be quite inefficient if there are millions of matches
to extract, since it actually generates a copy of the matches.
extractAllMatches() and extractAt() don't have this problem
because they don't generate copies of the original sequence
data. This holds even for extractAt(): the DNAStringSetList
object it returns actually contains views on the original
DNAStringSet subject, although these views are Biostrings
internal business and not something that can easily be seen
unless you look at the internals of the DNAStringSet and
DNAStringSetList objects.
(2) substr() returns a "flat" vector so in general the mapping
between the matches and the individual sequences in the
DNAStringSet subject is lost.
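To illustrate point (2), here is a minimal sketch of how the mapping between matches and sequences can be recovered from a list-like result. It assumes extractAt() accepts the MIndex from vmatchPattern() as 'at'; 'hits' and 'origin' are names made up for this illustration.

```r
library(Biostrings)

words <- BStringSet(c("xxGOATzz", "xxMOATzz", "xxNOTEzz"))
matches <- vmatchPattern("GOAT", words, max.mismatch = 1)

# Assumption: the result has one list element per sequence in 'words'.
hits <- extractAt(words, matches)

# For each extracted match, the index of the sequence it came from.
origin <- rep(seq_along(hits), unname(elementNROWS(hits)))

# Flat table of matches that still records where each one was found.
data.frame(seq_index = origin, match = as.character(unlist(hits)))
```

Because the grouping is carried by the list structure, flattening is an explicit step and the origin of each match can be kept alongside it.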
Also, the expected subsetting fails for MIndex objects.
class(matches)
[1] "ByPos_MIndex"
length(matches)
[1] 3
length(matches[1])
[1] 3
This should be addressed in Biostrings 2.43.1. Thanks! H.
-------------------------------------- Dario Strbenac University of Sydney Camperdown NSW 2050 Australia
Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319