[Bioc-devel] Changes in the %in% function for DNAStringSet?
Hi Herv?, Thanks a lot for fixing it and for the super detailed description! Learned a lot :-) And thanks for the benchmarking, that's really useful as well! I can't really think of any drawbacks there, but my %in% usage is certainly limited. What do the R developer guys say about it? Wouldn't it make sense to have it that way in base R? Cheers, Nico
On May 4, 2012, at 3:56 AM, Herv? Pag?s wrote:
Hi Nico, Last week I did some improvements/reorganization of the match(), %in%, duplicated(), and unique() stuff in IRanges/GenomicRanges/Biostrings, and apparently forgot to define the "%in%" method for ANY,Vector. Thanks for the catch! This is fixed in IRanges 1.15.8. FWIW I also added an "%in%" method for Vector,ANY so now this works too:
> DNAStringSet(c("TTGCGA","ATGRCT","ACASTG")) %in% c("TTGCGA","ATGGCT","ACACTG")
[1] TRUE FALSE FALSE <rant mode> It is so sad that we have to redefine "%in%" methods that do exactly the same thing as base::`%in%`:
> base::`%in%`
function (x, table)
match(x, table, nomatch = 0L) > 0L
<environment: namespace:base>
just because base::`%in%` cannot dispatch on the appropriate
"match" method. A well-known issue of the way generics, methods
and NAMESPACE interact with each other... but still an unfortunate
one.
</rant mode>
The good news is that we have on our TODO list to explicitly define
the match() and %in% generics in BiocGenerics so there will be an
opportunity to overwrite the "%in%" default method:
setMethod("%in%", c("ANY", "ANY"), function (x, table) match(x, table, nomatch = 0L) > 0L)
(I'm still hesitant about this though. What could be the drawbacks
of overwriting the default method?)
Also last week at the same time I did the changes on match() and
family, I also reimplemented the "match" method for DNAStringSet
objects (which is called when either 'x' or 'table' or both are
DNAStringSet). The new implementation is in Biostrings 2.25.3.
It uses a hash-based algorithm instead of the quicksort-based algo
that was used so far. The resulting speedup varies a lot depending
on the sizes of 'x' and 'table', and will typically be important
(10x or more) for big (i.e. > 1M elements) DNAStringSet objects.
This benefits directly %in%, duplicated() and unique() on
DNAStringSet objects.
With Biostrings 2.25.3 (Bioc 2.11):
> library(Biostrings) > probes <- DNAStringSet(hgu133aprobe) > system.time(isdup <- duplicated(probes))
user system elapsed 0.048 0.000 0.050 With Bioc <= 2.10:
> system.time(isdup <- duplicated(probes))
user system elapsed 0.232 0.000 0.233 Finally I should mention that, even though the hash function I use for DNAStringSet (and RNAStringSet, AAStringSet, BStringSet) is the same as the function used in base R for hashing the strings of a standard character vector, calling match(), %in%, duplicated() or unique() on a standard character vector is still slightly faster (2x) than on a DNAStringSet. This can probably be explained by the fact that all the strings in all the character vectors defined in a session are pre-hashed i.e. hashed the 1st time the string is created and the result of the hash stored in the "global CHARSXP hash table". Cheers, H. On 05/01/2012 05:28 AM, Nicolas Delhomme wrote:
Hi all,
In R 2.15.0, Bioc 2.10, the following works:
library(Biostrings)
c("TTGCGA","ATGGCT","ACACTG") %in% DNAStringSet(c("TTGCGA","ATGRCT","ACASTG"))
[1] TRUE FALSE FALSE
sessionInfo()R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] stats4_2.15.0 While in Bioc 2.11 it fails: Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
sessionInfo()
R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.25.3 IRanges_1.15.7 BiocGenerics_0.3.0 loaded via a namespace (and not attached): [1] stats4_2.15.0 I'd just like to know if that is that a change of API or not. If yes, I'd need to adapt my code that currently fails building. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
--------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany