Fast multiple match function
Hi Jeff, Indeed the data.table package does provide a much cleaner way to achieve the same functionality, and a lot of other functionality as bonus. Thanks for letting me know about it.
On Tue, 7 Apr 2015 at 15:41 Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
You might find the data.table package helpful. It uses an index sorted
with a radix sort and minimizes moving the data around in memory.
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live
Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
On April 7, 2015 1:50:39 PM PDT, Keshav Dhandhania <kshav.91 at gmail.com>
wrote:
Hi all, Thanks for the responses. Herve's example is a good small size example of what I wanted.
y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2) someCoolFunc(-2, y)
[1] 3 9
someCoolFunc(15, y)
[1] 4 5 8 The requirement is that I want someCoolFunc() to run in O(number of matches) time, instead of O(size of y). This is because y is big. And I don't know all the queries I want to do up-front. And the results of some queries might change the queries I want to do in the future. @David: I hope the above description is more clear. @Enrico, Herve: I want both the functionality provided by one function. - On repeated calls, fmatch() does give O(1) performance, but it does not give all matches. - findMatches() gives all matches, but I need to know the entire vector x beforehand. I don't have that luxury. I do have something that works now, using split and fmatch (package fastmatch). So just posting that in case anyone in the future has the same problem.
y.unique <- unique(y) # create a map from the unique elements of y to the locations of all
occurrences of the element
y.map <- split(1:length(y), match(y, y.unique)) # write a wrapper function that does a look-up on the unique list.
and then returns all matches using the map.
someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }
On Tue, 7 Apr 2015 at 13:21 Herv? Pag?s <hpages at fredhutch.org> wrote:
Hi Keshav, findMatches() in the S4Vectors/IRanges packages (Bioconductor) I
think
does what you want: library(IRanges) y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L) x <- c(unique(y), 999L) hits <- findMatches(x, y) Then:
> hits
Hits object with 9 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 2 2
[3] 3 3
[4] 3 9
[5] 4 4
[6] 4 5
[7] 4 8
[8] 5 6
[9] 6 7
-------
queryLength: 7
subjectLength: 9
The Hits object can be turned into a list with:
> as.list(hits)
[[1]] [1] 1 [[2]] [1] 2 [[3]] [1] 3 9 [[4]] [1] 4 5 8 [[5]] [1] 6 [[6]] [1] 7 [[7]] integer(0) H.
> sessionInfo()
R version 3.2.0 beta (2015-04-05 r68151) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: Ubuntu 14.04.2 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils
datasets
[8] methods base other attached packages: [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11 loaded via a namespace (and not attached): [1] tools_3.2.0 On 04/06/2015 01:56 PM, Keshav Dhandhania wrote:
Hi, I know that one can find all occurrences of x in a vector v by
doing
which(x == v).
However, if I need to do this again and again, where v is remaining
the
same, then this is quite inefficient. In my particular case, I need
to do
this millions of times, and length(v) = 100 million.
Does anyone have suggestion on how to go about it?
I know of a package called fmatch that does the above for the match
function. But they don't handle multiple matches.
Thanks
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319 ______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.