Compare two data sets
The easiest way to find out is to try it and time it. Here is a case where I generated two sets of 120,000 character strings each (just random numbers converted to character strings) and then asked for their intersection. It came up with 3 matches in about 0.2 seconds. That would seem fast enough, unless you plan to do this operation tens of thousands of times:
x <- as.character(runif(120000))
y <- as.character(runif(120000))
system.time(z <- intersect(x,y))
   user  system elapsed
   0.22    0.00    0.22
str(z)
chr [1:3] "0.289942682255059" "0.75132836541161" "0.638638160191476"
Here is the timing when there are close to 50,000 matches (48,908 in this run); it is about the same:
x <- as.character(round(runif(120000),5))
y <- as.character(round(runif(120000),5))
system.time(z <- intersect(x,y))
user system elapsed
0.2 0.0 0.2
str(z)
chr [1:48908] "0.08385" "0.62639" "0.47603" "0.18578" "0.89447" "0.58435" "0.15297" ...
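If you need the matching rows rather than just the set of common values, the same idea can be sketched with %in%. This is only an illustration under assumptions: the data frames and the key column name 'id' are made up here, with 'myref' standing in for the 112189-row reference file and 'mysample' for the smaller sample.

```r
# Hypothetical data: 'myref' mimics a 112189-row reference file and
# 'mysample' a smaller sample; the column name 'id' is an assumption.
set.seed(1)
myref    <- data.frame(id = as.character(seq_len(112189)),
                       stringsAsFactors = FALSE)
mysample <- data.frame(id = as.character(sample(200000, 1000)),
                       stringsAsFactors = FALSE)

# %in% returns one TRUE/FALSE per row of mysample, so it can subset rows
in_ref  <- mysample$id %in% myref$id
matched <- mysample[in_ref, , drop = FALSE]

# intersect() gives the same set of keys, but without the row positions
common <- intersect(mysample$id, myref$id)
```

Wrapping either call in system.time(), as in the examples above, should show whether the larger reference file is a problem on your machine.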
On Tue, Mar 25, 2008 at 10:28 PM, Suhaila Zainudin
<suhaila.zainudin at gmail.com> wrote:
Hi,

Thanks for the feedback. I have tried it on the small size sample and ref and it works.

Now I want to use a larger dataset for myref (the reference file). The reference file contains 112189 rows. Can I use the same approach that works for the small example? Or are there other alternatives when dealing with data of that magnitude?

--
Suhaila Zainudin
PhD Candidate
Universiti Teknologi Malaysia
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?