distance between two matrices
On Wed, 28 Jan 2004, "Hüsing, Johannes" wrote:
Hi all,
Say I have a matrix A with dimension m x 2 and matrix B with
dimension n x 2. I would like to find the row in A that is closest to
each row in B. Here's an example (using a loop):
set.seed(1)
A <- matrix(runif(12), 6, 2) # 6 x 2
B <- matrix(runif(6), 3, 2) # 3 x 2
m <- vector("numeric", nrow(B))
Make the lines below a function of a vector argument and apply it over the rows of B; see ?apply for more info. You'll want to know about apply if you want to avoid loops (which is a good approach).
Unfortunately apply() is a wrapper for a for() loop, so will not help much (if at all).
for(j in 1:nrow(B)) {
  d <- (A[, 1] - B[j, 1])^2 + (A[, 2] - B[j, 2])^2
  m[j] <- which.min(d)
}
You can improve this a bit: see predict.qda.
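For reference, the earlier suggestion to turn the loop body into a function of one row and apply it over the rows of B might be sketched as follows. The helper name nearest_row is made up for illustration, and since apply() loops internally, this is a tidier way to write the loop rather than a faster one.

```r
# Same toy data as in the original post
set.seed(1)
A <- matrix(runif(12), 6, 2)  # 6 x 2
B <- matrix(runif(6), 3, 2)   # 3 x 2

# Hypothetical helper: index of the row of A nearest to the point b
nearest_row <- function(b, A) {
  d <- (A[, 1] - b[1])^2 + (A[, 2] - b[2])^2  # squared Euclidean distances
  which.min(d)
}

# apply() hands each row of B to nearest_row as a vector; A is passed through
m_apply <- apply(B, 1, nearest_row, A = A)

# Reference: the explicit loop from the post
m_loop <- integer(nrow(B))
for (j in seq_len(nrow(B))) {
  d <- (A[, 1] - B[j, 1])^2 + (A[, 2] - B[j, 2])^2
  m_loop[j] <- which.min(d)
}

# Both versions should pick the same rows of A
stopifnot(identical(as.integer(m_apply), m_loop))
```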
All I need is m[]. I would like to accomplish this without using the loop if possible, since for my real data n > 140K and m > 1K. I hope this makes sense.
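One possible way to cut the interpreted-loop overhead, sketched here under the assumption that A is much shorter than B (as in the real data), is to loop over the rows of A instead and let each iteration work on all rows of B at once. The names best_d and closer are illustrative only, and the speed on the real data has not been measured here.

```r
# Same toy data as in the original post
set.seed(1)
A <- matrix(runif(12), 6, 2)  # 6 x 2
B <- matrix(runif(6), 3, 2)   # 3 x 2

best_d <- rep(Inf, nrow(B))   # smallest squared distance found so far, per row of B
m <- integer(nrow(B))         # index of the row of A that achieved it

# Loop over the short dimension (rows of A); each step is vectorised over B
for (i in seq_len(nrow(A))) {
  d <- (B[, 1] - A[i, 1])^2 + (B[, 2] - A[i, 2])^2
  closer <- d < best_d        # strict '<' keeps the first minimum, like which.min
  best_d[closer] <- d[closer]
  m[closer] <- i
}
```

Only vectors of length nrow(B) are held at any time, so no n x m distance matrix is ever formed.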
The thing is, the above approach requires all data to be in main memory. I hope this is not a problem.
A 140K x 2 array takes up 1.6Mb, and R needs 10x that to run at all. Several people have mentioned knn1 as a C-level equivalent of the loops (and I timed it as probably fast enough). Roger Bivand mentioned quadtrees, and that is one of a class of possible solutions if you need extra speed. Which member of that class is suitable depends on the spatial distribution of A and B (viewing the rows as 2D points), but it is hard to do very much better for only around 1000 reference points.
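A minimal sketch of the knn1 route, assuming the class package (part of the VR bundle) is installed: treating each row of A as its own class makes the predicted class for a row of B the index of its nearest row in A. The trick of using the row numbers as class labels is an illustration, not something spelled out in the thread, and exact distance ties may be resolved differently from which.min.

```r
library(class)  # provides knn1()

# Same toy data as in the original post
set.seed(1)
A <- matrix(runif(12), 6, 2)  # 6 x 2 reference points
B <- matrix(runif(6), 3, 2)   # 3 x 2 query points

# One class label per row of A, so the "classification" is a row index
labels <- factor(seq_len(nrow(A)))
nearest <- knn1(train = A, test = B, cl = labels)

# Convert the factor of labels back to integer row indices
m_knn <- as.integer(as.character(nearest))
```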
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595