dist function in R is very slow

There are two main reasons for the relatively low speed of the built-in dist() function: (i) it operates on row vectors, which leads to many cache misses because matrices are stored by column in R (as you guessed); (ii) the function takes care to handle missing values correctly, which adds a relatively expensive test and conditional branch to each iteration of the inner loop.
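To make point (ii) concrete, here is a small sketch of what the NA handling buys you. The toy matrix x and the variable names are made up for illustration. dist() skips the coordinate with the missing value and scales the sum up in proportion to the number of columns actually used, whereas a plain vectorised formula just returns NA.

```r
# Two row vectors; the second one has a missing value in column 3.
x <- rbind(a = c(0, 0, 0, 0),
           b = c(1, 1, NA, 1))

# dist() drops column 3 for this pair and rescales the sum of squares
# by 4/3 (columns in total / columns used), so the result is sqrt(4).
d <- dist(x)
as.numeric(d)   # 2

# A naive computation without the NA test propagates the missing value.
d.naive <- sqrt(sum((x["a", ] - x["b", ])^2))
d.naive         # NA
```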

A faster implementation, which omits the NA test and can compute distances between column vectors, is available as dist.matrix() in the "wordspace" package.  However, it cannot be used with matrices that might contain NAs (and doesn't warn about such arguments).
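A rough sketch of how the two functions might be compared, assuming the "wordspace" package is installed (the code skips that part otherwise). The random test matrix and the variable names are only for illustration, and the arguments method= and byrow= are used as I understand them from the package documentation, so please check ?dist.matrix before relying on this.

```r
set.seed(1)
# 100 row vectors with 20 dimensions, no missing values.
M <- matrix(rnorm(100 * 20), nrow = 100,
            dimnames = list(paste0("r", 1:100), paste0("c", 1:20)))

# Built-in implementation, converted to a full symmetric matrix.
d.base <- as.matrix(dist(M))

if (requireNamespace("wordspace", quietly = TRUE)) {
  # Euclidean distances between the rows of M (full matrix result).
  d.fast <- wordspace::dist.matrix(M, method = "euclidean")
  # Largest absolute difference to dist(); should be close to zero.
  print(max(abs(as.vector(d.base) - as.vector(d.fast))))

  # Distances between the column vectors instead of the rows.
  d.cols <- wordspace::dist.matrix(M, method = "euclidean", byrow = FALSE)
}
```

Since the function does not warn about missing values, it may be worth checking the input with anyNA(M) first.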

If you want the best possible speed, use cosine similarity (or equivalently, angular distance).  The underlying cross product is very efficient with a suitable BLAS implementation.
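In base R this could look roughly as follows. It is a minimal sketch that assumes a dense numeric matrix without NAs and without all-zero rows; the random test data and the variable names are made up.

```r
set.seed(42)
# 200 row vectors with 50 dimensions.
M <- matrix(rnorm(200 * 50), nrow = 200)

# Normalise every row to unit Euclidean length.
M.norm <- M / sqrt(rowSums(M^2))

# Cosine similarities between all pairs of rows: a single matrix
# cross product, which is delegated to the BLAS library.
cos.sim <- tcrossprod(M.norm)

# Clamp rounding errors to [-1, 1], then convert to angular distance
# in degrees.
cos.sim <- pmin(pmax(cos.sim, -1), 1)
ang.dist <- acos(cos.sim) * 180 / pi
```

For unit-length vectors the Euclidean distance follows directly from the cosine as sqrt(2 - 2 * cos.sim), so the same cross product can also stand in for dist() on normalised data.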

Best,
Stefan