
How to do knn regression?

2 messages · Shengqiao Li, Hans W Borchers

Date: Fri, 19 Sep 2008 07:00:33 +0000 (UTC)
Thanks for your info. But there seem to be problems with using 
knnFinder. knnFinder does not distinguish test data from training data: it 
searches in all the data, so new data with unknown Y's may appear as 
neighbors in the X space. The mask argument does not seem to solve this 
problem. In addition, I have noticed several other possible problems with 
knnFinder:

(1) Ties are ignored.
(2) knnFinder is slower than class::knn when number of 
variables is relatively small, eg. 70. 
(3) Memory leakage.
(4) Maximum distance is small.
(5) One extra column is needed.

I rewrote the knnFinder code to solve the last three problems for other 
purposes in which self-matches are not allowed. But the self-match option 
is not a function parameter; it is a macro variable, so it cannot be 
changed once the library is compiled. For regression, ties should be 
used, so I have to compile two versions. This is not neat.

Any other convenient ways?
9 days later
There are several functions that can be used for 'nearest neighbor'
classification, such as knn and knn1 (in package class), knn3 (caret),
kknn (kknn), ipredknn (ipred), sknn (klaR), or gknn (cba).

To utilize these functions for 'nearest neighbor' regression would be
difficult. There is actually just one knn-like function that can be
applied to continuous target variables:

kknn(kknn)
     uses a formula and looks at the type of the target variable:
     if the target variable is continuous, it will return a regression
     result for each row in the learning set

And there are two functions that simply return the indices
and distances of the k nearest neighbors for further processing:

ann(yaImpute)
    constructs kd- or bd-trees to find k nearest neighbors
    and returns indices and distances of those neighbors
    (it may kill the whole R process when matrices are too big)
    [Remark: Watch out, default distance is sum of squares]

knnFinder(knnFinder)
    constructs a kd-tree to find the k nearest neighbors;
    has so many bugs and quirks that it is almost unusable;
    not maintained anymore (perhaps should be removed from CRAN)

The other approach is to use a distance function and sort 'manually'
to find the nearest neighbors and their values for the target variable.
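As an illustration only (not from this thread), the manual approach could be
sketched as follows; the function name and the tiny data set are hypothetical,
and Python with NumPy stands in for the language-agnostic idea of "compute
distances, sort, average the target of the k nearest":

```python
import numpy as np

def knn_regress(train_X, train_y, test_X, k=3):
    """Predict each test row as the mean target value of its k
    nearest training rows (Euclidean distance); ties at the k-th
    distance are broken arbitrarily by the sort."""
    preds = []
    for x in test_X:
        # distances from this test row to every training row
        d = np.sqrt(((train_X - x) ** 2).sum(axis=1))
        # indices of the k smallest distances
        nn = np.argsort(d)[:k]
        preds.append(train_y[nn].mean())
    return np.array(preds)

# tiny hypothetical example: targets follow y = 2 * x
train_X = np.array([[0.0], [1.0], [2.0], [3.0]])
train_y = np.array([0.0, 2.0, 4.0, 6.0])
print(knn_regress(train_X, train_y, np.array([[1.1]]), k=2))  # -> [3.]
```

Note that, unlike knnFinder, a test row here is never matched against the
test set itself, only against the training rows.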

'dist' itself is not really appropriate, as it can only be applied to 
_one_ matrix, whereas here we need something like dist(A, B). Combining
A and B into one matrix is often infeasible because it needs too much memory.

dists(cba)
    computes a distance matrix between rows of two matrices
    can be a bit slow for very big matrices (slower than 'dist')
    [Rem: default distance is square root of sum of squares]
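The memory problem of stacking A and B can also be sidestepped by building
the cross-distance matrix block-wise; a hedged sketch, again in Python with
NumPy, where cross_dist and the chunk size are illustrative assumptions:

```python
import numpy as np

def cross_dist(A, B, chunk=256):
    """Euclidean distances between rows of A and rows of B,
    computed over row-chunks of A so that A and B never have
    to be combined into one matrix."""
    out = np.empty((A.shape[0], B.shape[0]))
    for start in range(0, A.shape[0], chunk):
        block = A[start:start + chunk]
        # expand |a - b|^2 = |a|^2 + |b|^2 - 2 a.b over the chunk
        sq = ((block ** 2).sum(axis=1)[:, None]
              + (B ** 2).sum(axis=1)[None, :]
              - 2.0 * block @ B.T)
        # clip tiny negatives caused by floating-point error
        out[start:start + chunk] = np.sqrt(np.maximum(sq, 0.0))
    return out

A = np.array([[0.0, 0.0], [3.0, 4.0]])
B = np.array([[0.0, 0.0]])
print(cross_dist(A, B))  # -> [[0.], [5.]]
```

The same chunking idea applies whether the final distance is the square
root of the sum of squares (as in dists) or the plain sum of squares
(as in ann).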

I would appreciate hearing from you if I have missed something.

Hans Werner Borchers
ABB Corporate Research