dist() {"mva" package} bug: treats +/- Inf as NA - R-devel

Mon, Oct 21, 2002 8:42 AM #

Vince Carey found this (thank you!).
Since the fix to the problem is not entirely obvious, I post
this to R-devel as RFC:

help(dist)  says:

 >>  Missing values are allowed, and are excluded from all computations
 >>  involving the rows within which they occur.  If some columns are
 >>  excluded in calculating a Euclidean, Manhattan or Canberra
 >>  distance, the sum is scaled up proportionally to the number of
 >>  columns used. If all pairs are excluded when calculating a
 >>  particular distance, the value is `NA'.

but the C code in  ....../src/library/mva/src/distance.c,
has, e.g. for the Euclidean distance :

    count= 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
	if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
	    dev = (x[i1] - x[i2]);
	    dist += dev * dev;
	    count++;
	}
	i1 += nr;
	i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);

where it is clear that "R_FINITE(*)" should in principle be
replaced by  "!ISNAN(*)".

Note however that "Inf - Inf -> NaN" and e.g the canberra metric
has more ways to get "NaN".

The current code drops all pairs with an +-Inf in it.
I would be inclined to really replace
	if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
by
	if(!ISNAN(x[i1])   && !ISNAN(x[i2])) {

for all metrics -- for R-patched.

But I'd also see reasons where we'd want to be smarter/different
than that. One possibility would be to drop the pair (as
currently) also when both are not finite, 
for "binary" to signal an error for +- Inf,
and for "Canberra" :
    Of course,  d =  |x - y| / |x + y| 
    will be 1 , when one of {x,y} is infinite.
This could be considered the desired answser, or also
we may give a warning in any case.

Opinions?


Martin Maechler <maechler@stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Brian Ripley

Mon, Oct 21, 2002 9:44 AM #

I think this is definitely better left as is (and definitely so for
1.6.1), perhaps documenting the fact.  What is done seems quite sensible
to me: you really can't have infinite values in an L_2 or L_1 space, and
so Euclidean and L_1 (= Manhattan) distances are not defined.

One could argue for just one infinity point, with zero distance between
infinity points and infinite to all finite points, for example (and NA to
missing values).  It really isn't just one answer.

On Mon, 21 Oct 2002, Martin Maechler wrote:

Vince Carey found this (thank you!).
Since the fix to the problem is not entirely obvious, I post
this to R-devel as RFC:

help(dist)  says:

 >>  Missing values are allowed, and are excluded from all computations
 >>  involving the rows within which they occur.  If some columns are
 >>  excluded in calculating a Euclidean, Manhattan or Canberra
 >>  distance, the sum is scaled up proportionally to the number of
 >>  columns used. If all pairs are excluded when calculating a
 >>  particular distance, the value is `NA'.

but the C code in  ....../src/library/mva/src/distance.c,
has, e.g. for the Euclidean distance :

    count= 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
	if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
	    dev = (x[i1] - x[i2]);
	    dist += dev * dev;
	    count++;
	}
	i1 += nr;
	i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);

where it is clear that "R_FINITE(*)" should in principle be
replaced by  "!ISNAN(*)".

Note however that "Inf - Inf -> NaN" and e.g the canberra metric
has more ways to get "NaN".

The current code drops all pairs with an +-Inf in it.
I would be inclined to really replace
	if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
by
	if(!ISNAN(x[i1])   && !ISNAN(x[i2])) {

for all metrics -- for R-patched.

But I'd also see reasons where we'd want to be smarter/different
than that. One possibility would be to drop the pair (as
currently) also when both are not finite,
for "binary" to signal an error for +- Inf,
and for "Canberra" :
    Of course,  d =  |x - y| / |x + y|
    will be 1 , when one of {x,y} is infinite.
This could be considered the desired answser, or also
we may give a warning in any case.

Opinions?


Martin Maechler <maechler@stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Martin Maechler

Tue, Oct 22, 2002 5:29 AM #

BDR> I think this is definitely better left as is (and
    BDR> definitely so for 1.6.1), perhaps documenting the fact.

I don't agree. Whenever there's only one +/- Inf in a pair, the
result is well-defined for most metrics, (often = +Inf, but for Canberra).
The current behavior of silently treating Infs the same as NA is
not acceptable IMO.

Vincent's example was basically

   dist(1:4, c(1e10, 2:4))
vs
   dist(1:4, c(Inf, 2:4))

where the 2nd should be the "limit" of the first, or at the
least should not silently give the equivalent of
   dist(1:4, c(NA, 2:4))

    BDR> 1.6.1), perhaps documenting the fact.  What is done
    BDR> seems quite sensible to me: you really can't have
    BDR> infinite values in an L_2 or L_1 space, and so
    BDR> Euclidean and L_1 (= Manhattan) distances are not
    BDR> defined.

    BDR> One could argue for just one infinity point, 
you mean, when for a scalar pair, one is +/- Inf, the other not?
Yes, this is exactly my point.

    BDR> with zero distance between infinity points and infinite to all
    BDR> finite points, for example (and NA to missing values).

    BDR> It really isn't just one answer.

I think it's not hard to do the right thing in the cases where
that's well defined (e.g. when there's at most one Inf per pair),
and
use the current behavior {dropping a pair} in the other
cases, namely in exactly those where IEEE-arithmetic (excuse the
misnomer) would return a NaN.

Martin

BDR> On Mon, 21 Oct 2002, Martin Maechler wrote:

>> Vince Carey found this (thank you!).
    >> Since the fix to the problem is not entirely obvious, I post
    >> this to R-devel as RFC:
    >> 
    >> help(dist)  says:
    >> 
    >> >>  Missing values are allowed, and are excluded from all computations
    >> >>  involving the rows within which they occur.  If some columns are
    >> >>  excluded in calculating a Euclidean, Manhattan or Canberra
    >> >>  distance, the sum is scaled up proportionally to the number of
    >> >>  columns used. If all pairs are excluded when calculating a
    >> >>  particular distance, the value is `NA'.
    >> 
    >> but the C code in  ....../src/library/mva/src/distance.c,
    >> has, e.g. for the Euclidean distance :
    >> 
    >> count= 0;
    >> dist = 0;
    >> for(j = 0 ; j < nc ; j++) {
    >> if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
    >> dev = (x[i1] - x[i2]);
    >> dist += dev * dev;
    >> count++;
    >> }
    >> i1 += nr;
    >> i2 += nr;
    >> }
    >> if(count == 0) return NA_REAL;
    >> if(count != nc) dist /= ((double)count/nc);
    >> return sqrt(dist);
    >> 
    >> where it is clear that "R_FINITE(*)" should in principle be
    >> replaced by  "!ISNAN(*)".
    >> 
    >> Note however that "Inf - Inf -> NaN" and e.g the canberra metric
    >> has more ways to get "NaN".
    >> 
    >> The current code drops all pairs with an +-Inf in it.
    >> I would be inclined to really replace
    >> if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
    >> by
    >> if(!ISNAN(x[i1])   && !ISNAN(x[i2])) {
    >> 
    >> for all metrics -- for R-patched.
    >> 
    >> But I'd also see reasons where we'd want to be smarter/different
    >> than that. One possibility would be to drop the pair (as
    >> currently) also when both are not finite,
    >> for "binary" to signal an error for +- Inf,
    >> and for "Canberra" :
    >> Of course,  d =  |x - y| / |x + y|
    >> will be 1 , when one of {x,y} is infinite.
    >> This could be considered the desired answser, or also
    >> we may give a warning in any case.
    >> 
    >> Opinions?
    >> 
    >> 
    >> Martin Maechler <maechler@stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
    >> Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
    >> ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
    >> phone: x-41-1-632-3408		fax: ...-1228			<><

    BDR> -- 
    BDR> Brian D. Ripley,                  ripley@stats.ox.ac.uk
    BDR> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
    BDR> University of Oxford,             Tel:  +44 1865 272861 (self)
    BDR> 1 South Parks Road,                     +44 1865 272860 (secr)
    BDR> Oxford OX1 3TG, UK                Fax:  +44 1865 272595













-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._