Vince Carey found this (thank you!).
Since the fix to the problem is not entirely obvious, I post
this to R-devel as RFC:
help(dist) says:
>> Missing values are allowed, and are excluded from all computations
>> involving the rows within which they occur. If some columns are
>> excluded in calculating a Euclidean, Manhattan or Canberra
>> distance, the sum is scaled up proportionally to the number of
>> columns used. If all pairs are excluded when calculating a
>> particular distance, the value is `NA'.
but the C code in ....../src/library/mva/src/distance.c,
has, e.g. for the Euclidean distance :
count= 0;
dist = 0;
for(j = 0 ; j < nc ; j++) {
if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
dev = (x[i1] - x[i2]);
dist += dev * dev;
count++;
}
i1 += nr;
i2 += nr;
}
if(count == 0) return NA_REAL;
if(count != nc) dist /= ((double)count/nc);
return sqrt(dist);
where it is clear that "R_FINITE(*)" should in principle be
replaced by "!ISNAN(*)".
Note however that "Inf - Inf -> NaN" and e.g the canberra metric
has more ways to get "NaN".
The current code drops all pairs with an +-Inf in it.
I would be inclined to really replace
if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
by
if(!ISNAN(x[i1]) && !ISNAN(x[i2])) {
for all metrics -- for R-patched.
But I'd also see reasons where we'd want to be smarter/different
than that. One possibility would be to drop the pair (as
currently) also when both are not finite,
for "binary" to signal an error for +- Inf,
and for "Canberra" :
Of course, d = |x - y| / |x + y|
will be 1 , when one of {x,y} is infinite.
This could be considered the desired answser, or also
we may give a warning in any case.
Opinions?
Martin Maechler <maechler@stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
dist() {"mva" package} bug: treats +/- Inf as NA
3 messages · Brian Ripley, Martin Maechler
I think this is definitely better left as is (and definitely so for 1.6.1), perhaps documenting the fact. What is done seems quite sensible to me: you really can't have infinite values in an L_2 or L_1 space, and so Euclidean and L_1 (= Manhattan) distances are not defined. One could argue for just one infinity point, with zero distance between infinity points and infinite to all finite points, for example (and NA to missing values). It really isn't just one answer.
On Mon, 21 Oct 2002, Martin Maechler wrote:
Vince Carey found this (thank you!). Since the fix to the problem is not entirely obvious, I post this to R-devel as RFC: help(dist) says:
>> Missing values are allowed, and are excluded from all computations >> involving the rows within which they occur. If some columns are >> excluded in calculating a Euclidean, Manhattan or Canberra >> distance, the sum is scaled up proportionally to the number of >> columns used. If all pairs are excluded when calculating a >> particular distance, the value is `NA'.
but the C code in ....../src/library/mva/src/distance.c,
has, e.g. for the Euclidean distance :
count= 0;
dist = 0;
for(j = 0 ; j < nc ; j++) {
if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
dev = (x[i1] - x[i2]);
dist += dev * dev;
count++;
}
i1 += nr;
i2 += nr;
}
if(count == 0) return NA_REAL;
if(count != nc) dist /= ((double)count/nc);
return sqrt(dist);
where it is clear that "R_FINITE(*)" should in principle be
replaced by "!ISNAN(*)".
Note however that "Inf - Inf -> NaN" and e.g the canberra metric
has more ways to get "NaN".
The current code drops all pairs with an +-Inf in it.
I would be inclined to really replace
if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
by
if(!ISNAN(x[i1]) && !ISNAN(x[i2])) {
for all metrics -- for R-patched.
But I'd also see reasons where we'd want to be smarter/different
than that. One possibility would be to drop the pair (as
currently) also when both are not finite,
for "binary" to signal an error for +- Inf,
and for "Canberra" :
Of course, d = |x - y| / |x + y|
will be 1 , when one of {x,y} is infinite.
This could be considered the desired answser, or also
we may give a warning in any case.
Opinions?
Martin Maechler <maechler@stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272860 (secr) Oxford OX1 3TG, UK Fax: +44 1865 272595 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
"BDR" == Brian D Ripley <ripley@stats.ox.ac.uk>
on Mon, 21 Oct 2002 17:44:53 +0100 (BST) writes:
BDR> I think this is definitely better left as is (and
BDR> definitely so for 1.6.1), perhaps documenting the fact.
I don't agree. Whenever there's only one +/- Inf in a pair, the
result is well-defined for most metrics, (often = +Inf, but for Canberra).
The current behavior of silently treating Infs the same as NA is
not acceptable IMO.
Vincent's example was basically
dist(1:4, c(1e10, 2:4))
vs
dist(1:4, c(Inf, 2:4))
where the 2nd should be the "limit" of the first, or at the
least should not silently give the equivalent of
dist(1:4, c(NA, 2:4))
BDR> 1.6.1), perhaps documenting the fact. What is done
BDR> seems quite sensible to me: you really can't have
BDR> infinite values in an L_2 or L_1 space, and so
BDR> Euclidean and L_1 (= Manhattan) distances are not
BDR> defined.
BDR> One could argue for just one infinity point,
you mean, when for a scalar pair, one is +/- Inf, the other not?
Yes, this is exactly my point.
BDR> with zero distance between infinity points and infinite to all
BDR> finite points, for example (and NA to missing values).
BDR> It really isn't just one answer.
I think it's not hard to do the right thing in the cases where
that's well defined (e.g. when there's at most one Inf per pair),
and
use the current behavior {dropping a pair} in the other
cases, namely in exactly those where IEEE-arithmetic (excuse the
misnomer) would return a NaN.
Martin
BDR> On Mon, 21 Oct 2002, Martin Maechler wrote:
>> Vince Carey found this (thank you!).
>> Since the fix to the problem is not entirely obvious, I post
>> this to R-devel as RFC:
>>
>> help(dist) says:
>>
>> >> Missing values are allowed, and are excluded from all computations
>> >> involving the rows within which they occur. If some columns are
>> >> excluded in calculating a Euclidean, Manhattan or Canberra
>> >> distance, the sum is scaled up proportionally to the number of
>> >> columns used. If all pairs are excluded when calculating a
>> >> particular distance, the value is `NA'.
>>
>> but the C code in ....../src/library/mva/src/distance.c,
>> has, e.g. for the Euclidean distance :
>>
>> count= 0;
>> dist = 0;
>> for(j = 0 ; j < nc ; j++) {
>> if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
>> dev = (x[i1] - x[i2]);
>> dist += dev * dev;
>> count++;
>> }
>> i1 += nr;
>> i2 += nr;
>> }
>> if(count == 0) return NA_REAL;
>> if(count != nc) dist /= ((double)count/nc);
>> return sqrt(dist);
>>
>> where it is clear that "R_FINITE(*)" should in principle be
>> replaced by "!ISNAN(*)".
>>
>> Note however that "Inf - Inf -> NaN" and e.g the canberra metric
>> has more ways to get "NaN".
>>
>> The current code drops all pairs with an +-Inf in it.
>> I would be inclined to really replace
>> if(R_FINITE(x[i1]) && R_FINITE(x[i2])) {
>> by
>> if(!ISNAN(x[i1]) && !ISNAN(x[i2])) {
>>
>> for all metrics -- for R-patched.
>>
>> But I'd also see reasons where we'd want to be smarter/different
>> than that. One possibility would be to drop the pair (as
>> currently) also when both are not finite,
>> for "binary" to signal an error for +- Inf,
>> and for "Canberra" :
>> Of course, d = |x - y| / |x + y|
>> will be 1 , when one of {x,y} is infinite.
>> This could be considered the desired answser, or also
>> we may give a warning in any case.
>>
>> Opinions?
>>
>>
>> Martin Maechler <maechler@stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
>> Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
>> ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
>> phone: x-41-1-632-3408 fax: ...-1228 <><
BDR> --
BDR> Brian D. Ripley, ripley@stats.ox.ac.uk
BDR> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
BDR> University of Oxford, Tel: +44 1865 272861 (self)
BDR> 1 South Parks Road, +44 1865 272860 (secr)
BDR> Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._