Canberra dist and double zeros

2 messages · Jari Oksanen, Brian Ripley

Jari Oksanen

Tue, Mar 6, 2001 1:16 AM

ripley@stats.ox.ac.uk said:

This means that I probably have to subscribe (momentarily) to R-devel, which I 
have regarded as too technical for a non-developer like me.


ripley@stats.ox.ac.uk said:

I am not sure either: it is right for me in my present applications, but I 
think it may not be right in general.  I used dist() for community data, where 
zero *is* zero (not merely a floating-point number close to zero) and means 
that the species is absent, and of course, all numbers are positive or zero.  
Canberra distance is OK for negative numbers as well, and so x_i = -1, y_i = 1 
would yield 2/0, which probably shouldn't be regarded as zero, but rather as 
NaN.  So a better test would be for an above-zero numerator or explicitly for 
both x_i && y_i.
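
The case analysis above can be sketched in C. This is only an illustration, 
not the code in R: canberra_term is a hypothetical helper, and it assumes each 
term has the form |x_i - y_i| / |x_i + y_i|, which is what the 2/0 example 
implies.

```c
#include <math.h>

/* Hypothetical helper: one Canberra term |x - y| / |x + y|.
 * Assumed convention (the suggestion in the paragraph above):
 *   0/0        -> contributes 0 (double zero)
 *   nonzero/0  -> NaN (e.g. x = -1, y = 1)
 */
double canberra_term(double x, double y)
{
    double num = fabs(x - y);
    double den = fabs(x + y);

    if (den > 0.0)
        return num / den;        /* ordinary case */
    return (num > 0.0) ? NAN     /* 2/0-style term */
                       : 0.0;    /* 0/0: x and y are both zero */
}
```

Testing the numerator when the denominator is zero is what separates the 
double zero from a pair such as -1 and 1.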

ripley@stats.ox.ac.uk said:

I don't know, and I don't have Lance & Williams 1967 to check. However, more 
recent papers by Canberra people do *not* increment count for double-zeros 
(Faith, Minchin, Belbin 1987. Compositional dissimilarity as a robust measure 
of ecological distance. Vegetatio 69, 57-68.).  I have no idea about the 
really *correct* solution or what are the arguments for incrementing or not 
incrementing count. At least not incrementing means that count varies with 
pairs of observations instead of being a simple down-scaling by a constant for 
the entire matrix.  However, probably the original Lance & Williams choice was 
to increment only for sum > 0.  Some other people may have better libraries to 
check both the choice and the argument (I may have a look there, but I would 
be surprised if I find Aust. Comput. J. 1, 15-20 here).  Deciding whether to 
increment count would need a test for an above-zero denominator, which begins 
to look like ugly coding if we need a test for the numerator as well.

In community ecology data, the number of species per site (= non-zero values 
per column) is a valid statistic of something, but the total number of species 
in a data set (= number of rows in the matrix) increases with the size of the 
sample set.  So the larger the data set is, the more the data are infested 
with zeros.  I guess this is the argument here for incrementing only for 
non-double-zeros: the count depends only on the pair compared, not on 
other observations not involved in this comparison.  On the other hand, I do 
not understand why you need to divide at all instead of using only the sum 
(this formulation occurs as well in literature).
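
The point about the count depending only on the pair compared can be 
illustrated with a small sketch.  The function name canberra_sum_skip00 is 
hypothetical, and the sketch assumes non-negative community data, so the 
denominator is zero only for a double zero.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: Canberra sum over all terms that are not double
 * zeros.  Assumes x[i] >= 0 and y[i] >= 0 (community data), so the
 * denominator is positive whenever the pair is not a double zero.
 * *count receives the number of terms actually used. */
double canberra_sum_skip00(const double *x, const double *y,
                           size_t n, size_t *count)
{
    double sum = 0.0;

    *count = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] == 0.0 && y[i] == 0.0)
            continue;            /* species absent from both sites */
        sum += fabs(x[i] - y[i]) / fabs(x[i] + y[i]);
        (*count)++;
    }
    return sum;
}
```

Appending species that are absent from both sites changes neither the sum nor 
the count, so under this convention the result for a pair does not depend on 
which other sites are in the data set.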

As a quick solution with the original 1.2.2 code, I replaced:

Class 'dist'  atomic [1:153] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
....
with a dirty hack:

Class 'dist'  atomic [1:153] 49.2 61.7 50.9 52.1 60.1 ...

which certainly increments count for every pair, although it shouldn't.


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Brian Ripley

Tue, Mar 6, 2001 1:40 AM

On Tue, 6 Mar 2001, Jari Oksanen wrote:

We'll keep you on the Cc: list.  Normally things like this are on R-devel,
as they are specialized.

I think it should be Inf, and was going to comment that was another
problem.

Note count is only relevant if count < nc, and the code in 1.2.2 is wrong:
it should have been

    if(count != nc) dist /= ((double)count/nc);

Fortunately, it was never used.

You do anyway to get 2/0 different from 0/0.  We can code any solution,
and this is simple and clean compared to, say, scan.c!

I am going to implement that x1=x2=0 is equivalent to missing, and that
x1=+1, x2=-1 gives 2/0 = Inf.
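
A minimal sketch of the behaviour described in this message, for illustration
only.  It is not the code that went into R: the function name is hypothetical,
the terms are assumed to be |x - y| / |x + y|, and returning NaN when every
pair is a double zero is an assumption of the sketch.

```c
#include <math.h>

/* Hypothetical sketch of the plan above:
 *   x = y = 0        -> treated as missing (skipped, not counted)
 *   nonzero / 0      -> Inf (e.g. x = +1, y = -1)
 *   count < nc       -> rescale by count/nc, as in the corrected line
 */
double canberra_sketch(const double *x, const double *y, int nc)
{
    double dist = 0.0;
    int count = 0;

    for (int j = 0; j < nc; j++) {
        double num = fabs(x[j] - y[j]);
        double den = fabs(x[j] + y[j]);

        if (num == 0.0 && den == 0.0)
            continue;                    /* double zero: missing */
        dist += (den > 0.0) ? num / den : INFINITY;
        count++;
    }
    if (count == 0)
        return NAN;                      /* assumption: nothing to compare */
    if (count != nc)
        dist /= ((double)count / nc);    /* scale up for skipped terms */
    return dist;
}
```

For x = (1, 0, 2) and y = (3, 0, 2) the sum over the two usable terms is 0.5,
and rescaling by count/nc = 2/3 gives 0.75.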

Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
