Canberra distance

10 messages · Christophe Genolini, Duncan Murdoch, Jari Oksanen +4 more

#
Hi the list,

According to what I know, the Canberra distance between X and Y is: sum[ 
(|x_i - y_i|) / (|x_i|+|y_i|) ] (with | | denoting the 'absolute value' 
function).
In the source code of the canberra distance in the file distance.c, we 
find :

    sum = fabs(x[i1] + x[i2]);
    diff = fabs(x[i1] - x[i2]);
    dev = diff/sum;

which corresponds to the formula: sum[ (|x_i - y_i|) / (|x_i+y_i|) ]
(Note that this does not define a distance. It is correct when x_i 
and y_i are positive, but not when a value is negative.)

Is it on purpose or is it a bug?

Christophe
#
On 06/02/2010 10:39 AM, Christophe Genolini wrote:
It matches the documentation in ?dist, so it's not just a coding error. 
  It will give the same value as your definition if the two items have 
the same sign (not only both positive), but different values if the 
signs differ.
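
To make the same-sign versus mixed-sign behaviour concrete, here is a minimal C sketch of both formulas. The function names canberra_abs_each and canberra_abs_sum are hypothetical, and the handling of zero terms (skip a term when x_i equals y_i, return infinity when x_i = -y_i is non-zero) is an assumption of this sketch, not a description of what distance.c does.

```c
#include <math.h>
#include <stddef.h>

/* Textbook form from the thread: sum |x_i - y_i| / (|x_i| + |y_i|).
   Terms with x_i == y_i contribute 0 and are skipped, which also
   avoids 0/0 when both values are zero (assumption of this sketch). */
double canberra_abs_each(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = fabs(x[i] - y[i]);
        if (diff == 0.0)
            continue;
        /* diff != 0 implies at least one value is non-zero, so denom > 0 */
        dist += diff / (fabs(x[i]) + fabs(y[i]));
    }
    return dist;
}

/* Form matching the quoted C snippet: sum |x_i - y_i| / |x_i + y_i|. */
double canberra_abs_sum(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = fabs(x[i] - y[i]);
        double denom = fabs(x[i] + y[i]);
        if (diff == 0.0)
            continue;
        if (denom == 0.0)
            return INFINITY;   /* x_i == -y_i, non-zero: term is unbounded */
        dist += diff / denom;
    }
    return dist;
}
```

For x = (1, 2) and y = (3, 4) both functions return 2/4 + 2/6. For the single pair x = 1, y = -3 the first returns 4/4 = 1 and the second returns 4/2 = 2.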

The first three links I found searching Google Scholar for "Canberra 
distance" all define it only for non-negative data.  One of them gave 
exactly the R formula (even though the absolute value in the denominator 
is redundant), the others just put x_i + y_i in the denominator.

None of the 3 papers cited the origin of the definition, so I can't tell 
you who is wrong.

Duncan Murdoch
#
The definition I use is the one found in the book "Cluster Analysis" by 
Brian Everitt, Sabine Landau and Morven Leese.
As the defining paper for the Canberra distance, they cite an article by 
Lance and Williams, "Computer programs for hierarchical polythetic 
classification", Computer Journal, 1966.
I do not have access, but here is the link : 
http://comjnl.oxfordjournals.org/cgi/content/abstract/9/1/60
Hope this helps.

Christophe
#
I guess there is also a problem in the binary distance since

x <- y <- rep(0,10)
dist(rbind(x,y),method="binary")

gives 0, whereas it is supposed to be undefined. (The binary distance, aka 
asymmetric binary, is not supposed to take the (off,off) pairs into account 
in its calculation.)
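
As an illustration of the "undefined" reading, here is a minimal C sketch of an asymmetric binary dissimilarity that ignores (off,off) pairs and returns NaN when no pair has an "on" value. The name binary_asymmetric and the choice of NaN as the "undefined" marker are assumptions of this sketch; it is not the code used by dist().

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: proportion of mismatches among the positions
   where at least one of the two vectors is "on" (non-zero).
   (off,off) positions are ignored entirely. */
double binary_asymmetric(const double *x, const double *y, size_t n)
{
    size_t either_on = 0;   /* positions with at least one "on" */
    size_t mismatched = 0;  /* positions with exactly one "on"  */

    for (size_t i = 0; i < n; i++) {
        int x_on = (x[i] != 0.0);
        int y_on = (y[i] != 0.0);
        if (x_on || y_on) {
            either_on++;
            if (x_on != y_on)
                mismatched++;
        }
    }
    if (either_on == 0)
        return NAN;         /* nothing to compare: treated as undefined */
    return (double)mismatched / (double)either_on;
}
```

With two all-zero vectors this returns NaN rather than 0; with x = (1, 1, 0, 0) and y = (1, 0, 1, 0) it returns 2/3.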

Christophe
#
On 06/02/2010 11:31 AM, Christophe Genolini wrote:
I do have access to that journal, and that paper gives the definition

sum(|x_i - y_i|) / sum(x_i + y_i)

and suggests the variation

sum( |x_i - y_i| / (x_i + y_i) )

It doesn't call either one the Canberra distance; it calls the first one 
the "non-metric coefficient" and doesn't name the second.  (I imagine 
the Canberra name came from the fact that the authors were at CSIRO in 
Canberra.)

So I'd agree your definition is better, but I don't know if it can 
really be called the "Canberra distance".
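
The two forms from the paper differ only in where the division happens, which a minimal C sketch can show. The names ratio_of_sums and sum_of_ratios are hypothetical, non-negative input is assumed as in the paper's setting, and the zero-denominator handling (NaN for the first form, skipping the term in the second) is an assumption of this sketch.

```c
#include <math.h>
#include <stddef.h>

/* First form: sum(|x_i - y_i|) / sum(x_i + y_i), for non-negative data. */
double ratio_of_sums(const double *x, const double *y, size_t n)
{
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += fabs(x[i] - y[i]);
        den += x[i] + y[i];
    }
    return den > 0.0 ? num / den : NAN;  /* all zeros: left undefined */
}

/* Second form: sum( |x_i - y_i| / (x_i + y_i) ), for non-negative data. */
double sum_of_ratios(const double *x, const double *y, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double den = x[i] + y[i];
        if (den > 0.0)                   /* skip (0,0) pairs */
            total += fabs(x[i] - y[i]) / den;
    }
    return total;
}
```

For x = (1, 2, 3) and y = (3, 2, 1) the first form gives 4/12 and the second gives 2/4 + 0 + 2/4 = 1, so the two are clearly different quantities.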

Duncan Murdoch
#
On 06/02/2010 18:10, "Duncan Murdoch" <murdoch at stats.uwo.ca> wrote:

            
G'day cobbers, 

Without checking the original sources (that I can't do before Monday), I'd
say that the "Canberra distance" was originally suggested only for
non-negative data (abundances of organisms which are non-negative if
observed directly). The fabs(x-y) notation was used just as a convenient
tool to get rid of the original pmin(x,y) for non-negative data -- which is
nice in R, but not so natural in C. Extension of the "Canberra distance" to
negative data probably makes a new distance perhaps deserving a new name
(Eureka distance?).

If you ever go to Canberra and drive around you'll see that it's all going
through a roundabout after a roundabout, and going straight somewhere means
goin' 'round 'n' 'round. That may make you skeptical about the "Canberra
distance". 

Cheers, Jazza Oksanen
#
That is interesting.  The first of these, namely

sum(|x_i - y_i|) / sum(x_i + y_i)

is now better known in ecology as the Bray-Curtis distance. Even more interesting is the typo in Henry & Stevens "A Primer of Ecology in R", where the Bray-Curtis distance formula is actually the Canberra distance (Eq. 10.2, p. 289). There seems to be a certain slipperiness of definition in this field.

What surprises me most is why ecologists still cling to this way of doing things. It is one of the few places I know of where the analysis is justified purely heuristically and not from any kind of explicit model for the ecological processes under study.

Bill Venables.
#
This is certainly ancient R history.  The essence of the formula was 
last changed
-	    dist += fabs(x[i1] - x[i2])/(x[i1] + x[i2]);
+	    dist += fabs(x[i1] - x[i2])/fabs(x[i1] + x[i2]);

in October 1998.  The help page description came later.

The
            dist += fabs(x[i1] - x[i2])/(x[i1] + x[i2]);
form was there as 'canberra' in the first CVS archive in September 
1997 (as src/library/mva/src/dist.c) so it looks like one of R&R was 
the original author and this could be called pre-history.
On Sun, 7 Feb 2010, Bill.Venables at csiro.au wrote:
#
<Bill.Venables <at> csiro.au> writes:

  Actually, the author is M. H. Henry (Hank) Stevens, not "Henry & Stevens" ...

  Ben Bolker
#
<Bill.Venables <at> csiro.au> writes:

Thank you for bringing to my attention the similarity of the Canberra and
Bray-Curtis quantitative indices. Bray-Curtis dissimilarity can also, of course,
be defined as 

1 - 2w/(a+b) 

where w is sum of the minimum of each relevant pair of values, and a and b are
the totals for sites a and b, respectively. These definitions appear to yield
similar results, and to better reflect the original work by Bray and Curtis, I
should probably define their distance as they did!
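
A short derivation suggests the two forms agree exactly rather than only approximately. The symbols follow the definitions above, with a = sum of x_i, b = sum of y_i and w = sum of min(x_i, y_i); the only assumption added here is a + b > 0.

```latex
\begin{align*}
|x_i - y_i| &= x_i + y_i - 2\min(x_i, y_i), \\
\sum_i |x_i - y_i| &= a + b - 2w, \\
\frac{\sum_i |x_i - y_i|}{\sum_i (x_i + y_i)}
  &= \frac{a + b - 2w}{a + b}
   = 1 - \frac{2w}{a + b}.
\end{align*}
```

The first line is the same identity that lets fabs(x - y) stand in for pmin(x, y).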

Cheers,

Martin Henry Hoffman Stevens (a.k.a. Hank)