Hi list,
As far as I know, the Canberra distance between X and Y is: sum[
(|x_i - y_i|) / (|x_i|+|y_i|) ] (with | | denoting the
absolute value function).
In the source code of the Canberra distance in the file distance.c, we
find:
sum = fabs(x[i1] + x[i2]);
diff = fabs(x[i1] - x[i2]);
dev = diff/sum;
which corresponds to the formula: sum[ (|x_i - y_i|) / (|x_i+y_i|) ]
(note that this does not define a distance... This is correct when x_i
and y_i are positive, but not when a value is negative.)
Is this on purpose or is it a bug?
Christophe
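
The remark that the |x_i + y_i| form "does not define a distance" can be checked numerically. The following is a minimal sketch in C, not the code from distance.c: the function names are hypothetical, and skipping terms with a zero denominator is an assumption made only to keep the example finite.

```c
#include <math.h>
#include <stddef.h>

/* Sum of |x_i - y_i| / |x_i + y_i| (the form quoted from distance.c).
   Terms with a zero denominator are skipped: an assumption of this sketch. */
double canberra_abs_of_sum(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++) {
        double denom = fabs(x[i] + y[i]);
        if (denom > 0.0)
            dist += fabs(x[i] - y[i]) / denom;
    }
    return dist;
}

/* Sum of |x_i - y_i| / (|x_i| + |y_i|) (the textbook form in the post).
   Terms with a zero denominator are skipped: an assumption of this sketch. */
double canberra_sum_of_abs(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++) {
        double denom = fabs(x[i]) + fabs(y[i]);
        if (denom > 0.0)
            dist += fabs(x[i] - y[i]) / denom;
    }
    return dist;
}
```

With the one-dimensional points 1, -2 and 3, the |x+y| form gives d(3,-2) = 5 while d(3,1) + d(1,-2) = 0.5 + 3 = 3.5, so the triangle inequality fails. For the same points the |x|+|y| form gives 1 against 0.5 + 1 = 1.5.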
Canberra distance
10 messages · Christophe Genolini, Duncan Murdoch, Jari Oksanen +4 more
On 06/02/2010 10:39 AM, Christophe Genolini wrote:
> Hi list,
> As far as I know, the Canberra distance between X and Y is: sum[
> (|x_i - y_i|) / (|x_i|+|y_i|) ] (with | | denoting the
> absolute value function).
> In the source code of the Canberra distance in the file distance.c, we
> find:
> sum = fabs(x[i1] + x[i2]);
> diff = fabs(x[i1] - x[i2]);
> dev = diff/sum;
> which corresponds to the formula: sum[ (|x_i - y_i|) / (|x_i+y_i|) ]
> (note that this does not define a distance... This is correct when x_i
> and y_i are positive, but not when a value is negative.)
> Is this on purpose or is it a bug?

It matches the documentation in ?dist, so it's not just a coding error. It will give the same value as your definition if the two items have the same sign (not only both positive), but different values if the signs differ.

The first three links I found searching Google Scholar for "Canberra distance" all define it only for non-negative data. One of them gave exactly the R formula (even though the absolute value in the denominator is redundant), the others just put x_i + y_i in the denominator.

None of the 3 papers cited the origin of the definition, so I can't tell you who is wrong.

Duncan Murdoch
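
The point about signs follows from |x + y| = |x| + |y| holding exactly when x and y do not have opposite signs. A small sketch in C of a single term of each formula, with hypothetical function names and assuming the denominator is nonzero:

```c
#include <math.h>

/* One term of the formula documented in ?dist: |x - y| / |x + y|.
   Assumes x + y != 0. */
double term_abs_of_sum(double x, double y)
{
    return fabs(x - y) / fabs(x + y);
}

/* One term of the alternative definition: |x - y| / (|x| + |y|).
   Assumes x and y are not both zero. */
double term_sum_of_abs(double x, double y)
{
    return fabs(x - y) / (fabs(x) + fabs(y));
}
```

For x = -2, y = -5 both functions return 3/7. For x = 2, y = -5 the first returns 7/3 and the second returns 1.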
The definition I use is the one found in the book "Cluster analysis" by Brian Everitt, Sabine Landau and Morven Leese. They cite, as the defining paper for the Canberra distance, an article by Lance and Williams, "Computer programs for hierarchical polythetic classification", Computer Journal 1966. I do not have access, but here is the link: http://comjnl.oxfordjournals.org/cgi/content/abstract/9/1/60 Hope this helps.

Christophe
I guess there is also a problem in the binary distance, since

x <- y <- rep(0,10)
dist(rbind(x,y),method="binary")

gives 0 whereas it is supposed to be undefined (the binary distance, aka asymmetric binary, is not supposed to take the (off,off) pairs into account in its calculation).

Christophe
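
To make the (off,off) issue concrete, here is a hedged sketch in C of an asymmetric binary dissimilarity. It is not the code used by dist(): the function name is hypothetical, and returning NAN for the 0/0 case is a choice made here to mark the value as undefined.

```c
#include <math.h>
#include <stddef.h>

/* Asymmetric binary dissimilarity: among the positions where at least one
   vector is "on" (nonzero), the proportion where exactly one is on.
   (off,off) positions are ignored. If no position is on in either vector
   the ratio is 0/0, and this sketch returns NAN (an assumption, chosen to
   flag the undefined case). */
double binary_dissimilarity(const double *x, const double *y, size_t n)
{
    size_t either_on = 0, only_one_on = 0;
    for (size_t i = 0; i < n; i++) {
        int x_on = (x[i] != 0.0);
        int y_on = (y[i] != 0.0);
        if (x_on || y_on) {
            either_on++;
            if (x_on != y_on)
                only_one_on++;
        }
    }
    if (either_on == 0)
        return NAN;
    return (double)only_one_on / (double)either_on;
}
```

Two all-zero vectors give NAN under this convention, whereas x = (1,0,1,0) and y = (1,1,0,0) give 2/3.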
On 06/02/2010 11:31 AM, Christophe Genolini wrote:
> The definition I use is the one found in the book "Cluster analysis" by Brian Everitt, Sabine Landau and Morven Leese. They cite, as the defining paper for the Canberra distance, an article by Lance and Williams, "Computer programs for hierarchical polythetic classification", Computer Journal 1966. I do not have access, but here is the link: http://comjnl.oxfordjournals.org/cgi/content/abstract/9/1/60 Hope this helps.

I do have access to that journal, and that paper gives the definition

sum(|x_i - y_i|) / sum(x_i + y_i)

and suggests the variation

sum( [ |x_i - y_i| / (x_i + y_i) ] )

It doesn't call either one the Canberra distance; it calls the first one the "non-metric coefficient" and doesn't name the second. (I imagine the Canberra name came from the fact that the authors were at CSIRO in Canberra.) So I'd agree your definition is better, but I don't know if it can really be called the "Canberra distance".

Duncan Murdoch
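
The two formulas from the Lance and Williams paper differ only in where the sum is taken: a ratio of sums versus a sum of ratios. A minimal sketch in C, with hypothetical function names, assuming non-negative data; the handling of zero denominators (NAN in one case, skipped terms in the other) is an assumption of the sketch.

```c
#include <math.h>
#include <stddef.h>

/* Ratio of sums: sum(|x_i - y_i|) / sum(x_i + y_i).
   Returns NAN if the denominator is zero (a choice made for this sketch). */
double ratio_of_sums(const double *x, const double *y, size_t n)
{
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += fabs(x[i] - y[i]);
        den += x[i] + y[i];
    }
    return den != 0.0 ? num / den : NAN;
}

/* Sum of ratios: sum( |x_i - y_i| / (x_i + y_i) ).
   Terms with a zero denominator are skipped (an assumption of this sketch). */
double sum_of_ratios(const double *x, const double *y, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double den = x[i] + y[i];
        if (den != 0.0)
            total += fabs(x[i] - y[i]) / den;
    }
    return total;
}
```

For x = (1,2,3) and y = (3,2,1) the ratio of sums is 4/12 = 1/3, while the sum of ratios is 0.5 + 0 + 0.5 = 1.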
On 06/02/2010 18:10, "Duncan Murdoch" <murdoch at stats.uwo.ca> wrote:
> It matches the documentation in ?dist, so it's not just a coding error. It will give the same value as your definition if the two items have the same sign (not only both positive), but different values if the signs differ.
>
> The first three links I found searching Google Scholar for "Canberra distance" all define it only for non-negative data. One of them gave exactly the R formula (even though the absolute value in the denominator is redundant), the others just put x_i + y_i in the denominator.

G'day cobbers,

Without checking the original sources (that I can't do before Monday), I'd say that the "Canberra distance" was originally suggested only for non-negative data (abundances of organisms which are non-negative if observed directly). The fabs(x-y) notation was used just as a convenient tool to get rid of the original pmin(x,y) for non-negative data -- which is nice in R, but not so natural in C. Extension of the "Canberra distance" to negative data probably makes a new distance perhaps deserving a new name (Eureka distance?).

If you ever go to Canberra and drive around you'll see that it's all going through a roundabout after a roundabout, and going straight somewhere means goin' 'round 'n' 'round. That may make you skeptical about the "Canberra distance".

Cheers, Jazza Oksanen
That is interesting. The first of these, namely

sum(|x_i - y_i|) / sum(x_i + y_i)

is now better known in ecology as the Bray-Curtis distance. Even more interesting is the typo in Henry & Stevens "A Primer of Ecology in R" where the Bray-Curtis distance formula is actually the Canberra distance (Eq. 10.2 p. 289). There seems to be a certain slipperiness of definition in this field.

What surprises me most is why ecologists still cling to this way of doing things. It is one of the few places I know of where the analysis is justified purely heuristically and not from any kind of explicit model for the ecological processes under study.

Bill Venables.
This is certainly ancient R history. The essence of the formula was
last changed
- dist += fabs(x[i1] - x[i2])/(x[i1] + x[i2]);
+ dist += fabs(x[i1] - x[i2])/fabs(x[i1] + x[i2]);
in October 1998. The help page description came later.
The
dist += fabs(x[i1] - x[i2])/(x[i1] + x[i2]);
form was there as 'canberra' in the first CVS archive in September
1997 (as src/library/mva/src/dist.c) so it looks like one of R&R was
the original author and this could be called pre-history.
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
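
The effect of the October 1998 change on a single term can be illustrated as follows. This is a sketch only: the two function names are hypothetical and each body simply wraps one side of the diff shown above.

```c
#include <math.h>

/* Term as in the earlier form: fabs(x - y) / (x + y).
   The result is negative whenever x + y < 0. */
double term_plain_denominator(double x, double y)
{
    return fabs(x - y) / (x + y);
}

/* Term as in the later form: fabs(x - y) / fabs(x + y).
   The result is never negative. */
double term_fabs_denominator(double x, double y)
{
    return fabs(x - y) / fabs(x + y);
}
```

For x = 1 and y = -3 the earlier form contributes 4 / (-2) = -2 to the sum, while the later form contributes 2. For non-negative inputs the two agree.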
<Bill.Venables <at> csiro.au> writes:
> That is interesting. The first of these, namely sum(|x_i - y_i|) / sum(x_i + y_i) is now better known in ecology as the Bray-Curtis distance. Even more interesting is the typo in Henry & Stevens "A Primer of Ecology in R" where the Bray Curtis distance formula is actually the Canberra distance (Eq. 10.2 p. 289). There seems to be a certain slipperiness of definition in this field.

Actually, the author is M. H. Henry (Hank) Stevens, not "Henry & Stevens" ...

Ben Bolker
<Bill.Venables <at> csiro.au> writes:
> That is interesting. The first of these, namely sum(|x_i - y_i|) / sum(x_i + y_i) is now better known in ecology as the Bray-Curtis distance. Even more interesting is the typo in Henry & Stevens "A Primer of Ecology in R" where the Bray Curtis distance formula is actually the Canberra distance (Eq. 10.2 p. 289). There seems to be a certain slipperiness of definition in this field.

Thank you for bringing to my attention the similarity of the Canberra and Bray-Curtis quantitative indices. Bray-Curtis dissimilarity can also, of course, be defined as

1 - 2w/(a+b)

where w is the sum of the minimum of each relevant pair of values, and a and b are the totals for sites a and b, respectively. These definitions appear to yield similar results, and to better reflect the original work by Bray and Curtis, I should probably define their distance as they did!

Cheers,
Martin Henry Hoffman Stevens (a.k.a. Hank)
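
For non-negative data the form 1 - 2w/(a+b) and the form sum(|x_i - y_i|) / sum(x_i + y_i) are algebraically the same quantity, because |x - y| = x + y - 2 min(x, y). A minimal sketch in C that computes both, with hypothetical function names; returning NAN when both totals are zero is an assumption of the sketch.

```c
#include <math.h>
#include <stddef.h>

/* Bray-Curtis as sum(|x_i - y_i|) / sum(x_i + y_i).
   Returns NAN if the total is zero (a choice made for this sketch). */
double bray_curtis_abs_diff(const double *x, const double *y, size_t n)
{
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += fabs(x[i] - y[i]);
        den += x[i] + y[i];
    }
    return den != 0.0 ? num / den : NAN;
}

/* Bray-Curtis as 1 - 2w/(a+b), where w is the sum of pairwise minima
   and a, b are the totals of x and y. Assumes non-negative data.
   Returns NAN if a + b is zero (a choice made for this sketch). */
double bray_curtis_min_form(const double *x, const double *y, size_t n)
{
    double w = 0.0, a = 0.0, b = 0.0;
    for (size_t i = 0; i < n; i++) {
        w += (x[i] < y[i]) ? x[i] : y[i];
        a += x[i];
        b += y[i];
    }
    return (a + b) != 0.0 ? 1.0 - 2.0 * w / (a + b) : NAN;
}
```

For x = (0,5,2) and y = (4,1,2) both functions give 8/14, up to floating-point rounding.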
______________________________________________ R-devel <at> r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel