This means that I probably have to subscribe (momentarily) to R-devel, which I
have regarded as too technical for a non-developer like me.
ripley@stats.ox.ac.uk said:
I am sure we should do something, but is this exactly right?
I am not sure either: it is right for me in my present applications, but I
think it may not be right in general. I used dist() for community data, where
zero *is* zero (not only approximately zero floating point number) and means
that the species is absent, and of course, all numbers are positive or zeros.
Canberra distance is OK for negative numbers as well, and so x_i = -1, y_i = 1
would yield 2/0, which probably shouldn't be regarded as zero, but rather as
NaN. So a better test would be for an above-zero numerator, or explicitly for
both x_i && y_i.
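The three cases in the paragraph above can be told apart by testing both the numerator and the denominator. The following is only a minimal sketch, assuming the term is |x_i - y_i| / |x_i + y_i| as in this discussion; the enum and the function name classify_term are hypothetical and are not taken from the R sources.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical labels for the three cases discussed above. */
typedef enum {
    TERM_DOUBLE_ZERO, /* x == 0 and y == 0, i.e. 0/0 */
    TERM_UNDEFINED,   /* nonzero numerator over zero denominator, e.g. 2/0 */
    TERM_REGULAR      /* nonzero denominator, ordinary term */
} term_kind;

/* Classify one Canberra term |x - y| / |x + y|.
 * Assumption: neither input is NaN (missing values are handled elsewhere). */
term_kind classify_term(double x, double y)
{
    assert(!isnan(x) && !isnan(y));
    double num = fabs(x - y);
    double den = fabs(x + y);

    if (den > 0.0)
        return TERM_REGULAR;
    /* den == 0 means x == -y; a zero numerator then implies x == y == 0. */
    return (num > 0.0) ? TERM_UNDEFINED : TERM_DOUBLE_ZERO;
}
```

With community data (all values non-negative) only the first and third cases can occur; the second appears only when negative values are allowed.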
ripley@stats.ox.ac.uk said:
The issue is whether count should be incremented when sum == 0.0.
I don't know, and I don't have Lance & Williams 1967 to check. However, more
recent papers by Canberra people do *not* increment count for double-zeros
(Faith, Minchin, Belbin 1987. Compositional dissimilarity as a robust measure
of ecological distance. Vegetatio 69, 57-68.). I have no idea about the
really *correct* solution or what are the arguments for incrementing or not
incrementing count. At least not incrementing means that count varies with
pairs of observations instead of being a simple down-scaling by a constant for
the entire matrix. However, probably the original Lance & Williams choice was
to increment only for sum > 0. Some other people may have better libraries to
check both the choice and the argument (I may have a look there, but I would
be surprised if I find Aust. Comput. J. 1, 15-20 here). Deciding whether to
increment count would need a test for an above-zero denominator, which begins
to look like ugly coding if we need a test for the numerator as well.
In community ecology data, the number of species per site (= non-zero values
per column) is a valid statistic of something, but the total number of species
in a data set (= number of rows in the matrix) increases with the size of the
sample set. So the larger the data set, the more infested with zeros the data
are. I guess this is the argument here for incrementing only for
non-double-zeros: the count then depends only on the pair compared, not on
other observations that are not involved in this comparison. On the other
hand, I do not understand why you need to divide at all instead of using only
the sum (this formulation occurs in the literature as well).
As a quick solution with the original 1.2.2 code, I replaced:
str(dist(kasvit, method="can"))
Class 'dist' atomic [1:153] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
....
with a dirty hack:
Class 'dist' atomic [1:153] 49.2 61.7 50.9 52.1 60.1 ...
which certainly increments count for every pair, although it shouldn't.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
This means that I probably have to subscribe (momentarily) to R-devel, which I
have regarded as too technical for a non-developer like me.
We'll keep you on the Cc: list. Normally things like this are on R-devel,
as they are specialized.
ripley@stats.ox.ac.uk said:
I am sure we should do something, but is this exactly right?
I am not sure either: it is right for me in my present applications, but I
think it may not be right in general. I used dist() for community data, where
zero *is* zero (not only approximately zero floating point number) and means
that the species is absent, and of course, all numbers are positive or zeros.
Canberra distance is OK for negative numbers as well, and so x_i = -1, y_i = 1
would yield 2/0, which probably shouldn't be regarded as zero, but rather as
NaN. So a better test would be for an above-zero numerator, or explicitly for
both x_i && y_i.
I think it should be Inf, and I was going to comment that this was another
problem.
ripley@stats.ox.ac.uk said:
The issue is whether count should be incremented when sum == 0.0.
I don't know, and I don't have Lance & Williams 1967 to check. However, more
recent papers by Canberra people do *not* increment count for double-zeros
(Faith, Minchin, Belbin 1987. Compositional dissimilarity as a robust measure
of ecological distance. Vegetatio 69, 57-68.). I have no idea about the
really *correct* solution or what are the arguments for incrementing or not
incrementing count. At least not incrementing means that count varies with
pairs of observations instead of being a simple down-scaling by a constant for
the entire matrix. However, probably the original Lance & Williams choice was
to increment only for sum > 0.
Note count is only relevant if count < nc, and the code in 1.2.2 is wrong:
it should have been
if(count != nc) dist /= ((double)count/nc);
Fortunately, it was never used.
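To make the effect of that corrected line concrete: dividing by count/nc scales a sum over count usable terms up to what it would be over all nc terms. This is a minimal sketch only; the function name rescale_partial_sum is hypothetical, and returning NaN when count is zero is an assumption rather than something stated in this thread.

```c
#include <assert.h>
#include <math.h>

/* Scale a distance summed over `count` usable terms up to `nc` terms,
 * using the corrected expression quoted above. */
double rescale_partial_sum(double dist, int count, int nc)
{
    assert(nc > 0 && count >= 0 && count <= nc);

    if (count == 0)
        return NAN; /* assumption: no usable terms, so no distance */
    if (count != nc)
        dist /= ((double)count / nc);
    return dist;
}
```

For example, a sum of 3.0 over 3 of 4 terms becomes 3.0 / 0.75 = 4.0, while a sum over all nc terms is returned unchanged.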
Some other people may have better libraries to
check both the choice and the argument (I may have a look there, but I would
be surprised if I find Aust. Comput. J. 1, 15-20 here). Deciding whether to
increment count would need a test for an above-zero denominator, which begins
to look like ugly coding if we need a test for the numerator as well.
You need that test anyway to tell 2/0 apart from 0/0. We can code any solution,
and this is simple and clean compared to, say, scan.c!
I am going to implement that x1=x2=0 is equivalent to missing, and that
x1=+1, x2=-1 gives 2/0 = Inf.
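The behaviour described above could look roughly like the following. It is a sketch under stated assumptions, not the code that went into R: the name canberra_sketch is hypothetical, the term is taken to be |x1 - x2| / |x1 + x2| as in this thread, NA handling is left out, and returning NaN when every term is a double zero is an assumption.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Canberra-type distance between two vectors of length nc.
 * Double zeros are skipped as if missing; a nonzero numerator over a
 * zero denominator contributes Inf. */
double canberra_sketch(const double *x, const double *y, int nc)
{
    assert(x != NULL && y != NULL && nc > 0);

    double dist = 0.0;
    int count = 0;

    for (int i = 0; i < nc; i++) {
        double num = fabs(x[i] - y[i]);
        double den = fabs(x[i] + y[i]);

        if (num == 0.0 && den == 0.0)
            continue; /* x1 = x2 = 0: treated as missing, count unchanged */

        /* Explicit test, so the sketch does not rely on IEEE division by zero. */
        dist += (den == 0.0) ? INFINITY : num / den;
        count++;
    }

    if (count == 0)
        return NAN; /* assumption: nothing left to compare */
    if (count != nc)
        dist /= ((double)count / nc); /* scale up for the skipped terms */
    return dist;
}
```

For x = (1, 0, 3) and y = (3, 0, 1) the two usable terms sum to 1.0, and scaling by 3/2 gives 1.5. A pair such as x = (-1, 2), y = (1, 2) gives Inf.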
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595