Match .3 in a sequence
On Tue, Mar 17, 2009 at 10:04:39AM -0400, Stavros Macrakis wrote:
...
1) Factor allows repeated levels, e.g. factor(c(1),c(1,1,1)), with no warning or error.
Yes, this is confusing behavior, since repeated levels are never meaningful.
2) Even from distinct inputs, factor of a numeric vector may generate repeated levels, because it only uses 15 digits.
I think 15 digits is a reasonable choice. A mapping between double precision numbers and character strings with a given decimal precision is never bijective. With 15 digits, we can achieve that every character value has a unique double precision representation, but not vice versa. With 17 digits, we have a unique character string for each double precision number, but not vice versa. Which is better? The specification of as.character() says that numbers are represented with 15 significant digits. So, I think that if as.factor() applied signif(,digits=15) to a numeric vector before determining the levels using sort(unique.default(x)), this could help to eliminate most of the problems without conflicting with the existing specification.
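To illustrate the idea, here is a minimal sketch of rounding to 15 significant digits before collecting the candidate levels. The helper name levels15 is hypothetical and this is not how factor() is implemented; it only shows the effect of signif() on values that differ beyond the 15th digit.

```r
# Hypothetical helper (not part of base R): round to 15 significant
# digits, then take the sorted unique values as candidate levels.
levels15 <- function(x) {
  sort(unique.default(signif(x, digits = 15)))
}

# Four distinct doubles within a few ulps of 0.3
nums <- 0.3 + 2e-16 * c(-2, -1, 1, 2)

length(unique(nums))     # number of distinct doubles before rounding
length(levels15(nums))   # number of distinct values after rounding
```

After rounding, values that agree to 15 significant digits collapse to a single double, so converting them to character can no longer yield several levels for what prints as the same number.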
3) The algorithm to determine the shortest format is inconsistent with the algorithm to actually print, giving pathological cases like 0.3 vs. 0.300000000000000.
I do not exactly understand what you mean by inconsistent. If you do

  nums <- (.3 + 2e-16 * c(-2,-1,1,2))
  options(digits=15)
  for (x in nums) print(x)
  # [1] 0.300000000000000
  # [1] 0.3
  # [1] 0.3
  # [1] 0.300000000000000
  as.character(nums)
  # [1] "0.300000000000000" "0.3"               "0.3"
  # [4] "0.300000000000000"

then print and as.character are consistent. Printing the whole vector behaves differently, since it uses the same format for all numbers.
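The difference between formatting a vector as a whole and formatting its elements one at a time can be sketched with format(). The variable names together and separately are only illustrative, and the exact strings produced may depend on the R version, so no output is shown here.

```r
nums <- 0.3 + 2e-16 * c(-2, -1, 1, 2)

# Formatting the whole vector: one common format is chosen for all
# elements, so every resulting string has the same width.
together <- format(nums, digits = 15)

# Formatting element by element: each number is formatted on its own,
# so the widths are allowed to differ.
separately <- vapply(nums, format, character(1), digits = 15)

nchar(together)
nchar(separately)
```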
The original problem was testing whether a floating-point number was a member of a vector. Rounding and then converting to a factor seems like a very poor way of doing that, even if the above problems were resolved. Comparing with a tolerance seems much more robust, clean, and efficient.
Definitely, using a comparison tolerance is a meaningful approach. Its disadvantage is that the relation abs(x - y) <= eps is not transitive. So, it may also produce confusing results in some situations. I think that one has to choose the right solution depending on the application.

Petr.
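As a rough sketch of both points, membership with a tolerance and the lack of transitivity can be shown as follows. The helper near_in and its default eps = 1e-9 are assumptions made for illustration; an appropriate tolerance depends on the application.

```r
# Hypothetical helper: TRUE if x is within eps of any element of table.
# The default eps is an arbitrary choice for this illustration.
near_in <- function(x, table, eps = 1e-9) {
  any(abs(table - x) <= eps)
}

s <- seq(0, 1, by = 0.1)
0.3 %in% s          # exact comparison; typically FALSE due to rounding
near_in(0.3, s)     # comparison with tolerance: TRUE

# Non-transitivity: p is close to q, q is close to r,
# but p is not close to r.
eps <- 1
p <- 0; q <- 0.75; r <- 1.5
abs(p - q) <= eps   # TRUE
abs(q - r) <= eps   # TRUE
abs(p - r) <= eps   # FALSE
```

Because of the last three lines, grouping values by "closeness" does not define equivalence classes, which is where the confusing results can come from.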