factor() on a double vector

Hi,

When 'x' is a vector of doubles, it's not clear how 'factor(x)'
compares its values in order to determine the levels. For example,
here all the values in 'x' are "conceptually" the same:

   x <- c(11/3,
          2/3 + 4/3 + 5/3,
          50 + 11/3 - 50,
          7.00001 - 1000003/300000)

However, due to machine rounding errors, they are not strictly equal:

   > duplicated(x)
   [1] FALSE FALSE FALSE FALSE
   > unique(x)
   [1] 3.666667 3.666667 3.666667 3.666667

but they are nearly equal:

   > all.equal(x, rep(11/3, 4))
   [1] TRUE
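
A small sketch to make the two notions of equality concrete. It prints each value with 17 significant digits (enough to tell any two doubles apart) and compares the relative differences against sqrt(.Machine$double.eps), which is the documented default tolerance of all.equal(). The elementwise test is only an approximation of what all.equal() does (it works on a mean relative difference), and the variable names are made up for illustration.

```r
x <- c(11/3,
       2/3 + 4/3 + 5/3,
       50 + 11/3 - 50,
       7.00001 - 1000003/300000)

# 17 significant digits are enough to distinguish any two distinct doubles
print(sprintf("%.17g", x))

# Strict equality: exact comparison against the first element
strictly_equal <- x == x[1]

# Near equality: relative difference below the default all.equal() tolerance
tol <- sqrt(.Machine$double.eps)            # about 1.5e-8
rel_diff <- abs(x - x[1]) / abs(x[1])
nearly_equal <- rel_diff < tol

print(data.frame(strictly_equal, rel_diff, nearly_equal))
```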

Now factor(), and therefore table() (which seems to be using factor()
internally), have a different opinion:

   > factor(x)
   [1] 3.66666666666667 3.66666666666667 3.66666666666666 3.66666666666667
   Levels: 3.66666666666666 3.66666666666667

   > table(x)
   x
   3.66666666666666 3.66666666666667
                  1                3

So factor() seems to use neither "strict equality" nor "near
equality" to determine the levels. What does it use? Sorry if I
missed it, but I couldn't find any information about this in its
man page.
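
One guess that is consistent with the output above, offered as an assumption rather than something the man page states: factor() may build its levels from as.character(x), and as.character() represents doubles with 15 significant digits. Under that assumption, values that differ only beyond the 15th digit would end up in the same level. The sketch below checks the guess; 'lev_guess' is a name made up for illustration.

```r
x <- c(11/3,
       2/3 + 4/3 + 5/3,
       50 + 11/3 - 50,
       7.00001 - 1000003/300000)

# Character representation of each value (15 significant digits)
chr <- as.character(x)

# Guessed levels: distinct strings, taken in increasing order of x
lev_guess <- unique(chr[order(x)])

f <- factor(x)

# Both comparisons should be TRUE if the guess matches what factor() does
print(identical(levels(f), lev_guess))
print(identical(as.integer(f), match(chr, lev_guess)))
```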

Wouldn't it be better if factor() were consistent with either
duplicated() or all.equal(), instead of introducing its own way
of comparing doubles that lies somewhere in between?
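
In the meantime, either behaviour can be approximated by hand. This is only a sketch: 'f_strict' and 'f_near' are illustrative names, the choice of 7 significant digits is arbitrary, and rounding with signif() is not the same thing as an all.equal()-style tolerance (two nearly equal values that straddle a rounding boundary would still land in different levels).

```r
x <- c(11/3,
       2/3 + 4/3 + 5/3,
       50 + 11/3 - 50,
       7.00001 - 1000003/300000)

# Strict equality, consistent with duplicated()/unique():
# integer codes come from match(), labels keep 17 significant digits
ux <- sort(unique(x))
f_strict <- factor(match(x, ux),
                   levels = seq_along(ux),
                   labels = sprintf("%.17g", ux))

# Near equality (approximation): round before building the factor
f_near <- factor(signif(x, 7))

print(table(f_strict))
print(table(f_near))
```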

Cheers,
H.

> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.12.0