as.numeric(<factor>) [Difference R/S]

4 messages · Martin Maechler, Peter Dalgaard, Kurt Hornik

#
Since 0.60, the semantics of as.numeric(<factor>) have changed,
e.g.

R> as.integer(factor(c("A","BB")))
[1] NA NA
R> as.integer(factor(c(100,40,100)))
[1] 100  40 100

whereas older R and S:

S> as.integer(factor(c("A","BB")))
[1] 1 2
S> as.integer(factor(c(100,40,100)))
[1] 2 1 2


-------------------------------------
as explained by Ross, below:
    KH>> From hornik@ci.tuwien.ac.at Mon Jan 19 22:52 NZD 1998
    KH>> Subject: Difference R/S
    KH>> 
    KH>> Andreas just pointed me to the following:
    KH>> 
    KH>> v <- as.factor(c("Age","Number","Age"))
    KH>> as.numeric(v)
    KH>> 
    KH>> gives
    KH>> 
    KH>> [1] 1 2 1
    KH>> 
    KH>> in S+ and
    KH>> 
    KH>> [1] NA NA NA
    KH>> 
    KH>> Bug/feature/intentional?
    KH>> 
    KH>> Of course, R makes more sense because as.numeric("Age") gives NA in
    KH>> both R and S+ ...
    KH>> 
    KH>> Or, should we have as.numeric() return the codes on a non-numeric
    KH>> factor?

    Ross> At present R (implicitly) computes as.numeric(x) for x a factor as

    Ross> 	as.numeric(as.character(x))

    Ross> and S computes

    Ross> 	codes(x)

    Ross> I mistakenly thought that S does what I have implemented for R.
    Ross> Thomas first objected to the difference and then said he quite liked
    Ross> it.

    Ross> I quite like the present semantics, but it is easy to change if
    Ross> others have different preferences.

    KH> I personally think that the current R approach makes more sense,
    KH> too.  If we all agree on it, I would like to add the difference to
    KH> the FAQ, so that it is (well) documented.

Hmm,  I first had advocated your view above, myself.

Later, I started to discover in how much S-code
	as.numeric(ff) 
is just used to extract the factor codes (in {1:M})  from a factor.

This led me (and Peter Dalgaard, I think) to the conclusion that
- yes, the present R behavior may be ``cleaner'' than S's
- no, it is a pain to keep it, because it breaks S code too often.

However, as you see, we haven't agreed yet on the topic.
I think we should agree ASAP, since it involves code in several places
(outside R base).

Martin
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
Martin Maechler <maechler@stat.math.ethz.ch> writes:
...
Actually, I'm even stronger in favour of the S semantics. In addition
to the above

	- you can always get current behaviour with
        as.numeric(as.character(f)) or as.numeric(levels(f))[f]

	- one should avoid generating NA's unless absolutely necessary

	- when a factor is used for subscripting, you mean the codes,
        not the levels. Currently, we have
[1] 1 2 3 4 5

but
[1] 5 4 3 2 1

	I.e. *sometimes* when a factor is coerced to numeric you get
	something different. (And if you change the index semantics,
	code for trend tests and the like is likely to break!).
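
The two idioms named in the first bullet above can be sketched as follows. This is only an illustration: the data and the names `f`, `g`, `via_chars`, `via_levels` and `g_values` are made up here and do not come from the thread.

```r
# Hypothetical example data, not taken from the thread.
f <- factor(c(100, 40, 100))

# Idiom 1: go through the level labels element by element.
via_chars <- as.numeric(as.character(f))

# Idiom 2: convert the (few) levels once, then index by the factor;
# subscripting with a factor uses its integer codes.
via_levels <- as.numeric(levels(f))[f]

# Both recover the level values of the factor.
print(via_chars)
print(via_levels)

# With non-numeric levels the same idiom yields NA (and a coercion
# warning, suppressed here).
g <- factor(c("A", "BB"))
g_values <- suppressWarnings(as.numeric(as.character(g)))
print(g_values)
```

The second idiom converts each level only once, which may matter for long factors with few levels.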
#
R> as.integer(factor(c("A","BB")))
R> as.integer(factor(c(100,40,100)))
S> as.integer(factor(c("A","BB")))
S> as.integer(factor(c(100,40,100)))
Right :-)
But that is really a matter of how subscripting treats factors, and not
necessarily what coercion does.

As much as I am in favor of compatibility (remember I do a lot of
porting):

* Suppose f is a factor with numeric levels other than 1 to n.  Then
as.numeric(f) returning the codes rather than the levels is strange.

* You also cannot coerce a character vector to numeric without getting
NA's.

Btw:

	x <- factor(c(10, 5, 6, 7))

Then levels(x) gives the CHARACTER vector c("5", "6", "7", "10") [in
both R and S+], why that?

And:

R> codes(x)
[1] 4 1 2 3

S> codes(x)
[1] 1 2 3 4

???
#
Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at> writes:
Well, the point was that one could say that there's an implicit
coercion involved in indexing. (Note BTW that in general
A[as.numeric(as.character(f))] != 
A[as.character(f)] !=
A[codes(f)] == A[f])
No, it's not. It may come as a surprise to some, but it might as well
be thought strange that the internal numeric codes are not available
via as.numeric. The levels are character strings that may or may not
happen to be convertible to numbers, so why should you expect that the
general procedure assumes that they are convertible?
This is true, but in that case, there's no obvious alternative.
By definition, a factor is an integer vector of codes coupled with a
character vector of levels. Where's the problem? We could of course
introduce the possibility of having levels vectors of any type (and
take the pains arising from differences between factor(1:3) and
factor(1:3,labels=as.character(1:3)) ).
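
The "integer codes plus character levels" description can be inspected directly. A minimal sketch in R, reusing the "Age"/"Number" example quoted earlier in the thread; it uses unclass() to look at the stored codes, so it does not depend on which as.numeric() semantics is in force. The names `f` and `codes` are chosen here for illustration.

```r
# The example quoted earlier in the thread.
f <- factor(c("Age", "Number", "Age"))

# Drop the class to see the stored integer codes
# (they still carry a "levels" attribute).
codes <- unclass(f)

print(typeof(f))         # storage type of the factor
print(as.vector(codes))  # the bare codes, here in 1:2
print(levels(f))         # the character vector of levels
```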
Apparently R is defaulting levels=as.character(sort(unique(x))),
whereas S is doing levels=sort(as.character(unique(x))), so that 10
sorts alphabetically before 5,6,7...
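
The two orderings described above can be compared with a short sketch. This is not the factor() source of either system, only the two expressions as written in the paragraph; the variable names are hypothetical, and the character sort assumes the usual collation in which "10" comes before "5".

```r
x <- c(10, 5, 6, 7)

# Sort as numbers, then convert: the default attributed to R above.
numeric_first <- as.character(sort(unique(x)))

# Convert, then sort as strings: the default attributed to S above.
character_first <- sort(as.character(unique(x)))

# The codes are the positions of each value in the level vector.
codes_numeric_first   <- match(as.character(x), numeric_first)
codes_character_first <- match(as.character(x), character_first)

print(numeric_first)
print(codes_numeric_first)
print(character_first)
print(codes_character_first)
```

Under that collation assumption the first variant gives the codes 4 1 2 3 and the second 1 2 3 4, which matches the two codes(x) outputs quoted earlier in the thread.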