From R-core; this should interest most R-devel'ers (to some extent):
Since 0.60, the semantics of as.numeric(<factor>) has changed,
e.g.
R> as.integer(factor(c("A","BB")))
[1] NA NA
R> as.integer(factor(c(100,40,100)))
[1] 100 40 100
whereas older R and S give:
S> as.integer(factor(c("A","BB")))
[1] 1 2
S> as.integer(factor(c(100,40,100)))
[1] 2 1 2
-------------------------------------
as explained by Ross, below :
"KH" == Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at> writes:
Ross Ihaka writes:
KH>> From hornik@ci.tuwien.ac.at Mon Jan 19 22:52 NZD 1998
KH>> Subject: Difference R/S
KH>>
KH>> Andreas just pointed me to the following:
KH>>
KH>> v <- as.factor(c("Age","Number","Age"))
KH>> as.numeric(v)
KH>>
KH>> gives
KH>>
KH>> [1] 1 2 1
KH>>
KH>> in S+ and
KH>>
KH>> [1] NA NA NA
KH>>
KH>> in R.
KH>>
KH>> Bug/feature/intentional?
KH>>
KH>> Of course, R makes more sense because as.numeric("Age") gives NA in
KH>> both R and S+ ...
KH>>
KH>> Or, should we have as.numeric() return the codes on a non-numeric
KH>> factor?
Ross> At present R (implicitly) computes as.numeric(x) for x a factor as
Ross> as.numeric(as.character(x))
Ross> and S computes
Ross> codes(x)
Ross> I mistakenly thought that S does what I have implemented for R.
Ross> Thomas first objected to the difference and then said he quite liked
Ross> it.
Ross> I quite like the present semantics, but it is easy to change if
Ross> others have different preferences.
KH> I personally think that the current R approach makes more sense,
KH> too. If we all agree on it, I would like to add the difference to
KH> the FAQ, so that it is (well) documented.
Hmm, I had at first advocated your view above myself.
Later, I started to discover how much S code uses
as.numeric(ff)
just to extract the factor codes (in {1:M}) from a factor.
This led me (and Peter Dalgaard, I think) to the conclusion that
- yes, the present R behavior may be ``cleaner'' than S's
- no, it is a pain to keep it, because it breaks S code too often.
However, as you see, we haven't agreed yet on the topic.
I think we should agree ASAP, since it involves code in several places
(outside R base).
Martin
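
For anyone who wants to try the two computations Ross describes side by
side, here is a minimal sketch. It assumes a current R release, where
(as far as I know) codes() is no longer provided and as.integer() on a
factor returns the internal codes; the names via_labels and via_codes
are only illustrative.

```r
v <- factor(c("Age", "Number", "Age"))

# What Ross describes for R 0.60: coerce the labels themselves.
# Non-numeric labels become NA (with a coercion warning, silenced here).
via_labels <- suppressWarnings(as.numeric(as.character(v)))

# What S computes with codes(x): the position of each element's
# level within levels(v).
via_codes <- as.integer(v)

print(levels(v))   # "Age" "Number"
print(via_labels)  # NA NA NA
print(via_codes)   # 1 2 1
```

The first result is the [1] NA NA NA reported for R in Kurt's message, the
second is the [1] 1 2 1 reported for S+.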
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Martin Maechler <maechler@stat.math.ethz.ch> writes:
> ...
> Hmm, I had at first advocated your view above myself.
> Later, I started to discover how much S code uses
> as.numeric(ff)
> just to extract the factor codes (in {1:M}) from a factor.
> This led me (and Peter Dalgaard, I think) to the conclusion that
> - yes, the present R behavior may be ``cleaner'' than S's
> - no, it is a pain to keep it, because it breaks S code too often.
> However, as you see, we haven't agreed yet on the topic.
> I think we should agree ASAP, since it involves code in several places
> (outside R base).
Actually, I'm even more strongly in favour of the S semantics. In
addition to the above:
- you can always get the current behaviour with
  as.numeric(as.character(f)) or as.numeric(levels(f))[f]
- one should avoid generating NA's unless absolutely necessary
- when a factor is used for subscripting, you mean the codes,
  not the levels. Currently, we have
  R> (1:5)[factor(1:5,labels=5:1)]
  [1] 1 2 3 4 5
  but
  R> as.numeric(factor(1:5,labels=5:1))
  [1] 5 4 3 2 1
  I.e. *sometimes* when a factor is coerced to numeric you get
  something different. (And if you change the index semantics,
  code for trend tests and the like is likely to break!)
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
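
The two idioms Peter mentions, and his subscripting example, can be tried
as follows. This is only a sketch, assuming a current R release; the
variable names are made up for illustration.

```r
f <- factor(1:5, labels = 5:1)   # codes 1..5, labels "5".."1"

# Both idioms recover the numeric value of each element's label.
by_character <- as.numeric(as.character(f))
by_levels    <- as.numeric(levels(f))[f]   # converts each level only once

# Subscripting with a factor uses the codes, not the labels.
by_index <- (1:5)[f]

print(by_character)  # 5 4 3 2 1
print(by_levels)     # 5 4 3 2 1
print(by_index)      # 1 2 3 4 5
```

The second idiom converts length(levels(f)) strings rather than length(f)
strings, which may matter for long factors with few levels.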
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> Actually, I'm even more strongly in favour of the S semantics. In
> addition to the above:
> - you can always get the current behaviour with
>   as.numeric(as.character(f)) or as.numeric(levels(f))[f]
> - one should avoid generating NA's unless absolutely necessary
Right :-)
> - when a factor is used for subscripting, you mean the codes,
>   not the levels. Currently, we have
>   R> (1:5)[factor(1:5,labels=5:1)]
>   [1] 1 2 3 4 5
>   but
>   R> as.numeric(factor(1:5,labels=5:1))
>   [1] 5 4 3 2 1
>   I.e. *sometimes* when a factor is coerced to numeric you get
>   something different. (And if you change the index semantics,
>   code for trend tests and the like is likely to break!)
But that is really a matter of how subscripting treats factors, and not
necessarily what coercion does.
As much as I am in favor of compatibility (remember I do a lot of
porting):
* Suppose f is a factor with numeric levels other than 1 to n. Then
as.numeric(f) returning the codes rather than the levels is strange.
* You also cannot coerce a character vector to numeric without getting
NA's.
Btw:
x <- factor(c(10, 5, 6, 7))
Then levels(x) gives the CHARACTER vector c("5", "6", "7", "10") [in
both R and S+], why that?
And:
R> codes(x)
[1] 4 1 2 3
S> codes(x)
[1] 1 2 3 4
???
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> But that is really a matter of how subscripting treats factors, and not
> necessarily what coercion does.
Well, the point was that one could say that there's an implicit
coercion involved in indexing. (Note BTW that in general
A[as.numeric(as.character(f))] !=
A[as.character(f)] !=
A[codes(f)] == A[f])
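
The chain above can be checked on a small case. A sketch, assuming a
current R release, with as.integer(f) standing in for codes(f); the
vector A and the factor f are invented for the purpose.

```r
# A named vector whose names look like numbers but do not match positions.
A <- c("2" = 10, "3" = 20, "1" = 30)

# Levels given in the order "3", "2", so the codes are 1, 2, 1.
f <- factor(c("3", "2", "3"), levels = c("3", "2"))

by_value <- unname(A[as.numeric(as.character(f))])  # positions 3, 2, 3
by_name  <- unname(A[as.character(f)])              # names "3", "2", "3"
by_code  <- unname(A[as.integer(f)])                # positions 1, 2, 1
by_fac   <- unname(A[f])                            # factor index: codes again

print(by_value)  # 30 20 30
print(by_name)   # 20 10 20
print(by_code)   # 10 20 10
print(by_fac)    # 10 20 10
```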
> As much as I am in favor of compatibility (remember I do a lot of
> porting):
> * Suppose f is a factor with numeric levels other than 1 to n. Then
> as.numeric(f) returning the codes rather than the levels is strange.
No, it's not. It may come as a surprise to some, but it might as well
be thought strange that the internal numeric codes are not available
via as.numeric. The levels are character strings that may or may not
happen to be convertible to numbers, so why should you expect that the
general procedure assumes that they are convertible?
> * You also cannot coerce a character vector to numeric without getting
> NA's.
This is true, but in that case, there's no obvious alternative.
> Btw:
> x <- factor(c(10, 5, 6, 7))
> Then levels(x) gives the CHARACTER vector c("5", "6", "7", "10") [in
> both R and S+], why that?
By definition, a factor is an integer vector of codes coupled with a
character vector of levels. Where's the problem? We could of course
introduce the possibility of having levels vectors of any type (and
take the pains arising from differences between factor(1:3) and
factor(1:3,labels=as.character(1:3)) ).
Apparently R is defaulting levels=as.character(sort(unique(x))),
whereas S is doing levels=sort(as.character(unique(x))), so that 10
sorts alphabetically before 5,6,7...
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
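
The two default orderings Peter guesses at can be compared directly. A
sketch only, assuming a current R release; method = "radix" is used so
that the string sort follows C-locale order regardless of the session's
locale, and the names numeric_first and character_first are illustrative.

```r
x <- c(10, 5, 6, 7)

# Sort the numbers first, then convert to character.
numeric_first <- as.character(sort(unique(x)))

# Convert to character first, then sort the strings.
character_first <- sort(as.character(unique(x)), method = "radix")

print(numeric_first)    # "5" "6" "7" "10"
print(character_first)  # "10" "5" "6" "7"

# Codes under each ordering of the levels.
print(as.integer(factor(x, levels = numeric_first)))    # 4 1 2 3
print(as.integer(factor(x, levels = character_first)))  # 1 2 3 4
```

The two sets of codes agree with the R> and S> outputs of codes(x) quoted
in the earlier message.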