Why do we have to turn factors into characters for various functions?
Hello Petr,
I don't want to convince you, if you like the following:
x <- factor(1:4, labels=c("one", "two", "three", "four"))
y <- factor(3:5, labels=c("three", "four", "five"))
data.frame(character=c(as.character(x), as.character(y)), numeric=c(x, y))
character numeric
1 one 1
2 two 2
3 three 3
4 four 4
5 three 1
6 four 2
7 five 3
For me the behaviour of character vectors is easier to follow and less error-prone.
cx <- c("one", "two", "three", "four")
cy <- c("three", "four", "five")
c(cx, cy)
[1] "one" "two" "three" "four" "three" "four" "five"
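For what it is worth, the two views can be reconciled. The following is only a sketch, assuming one wants the combined labels back as a factor regardless of how c() treats factors in the installed R version (as far as I know, c() on factors returns a factor from R 4.1.0 on, while older versions return the integer codes). The names all_levels and xy are made up for illustration.

```r
x <- factor(1:4, labels = c("one", "two", "three", "four"))
y <- factor(3:5, labels = c("three", "four", "five"))

# Combine on the labels, not on the integer codes, and rebuild a factor
# whose levels are the union of both level sets (order of first appearance).
all_levels <- union(levels(x), levels(y))   # hypothetical helper name
xy <- factor(c(as.character(x), as.character(y)), levels = all_levels)

as.character(xy)
levels(xy)
```

Going through as.character() first means the result does not depend on what c() does with factors.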
Anyway it is maybe more about personal habits than about bad factor "features"
I agree with you regarding personal habits. It's not the features of factors. For me it's the rather inconsistent use in functions like c() or print(). If you print a factor, you see its levels, but if you combine it using c(), you combine the famous implementation-specific underlying integer vector.
best regards,
Heinz
At 13.12.2010 08:50 +0100, Petr PIKAL wrote:
Hi
r-help-bounces at r-project.org wrote on 12.12.2010 21:00:37:
At 12.12.2010 00:48 +0200, Tal Galili wrote:
Hello dear R-help mailing list,
My question is *not* about how factors are implemented in R (which is, if I understand correctly, that factors keep numbers and assign levels to them).
My question *is* about why so many functions that work on factors don't treat them as characters by default?
Here are two simple examples:
Example one, turning the characters inside a factor into numeric:
x <- factor(4:6)
as.numeric(x) # output: 1 2 3
as.numeric(as.character(x)) # output: 4 5 6 # isn't this what we wanted?
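A small sketch of the conversion suggested in the help page ?factor, which goes through the levels instead of through as.character(). The variable names codes and values are only illustrative.

```r
x <- factor(4:6)

# The internal integer codes, which is what as.numeric(x) returns.
codes <- as.integer(x)

# Convert the (few) levels to numbers once, then index by the factor.
# Indexing with a factor uses its integer codes.
values <- as.numeric(levels(x))[x]

codes   # 1 2 3
values  # 4 5 6
```

This gives the same numbers as as.numeric(as.character(x)) but converts each distinct level only once.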
Example two, using strsplit on a factor:
x <- factor(paste(letters[4:6], 4:6, sep="A"))
strsplit(x, "A") # will result in an error: # Error in strsplit(x, "A") : non-character argument
strsplit(as.character(x), "A") # will work and split
So what is the reason this is the case? Is it that implementing a switch of factors to characters as the default in some of the basic functions will cause old code to break? Is it a better design in some other way? I am curious to know the reason for this.
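For the strsplit() case, one possible workaround is to split the levels once and then expand the result by the factor codes. This is a sketch only, not something taken from the R sources; parts_by_level and parts are hypothetical names.

```r
x <- factor(paste(letters[4:6], 4:6, sep = "A"))

# Split each distinct level once ...
parts_by_level <- strsplit(levels(x), "A", fixed = TRUE)

# ... then pick the pieces for every element via the integer codes.
parts <- parts_by_level[as.integer(x)]

parts
```

For a factor with many repeated values this splits far fewer strings than strsplit(as.character(x), "A") would, while returning the same list.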
In my view the answer can be found implicitly in the language definition.
"Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers. Rather unfortunately users often make use of the implementation in order to make some calculations easier."
It is the "unfortunate" use of factors that seems generally accepted, even if the language definition continues:
"This, however, is an implementation issue and is not guaranteed to hold in all implementations of R."
Personally, like some others, I avoid factors, except in cases where they represent a statistical concept.
On the contrary, I find factors quite useful. Consider the possibility of changing their levels:
set.seed(111)
x <- factor(sample(1:4, 20, replace=TRUE), labels=c("one", "two", "three", "four"))
x
 [1] three three two   three two   two   one   three two   one   three three
[13] one   one   one   two   one   four  two   three
Levels: one two three four
levels(x)[3:4] <- "more"
x
 [1] more more two  more two  two  one  more two  one  more more one  one  one
[16] two  one  more two  more
Levels: one two more
I believe that if x is character, it can also be done, but the factor way seems more convenient to me. I also use point distinction in plots by pch=as.numeric(some.factor) quite often.
Anyway it is maybe more about personal habits than about bad factor "features"
Regards
Petr
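To make the comparison concrete, here is a sketch of the same level merge done both ways on a small fixed vector (no random sampling, so the result does not depend on the R version). The data and the names f and ch are invented for illustration.

```r
labels <- c("three", "two", "four", "one", "three")

# Factor way: assigning the same name to two levels merges them.
f <- factor(labels, levels = c("one", "two", "three", "four"))
levels(f)[3:4] <- "more"

# Character way: replace the matching elements directly.
ch <- labels
ch[ch %in% c("three", "four")] <- "more"

levels(f)        # "one" "two" "more"
as.character(f)  # same values as ch
```

With the factor only the levels are touched, while the character version has to scan every element.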
Certainly I would agree with you that, if only reading the "R Language Definition" and not the documentation of the function factor, one would rather expect functions like as.numeric or strsplit to operate on the levels of a factor and not on the underlying, implementation-specific integer array.
Heinz
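The distinction between the levels and the underlying integer array can be inspected directly. A minimal sketch, with a made-up factor f, showing what a function sees depending on which part it looks at:

```r
f <- factor(c("b", "a", "b"))

typeof(f)        # storage type of the object: "integer"
as.integer(f)    # the underlying codes: 2 1 2
levels(f)        # the level names the codes point to: "a" "b"
as.character(f)  # the labels a user sees when printing: "b" "a" "b"
```

Functions that work on the storage see the codes; functions that convert with as.character() first see the labels.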
Thank you for your reading,
Tal
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.