I'm rather confused by the semantics of factors.
When applied to factors, some functions (whose results are elements of
the original factor argument) return results of class factor, some
return integer vectors, some return character vectors, some give
errors. I understand some but not all of this. Consider:
Preserve factors: `[`, `[[`, sort, unique, subset, head, tapply, rep, rev, by,
sample, expand.grid,
as.matrix(structure(factor(1:3),dim=c(1,3))), data.frame, list
Convert to integers: c, ifelse, cbind/rbind
Convert to characters: intersect, union, setdiff, matrix, array,
matrix(factor(1:3),1,3),
as.matrix(factor(1:3))
Gives error: rle
No error (output of some other type): <, ==, etc.
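For concreteness, a short session illustrating a few of the cases above (this reflects the 2009-era behavior being discussed; note that c() on factors was later changed, in R 4.1.0, to return a factor):

```r
f <- factor(c("b", "a", "b"))

class(sort(f))      # "factor"    -- sort() preserves the factor
class(unique(f))    # "factor"    -- so does unique()
class(union(f, f))  # "character" -- set operations coerce to the level names
f == "a"            # FALSE TRUE FALSE -- comparison works; result is logical
# rle(f)            # error: 'x' must be a vector of an atomic type
```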
In the case of ordered factors:
Preserve factors: quantile (for exact quantiles only)
Gives error: min, cut, range
No error: which.min, pmin, rank
(But some operations which are meaningful only on ordered factors also
give results on unordered factors, without even a warning: which.min,
pmin, rank, quantile.)
The general principle seems to be that if the result can contain only
elements of a single factor, then a factor is returned. I understand
this: it may not be meaningful to mingle factors with different level
sets. But I don't understand what the problem is with rle.
If the result can contain elements from more than one factor, it is
still not clear to me what the principle is for determining whether
the factors are converted to the integers representing them, or to the
characters naming them, or that the operation gives an error.
I also don't understand what is going on with min. min is well-defined
for any class supporting a < operator, but though < works on ordered
factors as do pmin, rank, etc., min does not. And equally strangely,
which.min and rank blithely convert *un*ordered factors to the
integers which happen to represent them, returning what are presumably
meaningless results without giving an error; while pmin appropriately
gives an error.
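The asymmetry is easy to reproduce (again, as of the R versions under discussion; later releases allow min()/max() on ordered factors):

```r
o <- factor(c("low", "high", "mid"),
            levels = c("low", "mid", "high"), ordered = TRUE)

o < "high"     # TRUE FALSE TRUE -- "<" works on an ordered factor
# min(o)       # error here in 2009: 'min' not meaningful for factors

u <- factor(c("b", "a"))  # an *un*ordered factor
which.min(u)   # 2 -- no error; the "minimum" is just the smallest integer code
```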
It is all very confusing. Of course, most of this behavior is
documented and is easily determined by experimentation, but it would
be easier to learn and teach the language if there were some clear
principle underlying all this. What am I missing?
-s
Handling of factors
4 messages · Stavros Macrakis, Thomas Lumley, Peter Dalgaard
On Tue, 20 Jan 2009, Stavros Macrakis wrote:
I'm rather confused by the semantics of factors.
<snip actual confusion>
It is all very confusing. Of course, most of this behavior is documented and is easily determined by experimentation, but it would be easier to learn and teach the language if there were some clear principle underlying all this. What am I missing?
No, it really is confusing. The problem is that there are two conflicting clear principles. Factors could be
- integer variables with labels (similar to value labels in Stata/SPSS or C enums)
- variables that take on values from a pre-specified set, implemented using integer codes (like Pascal enumerated types).
[In fact, there was historically even a third way to view factors, as a way to reduce the memory use of string variables. That's obsolete now.]
That is, the fact that they are small integers can be seen as part of the interface or just as part of the implementation. It's obvious which one is right, but unfortunately it is differently obvious to different people.
AFAIK there has never been a unified policy on this, dating back before R, so different functions behave differently. There have been changes in R over the years, mostly in the direction of making factors more like Pascal enumerations.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
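The two principles can be seen side by side on any factor: the codes are the implementation view, the labels the interface view (the level set here is just an illustration):

```r
sex <- factor(c("male", "female", "male"), levels = c("female", "male"))

as.integer(sex)    # 2 1 2 -- implementation: integer codes indexing into levels()
as.character(sex)  # "male" "female" "male" -- interface: values from a fixed set
levels(sex)        # "female" "male" -- the pre-specified set itself
```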
As a follow-up, I don't see any reason why rle() shouldn't work on factors. There's no ambiguity about what the result should be, and the current implementation in rle() would work on factors if they could get past the pre-test.
-thomas
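The point is easy to verify: running rle() on the codes and mapping back through the levels gives an unambiguous result. A hypothetical wrapper (rle_factor is not a base-R function, just a sketch of what the pre-test currently rules out):

```r
# Hypothetical wrapper: rle() for factors, restoring the factor afterwards.
rle_factor <- function(f) {
  stopifnot(is.factor(f))
  r <- rle(as.integer(f))  # the codes are an atomic vector, so rle() accepts them
  r$values <- factor(levels(f)[r$values], levels = levels(f))
  r
}

f <- factor(c("a", "a", "b", "b", "b", "a"))
rle_factor(f)  # lengths: 2 3 1; values: a b a (still a factor)
```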
Thomas Lumley wrote:
On Tue, 20 Jan 2009, Stavros Macrakis wrote:
I'm rather confused by the semantics of factors.
<snip actual confusion>
It is all very confusing. Of course, most of this behavior is documented and is easily determined by experimentation, but it would be easier to learn and teach the language if there were some clear principle underlying all this. What am I missing?
No, it really is confusing. The problem is that there are two conflicting clear principles. Factors could be
- integer variables with labels (similar to value labels in Stata/SPSS or C enums)
- variables that take on values from a pre-specified set, implemented using integer codes (like Pascal enumerated types).
It might be worth noting here that in the second variation, the set will have to be ordered for pragmatic reasons (order of entries in tables, contrast matrices, etc.) even for non-ordered factors. So you can always _define_ the integer codes. In that light, you could say that it is only a matter of making the conventions consistent as to whether factors are character-like or integer-like.
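This is visible directly: the order of levels() fixes the codes and the layout of tables, ordered or not:

```r
f <- factor(c("b", "a"), levels = c("b", "a"))  # unordered, but levels have an order

as.integer(f)  # 1 2 -- codes follow the order of levels(), not alphabetical order
table(f)       # entries come out in the same level order: b first, then a
```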
[In fact, there was historically even a third way to view factors, as a way to reduce the memory use of string variables. That's obsolete now.]
That is, the fact that they are small integers can be seen as part of the interface or just as part of the implementation. It's obvious which one is right, but unfortunately it is differently obvious to different people.
AFAIK there has never been a unified policy on this, dating back before R, so different functions behave differently. There have been changes in R over the years, mostly in the direction of making factors more like Pascal enumerations.
S3-style object-orientation and coercion rules also played their part: It was easy to code a group method for "==" so that sex=="male" works and sex==1 does not (unless levels(sex) include "1"), but in the "[" operator we have automatic unclass() of the index (with S3, you can dispatch on what class of object you index, but not what you index with), so that plot(x,y, col=c(male="lightblue", female="pink")[sex]) will _not_ do character indexing, and may well give the opposite result of what it looks like.

We could change the convention here (coerce factor to character), but there are a couple of demons: What if the object you are indexing does not have names, or has incompatible names? And would there not be a performance hit? Also, the law of inertia: The existing conventions have been used for quite a while, so changing them could break code in unexpected places.

Notice, by the way, that in comparison operations between an (ordered) factor and a character, it is the character that is coerced to a factor, not the other way around: cooked <= "medium" should include "rare"...
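Both pitfalls can be reproduced directly (the level sets below are assumptions for illustration):

```r
sex  <- factor(c("male", "female"), levels = c("female", "male"))
cols <- c(male = "lightblue", female = "pink")

cols[sex]                # "[" sees the codes c(2, 1): male gets "pink", female "lightblue"
cols[as.character(sex)]  # explicit coercion gives the intended character indexing

cooked <- factor(c("rare", "medium", "well done"),
                 levels = c("rare", "medium", "well done"), ordered = TRUE)
cooked <= "medium"       # TRUE TRUE FALSE -- "medium" is coerced to the factor,
                         # so "rare" is included
```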
 O__  ---- Peter Dalgaard              Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics      PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen   Denmark       Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)           FAX: (+45) 35327907