stringsAsFactors

At 18:01 11/02/2013, Ista Zahn wrote:
FWIW my view is that for data cleaning and organizing factors just get
in the way. For modeling I like them because they make it easier to
understand what is happening. For example I can look at the levels()
to see what the reference group will be. With characters one has to
know a) that levels are created in alphabetical order and b) the
alphabetical order of the unique values in the character vector.
Ugh. So my habit is to turn off stringsAsFactors, then explicitly
convert to factors before modeling. (I also use factors to change the
order in which things are displayed in tables and graphs, another
place where converting to factors myself is useful but creating
them in alphabetical order by default is not.)

All this is to say that I would like options(stringsAsFactors=FALSE) to
be the default, but I like the warning about converting to factors in
modeling functions because it reminds me that I forgot to convert them,
which I like to do anyway...
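
The workflow described above (keep strings as characters, then convert
to factors explicitly just before modeling) might look like the sketch
below. The data frame and the group names are made up for illustration,
and stringsAsFactors is passed explicitly so the result does not depend
on the session default.

```r
# Hypothetical data; the column and group names are illustrative only.
d <- data.frame(group = c("treated", "control", "placebo", "control"),
                y     = c(2.1, 1.0, 1.2, 0.9),
                stringsAsFactors = FALSE)

# Implicit conversion orders the levels alphabetically, so "control"
# would become the reference group.
alphabetical <- factor(d$group)
levels(alphabetical)

# Explicit conversion: choose the level order, and hence the reference.
d$group <- factor(d$group, levels = c("placebo", "control", "treated"))
levels(d$group)[1]   # the reference group used by the default contrasts

fit <- lm(y ~ group, data = d)
names(coef(fit))     # coefficients are named relative to "placebo"
```

The same explicit level order also controls how the groups are displayed
in tables and plots.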

I seem to be one of the few people who find the current default helpful.
When I read in a dataset I am nearly always going to follow it with one or
more of the modelling functions and so I do want to treat the categorical
variables as factors. I cannot off-hand think of an example where I have had
to convert them to characters.
Your data must reach you in a much better state than mine reaches me.
I spend most of my time organizing, combining, fixing typos,
reshaping, merging and so on. Then I see the dreaded warning

In `[<-.factor`(`*tmp*`, 6, value = "z") :
  invalid factor level, NAs generated

which reminds me that I've forgotten to set stringsAsFactors=FALSE.
However, I'm not saying I don't like factors. Once the data is cleaned
up they are very useful. But often I find that when I'm trying to
clean up a messy data set they just get in the way. And since that is
what I spend most of my time doing, factors get in the way most of the
time for me.
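
For readers who have not met that warning, a minimal sketch of how it
arises during data cleaning, and how character columns avoid it. The
column name and values are invented; stringsAsFactors is set explicitly
in both calls so the example behaves the same under either default.

```r
# Column read in as a factor: only the existing levels may be assigned.
messy <- data.frame(code = c("x", "y", "x"), stringsAsFactors = TRUE)
messy$code[3] <- "z"    # warns: invalid factor level, NA generated
is.na(messy$code[3])    # the corrected value was silently lost

# Same fix on a character column: no warning, value stored as given.
clean <- data.frame(code = c("x", "y", "x"), stringsAsFactors = FALSE)
clean$code[3] <- "z"
clean$code

# Convert once the cleaning is done.
clean$code <- factor(clean$code)
levels(clean$code)
```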
Incidentally xkcd has, while this discussion has been going on, posted
something relevant
http://www.xkcd.com/1172/

Best,
Ista

On Mon, Feb 11, 2013 at 12:50 PM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
On 11/02/2013 12:13 PM, William Dunlap wrote:
Note that changing this does not just mean getting rid of "silly
warnings". Currently, predict.lm() can give wrong answers when
stringsAsFactors is FALSE.

   > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)),
   +                 y=c(1:4, 15:17, 28.1,28.8,30.1))
   > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
   Warning message:
   In model.matrix.default(mt, mf, contrasts) :
     variable 'f' converted to a factor
   > predict(fit_ab, newdata=d)
    1  2  3  4  5  6  7  8  9 10
    1  2  3  4 25 26 27  8  9 10
   Warning messages:
   1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
     variable 'f' converted to a factor
   2: In predict.lm(fit_ab, newdata = d) :
     prediction from a rank-deficient fit may be misleading

fit_ab is not rank-deficient and the predict should report
    1 2 3 4 NA NA NA 28 29 30

In R-devel, the two warnings about factor conversions are no longer
given, but the predictions are the same and the warning about rank
deficiency still shows up.  If f is set to be a factor, an error is
generated:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor f has new levels B

I think both the warning and error are somewhat reasonable responses.
The fit is rank deficient relative to the model that includes f == "B",
because the column of the design matrix corresponding to f level B would
be completely zero.  In this particular model, we could still do
predictions for the other levels, but it also seems reasonable to quit,
given that clearly something has gone wrong.

I do think that it's unfortunate that we don't get the same result in
both cases, and I'd like to have gotten the predictions you suggested,
but I don't think that's going to happen.  The reason for the difference
is that the subsetting is done before the conversion to a factor, but I
think that is unavoidable without really big changes.
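
One way to obtain the NA-filled predictions asked for above, without
relying on how predict.lm() treats unseen values, is to predict only for
rows whose level was present in the fitting data. This is a user-level
workaround sketch, not a description of what R does internally; the
names train, known and pred are introduced here for illustration.

```r
d <- data.frame(x = 1:10,
                f = rep(c("A", "B", "C"), c(4, 3, 3)),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1),
                stringsAsFactors = FALSE)

# Subset first, so the fit never sees level "B".
train  <- d[d$f != "B", ]
fit_ab <- lm(y ~ x + f, data = train)

# Predict only where the level was seen in training; leave NA elsewhere.
known <- d$f %in% unique(train$f)
pred  <- rep(NA_real_, nrow(d))
pred[known] <- predict(fit_ab, newdata = d[known, ])
pred   # approximately 1 2 3 4 NA NA NA 28 29 30
```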

Duncan Murdoch

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-devel-bounces at r-project.org
[mailto:r-devel-bounces at r-project.org] On Behalf
Of Terry Therneau
Sent: Monday, February 11, 2013 5:50 AM
To: r-devel at r-project.org; Duncan Murdoch
Subject: Re: [Rd] stringsAsFactors

I think your idea to remove the warnings is excellent, and a good
compromise.  Characters already work fine in modeling functions except
for the silly warning.

It is interesting how often the defaults for a program reflect the data
sets in use at the time the defaults were chosen.  There are some such
in my own survival package whose proper value is no longer as "obvious"
as it was when I chose them.  Factors are very handy for variables which
have only a few levels and will be used in modeling.  Every character
variable of every dataset in "Statistical Models in S", which introduced
factors, is of this type so auto-transformation made a lot of sense.
The "solder" data set there is one for which Helmert contrasts are
proper so guess what the default contrast option was?  (I think there
are only a few data sets in the world for which Helmert makes sense,
however, and R eventually changed the default.)
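
As a brief illustration of the two contrast codings mentioned above, the
sketch below prints the Helmert and treatment contrast matrices for a
three-level factor and shows where the session default is stored. It
assumes nothing beyond base R's stats package.

```r
# Helmert coding: each column compares a level with the mean of the
# levels before it.
helmert <- contr.helmert(3)
print(helmert)

# Treatment coding: each column compares a level with the first
# (reference) level.
treatment <- contr.treatment(3)
print(treatment)

# The session-wide default is held in the "contrasts" option and can be
# changed with options(contrasts = ...).
getOption("contrasts")
```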

For character variables that should not be factors, such as a street
address, stringsAsFactors can be a real PITA, and I expect that people's
preference for the option depends almost entirely on how often these
arise in their own work.  As long as there is an option that can be
overridden I'm okay.  Yes, I'd prefer FALSE as the default, partly
because the current value is a tripwire in the hallway that eventually
catches every new user.

Terry Therneau

On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
Both of these were discussed by R Core.  I think it's unlikely the
default for stringsAsFactors will be changed (some R Core members like
the current behaviour), but it's fairly likely the show.signif.stars
default will change.  (That's if someone gets around to it:  I
personally don't care about that one.  P-values are commonly used
statistics, and the stars are just a simple graphical display of them.
I find some p-values to be useful, and the display to be harmless.)

I think it's really unlikely the more extreme changes (i.e. dropping
show.signif.stars completely, or dropping p-values) will happen.

Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
I'll let the people who like it defend it.  What I will likely do is
make a few changes so that character vectors are automatically changed
to factors in modelling functions, so that operating with
stringsAsFactors=FALSE doesn't trigger silly warnings.

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Michael Dewey
info at aghmed.fsnet.co.uk
http://www.aghmed.fsnet.co.uk/home.html
