stringsAsFactors = FALSE

8 messages · Hadley Wickham, William Dunlap, Brian Ripley +2 more

#
Hi all,

I love the option to not automatically convert strings into factors,
but there are three places that the current option doesn't work where
I think it should:

options(stringsAsFactors = FALSE)

str(expand.grid(letters))
str(type.convert(letters))

df <- read.fwf(textConnection(paste(letters,collapse="\n")), 1)
str(df)

I think type.convert and read.fwf can be fixed by giving them a
stringsAsFactors argument and then using as.is = !stringsAsFactors
(like read.table).  The key lines in expand.grid would seem to be

            if (!is.factor(x) && is.character(x))
                x <- factor(x, levels = unique(x))

but I'm not sure why they are being converted to factors in the first place.

Regards,

Hadley
#
On Mon, 17 Nov 2008, hadley wickham wrote:

            
Perhaps you mean 'when I would like it to'?   Things *should* work as 
documented, surely?
I get
'data.frame':   26 obs. of  1 variable:
  $ V1: chr  "a" "b" "c" "d" ...

so what is wrong with that?  read.fwf just calls read.table, so the 
default options of read.table apply.
Seems to me that there is nothing wrong with read.fwf.  For type.convert() 
we could have the default

as.is = !default.stringsAsFactors()

but I think a strong case needs to be made to change the documented 
behaviour.
Nor am I, but it goes back to at least r2107, over 10 years ago.  I don't 
see much problem with adding a 'stringsAsFactors' argument there.
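
A minimal sketch of the as.is behaviour discussed above, assuming a current R in which type.convert() is called with an explicit as.is argument. The input vector and the variable names are illustrative only:

```r
x <- c("a", "b", "c")

# as.is = TRUE leaves strings as a character vector
as_chr <- type.convert(x, as.is = TRUE)

# as.is = FALSE converts strings to a factor
as_fac <- type.convert(x, as.is = FALSE)

str(as_chr)
str(as_fac)
```

Passing as.is explicitly at the call site keeps the result independent of any global setting.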
#
I think expand.grid converts input strings to factors so they
retain the order they have in the input.  (Note that the levels
argument is unique(x), not the sort(unique(x)) that data.frame uses.)
People generally give expand.grid sorted input and expect it to
not alter the order (the order of the levels affects tables and
some plots).
lapply(expand.grid(Grade=c("Bad","Good","Better"),
                   Size=c("Small","Medium","Large")), levels)
$Grade
[1] "Bad"    "Good"   "Better"

$Size
[1] "Small"  "Medium" "Large"
lapply(data.frame(Grade=c("Bad","Good","Better"),
                  Size=c("Small","Medium","Large")), levels)
$Grade
[1] "Bad"    "Better" "Good"

$Size
[1] "Large"  "Medium" "Small"


I have nothing against adding the stringsAsFactors argument to
expand.grid.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
#
On Mon, 17 Nov 2008, Prof Brian Ripley wrote:

            
It seems only to be used in RODBC (where I have some extra control 
pending), simecol and BioC:beadarraySNP (both with as.is=TRUE) and reshape 
(author, one Hadley Wickham).  Given it is documented as a help utility, it 
seems up to the caller to set the behaviour he wants.

  
    
#
On Mon, Nov 17, 2008 at 9:03 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
In an ideal world, I think things should be documented *and* consistent.

Ok, that's weird. I get factors.

Well, my intuition was that type.convert should mirror the behaviour
of read.table, since it is what does the conversion behind the scenes.
I can of course change my own code.

Great, thanks.

Hadley
#
On Mon, Nov 17, 2008 at 11:06 AM, William Dunlap <wdunlap at tibco.com> wrote:
Ah, that makes sense.  (Although the conversion to factors just seems
to be a convenient way to achieve the desired effect in this case -
there's no reason they have to be factors in the output)

Hadley
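
A short sketch of the point that the output need not contain factors, assuming a version of R in which expand.grid() accepts the stringsAsFactors argument proposed in this thread. The Grade and Size values are taken from the example earlier in the thread:

```r
g <- expand.grid(Grade = c("Bad", "Good", "Better"),
                 Size  = c("Small", "Medium", "Large"),
                 stringsAsFactors = FALSE)

# Columns stay character vectors
str(g)

# Rows still follow the input order, with the first variable varying fastest
unique(g$Grade)
unique(g$Size)
```

The row order is preserved either way. What is lost without factors is the level order that tables and some plots rely on.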
#
William Dunlap wrote:

            
Yep. These things do matter. Incidentally, I recently got burned by 
cooking an example using expand.grid, writing the data to a file with 
write.table and reading it back in during lecture with read.table. Odds 
ratio turned upside down...
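
A hedged sketch of how such a round trip can reorder levels, assuming a current R where both expand.grid() and read.table() accept stringsAsFactors. The temporary file and the Grade values are illustrative and not the original lecture example:

```r
g <- expand.grid(Grade = c("Bad", "Good", "Better"),
                 stringsAsFactors = TRUE)

tmp <- tempfile(fileext = ".txt")
write.table(g, tmp)

# Reading back rebuilds the factor from the text, so levels are sorted
g2 <- read.table(tmp, stringsAsFactors = TRUE)
unlink(tmp)

levels(g$Grade)   # input order, as produced by expand.grid
levels(g2$Grade)  # alphabetical order, as produced by read.table
```

Any analysis that treats the first level as the reference category can therefore change after the file round trip.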
#
>> From: r-devel-bounces at r-project.org
    >> [mailto:r-devel-bounces at r-project.org] On Behalf Of
    >> hadley wickham Sent: Monday, November 17, 2008 5:10 AM
    >> To: r-devel at r-project.org Subject: [Rd] stringsAsFactors
    >> = FALSE ...  The key lines in expand.grid would seem to
    >> be
    >> 
    >> if (!is.factor(x) && is.character(x)) x <- factor(x,
    >> levels = unique(x))
    >> 
    >> but I'm not sure why they are being converted to factors
    >> in the first place.

    WD> I think expand.grid converts input strings to factors so
    WD> they retain the order they have in the input.  (Note
    WD> that the levels argument is unique(x), not the
    WD> sort(unique(x)) that data.frame uses.)  People generally
    WD> give expand.grid sorted input and expect it to not alter
    WD> the order (the order of the levels affects tables and
    WD> some plots).

    WD> lapply(expand.grid(Grade=c("Bad","Good","Better"),
    WD>                    Size=c("Small","Medium","Large")), levels)
    WD> $Grade
    WD> [1] "Bad"    "Good"   "Better"
    WD>
    WD> $Size
    WD> [1] "Small"  "Medium" "Large"

    WD> lapply(data.frame(Grade=c("Bad","Good","Better"),
    WD>                   Size=c("Small","Medium","Large")), levels)
    WD> $Grade
    WD> [1] "Bad"    "Better" "Good"
    WD>
    WD> $Size
    WD> [1] "Large"  "Medium" "Small"


    WD> I have nothing against adding the stringsAsFactors
    WD> argument to expand.grid.

That's fine, but I am VERY MUCH against 
making the default of that argument depend on the ominous
  default.stringsAsFactors()
which is determined by getOption("stringsAsFactors").

Why would I hate such a change very much : 
 Note that we have here an option which would change the
 result of a standard R (S) function  expand.grid().

Whereas I already did not like that change when it happened for
read.table(), in that case, one could at least say, that
read.table() is in some way platform dependent 
{(because it
  typically depends on files of the local platform, but as we
  know this is not true even there; even now, if I tell my
  students, or a book author tells her readers to use
  read.table("http://.....")  I can no longer be sure that my
  students get the same data frame, because they could have
  different settings of getOption("stringsAsFactors")
  .... horrible, really!! )}

Please, R should stay as much a functional language as possible
and sensible!
If we start having global options more and more influence
the result of standard R functions, we are going down a very
slippery slope, and one that is making R even more idiosyncratic
than it already needs to be.
Please, no !!  
Rather revert the read.table() default of "stringsAsFactors" to
not depend on the option, and maybe provide another set of short
forms of the various
       read.table(*, stringsAsFactors=FALSE)
incantations such that
all the factor-haters-string-lovers can use these short forms...
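
A minimal sketch of what one such short form might look like. The name read_table_chr is hypothetical and not part of R; it only fixes the argument at the call site so the result does not depend on a global option:

```r
# Hypothetical short form: always keep strings as character vectors
read_table_chr <- function(...) {
  read.table(..., stringsAsFactors = FALSE)
}

# Usage with inline text instead of a file
df <- read_table_chr(text = "x\na\nb", header = TRUE)
str(df)
```

Because stringsAsFactors is hard-wired inside the wrapper, a caller who also passes it gets an error instead of a silently different result.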

At the very first DSC, 1999, Joe Eaton, author of GNU octave,
told us how he regretted that he had started going down that bad
path, because users had started asking for it.
In the extreme case, we are ending up with a "language" that
depends on a whole huge status setting, and what a given
function computes can no longer be predicted by looking at the
function calls, unless you simultaneously know that whole status.
Please, No !!

Martin Maechler, ETH Zurich


    WD> Bill Dunlap TIBCO Software Inc - Spotfire Division
    WD> wdunlap tibco.com
