Hi all,
I love the option to not automatically convert strings into factors,
but there are three places that the current option doesn't work where
I think it should:

    options(stringsAsFactors = FALSE)
    str(expand.grid(letters))
    str(type.convert(letters))
    df <- read.fwf(textConnection(paste(letters, collapse = "\n")), 1)
    str(df)

I think type.convert and read.fwf can be fixed by giving them a
stringsAsFactors argument and then using as.is = !stringsAsFactors
(like read.table). The key lines in expand.grid would seem to be

    if (!is.factor(x) && is.character(x))
        x <- factor(x, levels = unique(x))

but I'm not sure why they are being converted to factors in the first place.

Regards,
Hadley
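
To make the expand.grid case concrete, here is a minimal sketch. It assumes a version of R in which expand.grid() accepts a stringsAsFactors argument, which was not the case when this message was written. Passing the argument in the call makes the result independent of any global option; the names g_factor and g_string are illustrative only.

```r
# Assumption: expand.grid() has a stringsAsFactors argument (current R does).
# Character input becomes a factor when the argument is TRUE ...
g_factor <- expand.grid(x = letters[1:3], stringsAsFactors = TRUE)

# ... and stays character when it is FALSE.
g_string <- expand.grid(x = letters[1:3], stringsAsFactors = FALSE)

str(g_factor)
str(g_string)
```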
On Mon, 17 Nov 2008, hadley wickham wrote:
> Hi all, I love the option to not automatically convert strings into
> factors, but there are three places that the current option doesn't
> work where I think it should:

Perhaps you mean 'when I would like it to'? Things *should* work as documented, surely?

> options(stringsAsFactors = FALSE)
> str(expand.grid(letters))
> str(type.convert(letters))
> df <- read.fwf(textConnection(paste(letters,collapse="\n")), 1)
> str(df)

I get

    str(df)
    'data.frame': 26 obs. of 1 variable:
     $ V1: chr "a" "b" "c" "d" ...

so what is wrong with that? read.fwf just calls read.table, so the default options of read.table apply.

> I think type.convert and read.fwf can be fixed by giving them a
> stringsAsFactors argument and then using as.is = !stringsAsFactors
> (like read.table).

Seems to me that there is nothing wrong with read.fwf. For type.convert() we could have the default as.is = !default.stringsAsFactors() but I think a strong case needs to be made to change the documented behaviour.

> The key lines in expand.grid would seem to be
>
>     if (!is.factor(x) && is.character(x))
>         x <- factor(x, levels = unique(x))
>
> but I'm not sure why they are being converted to factors in the first place.

Nor I am, but it goes back to at least r2107, over 10 years ago. I don't see much problem with adding a 'stringsAsFactors' argument there.

Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
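
The two points in the reply above can be checked with a short sketch. It assumes that read.fwf() forwards extra arguments to read.table() through `...` and that type.convert() takes an as.is argument, as in current R. The helper read_letters is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: read five one-character records with read.fwf(),
# passing stringsAsFactors through to read.table().
read_letters <- function(as_factors) {
  con <- textConnection(paste(letters[1:5], collapse = "\n"))
  on.exit(close(con))
  read.fwf(con, widths = 1, stringsAsFactors = as_factors)
}

df_chr <- read_letters(FALSE)  # V1 stays character
df_fct <- read_letters(TRUE)   # V1 becomes a factor

# type.convert() is controlled by as.is, not by stringsAsFactors.
tc_chr <- type.convert(letters[1:5], as.is = TRUE)
tc_fct <- type.convert(letters[1:5], as.is = FALSE)

str(df_chr)
str(df_fct)
```

Setting both arguments in the call leaves it to the caller, not to options(), to decide which behaviour applies.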
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of hadley wickham
> Sent: Monday, November 17, 2008 5:10 AM
> To: r-devel at r-project.org
> Subject: [Rd] stringsAsFactors = FALSE
> ...
> The key lines in expand.grid would seem to be
>
>     if (!is.factor(x) && is.character(x))
>         x <- factor(x, levels = unique(x))
>
> but I'm not sure why they are being converted to factors in
> the first place.

I think expand.grid converts input strings to factors so they retain the order they have in the input. (Note that the levels argument is unique(x), not the sort(unique(x)) that data.frame uses.) People generally give expand.grid sorted input and expect it to not alter the order (the order of the levels affects tables and some plots).

    lapply(expand.grid(Grade=c("Bad","Good","Better"),
                       Size=c("Small","Medium","Large")), levels)
    $Grade
    [1] "Bad"    "Good"   "Better"
    $Size
    [1] "Small"  "Medium" "Large"

    lapply(data.frame(Grade=c("Bad","Good","Better"),
                      Size=c("Small","Medium","Large")), levels)
    $Grade
    [1] "Bad"    "Better" "Good"
    $Size
    [1] "Large"  "Medium" "Small"

I have nothing against adding the stringsAsFactors argument to
expand.grid.
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
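
The difference described in the message above comes down to how the levels are chosen. A minimal sketch using base factor() directly, so that it does not depend on the stringsAsFactors setting of any other function; the variable names are illustrative.

```r
grade <- c("Bad", "Good", "Better")

# Levels in order of first appearance, as in the expand.grid lines quoted above.
by_appearance <- factor(grade, levels = unique(grade))

# factor() with no levels argument sorts the unique values instead.
by_sort <- factor(grade)

levels(by_appearance)  # "Bad" "Good" "Better"
levels(by_sort)        # "Bad" "Better" "Good"
```

Tables and model contrasts follow the level order, so the two factors hold the same values but can present and analyse them differently.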
On Mon, 17 Nov 2008, Prof Brian Ripley wrote:
> On Mon, 17 Nov 2008, hadley wickham wrote:
>
>> Hi all, I love the option to not automatically convert strings into
>> factors, but there are three places that the current option doesn't
>> work where I think it should:
>
> Perhaps you mean 'when I would like it to'? Things *should* work as
> documented, surely?
>
>> options(stringsAsFactors = FALSE)
>> str(expand.grid(letters))
>> str(type.convert(letters))
>> df <- read.fwf(textConnection(paste(letters,collapse="\n")), 1)
>> str(df)
>
> I get
>
>     str(df)
>     'data.frame': 26 obs. of 1 variable:
>      $ V1: chr "a" "b" "c" "d" ...
>
> so what is wrong with that? read.fwf just calls read.table, so the
> default options of read.table apply.
>
>> I think type.convert and read.fwf can be fixed by giving them a
>> stringsAsFactors argument and then using as.is = !stringsAsFactors
>> (like read.table).
>
> Seems to me that there is nothing wrong with read.fwf. For
> type.convert() we could have the default
> as.is = !default.stringsAsFactors() but I think a strong case needs
> to be made to change the documented behaviour.

It seems only to be used in RODBC (where I have some extra control pending), simecol and BioC:beadarraySNP (both with as.is=TRUE) and reshape (author, one Hadley Wickham). Given it is documented as a help utility, it seems up to the caller to set the behaviour he wants.

>> The key lines in expand.grid would seem to be
>>
>>     if (!is.factor(x) && is.character(x))
>>         x <- factor(x, levels = unique(x))
>>
>> but I'm not sure why they are being converted to factors in the first
>> place.
>
> Nor I am, but it goes back to at least r2107, over 10 years ago. I
> don't see much problem with adding a 'stringsAsFactors' argument there.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595

On Mon, Nov 17, 2008 at 9:03 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
> On Mon, 17 Nov 2008, hadley wickham wrote:
>
>> Hi all, I love the option to not automatically convert strings into
>> factors, but there are three places that the current option doesn't
>> work where I think it should:
>
> Perhaps you mean 'when I would like it to'? Things *should* work as
> documented, surely?

In an ideal world, I think things should be documented *and* consistent.

>> options(stringsAsFactors = FALSE)
>> str(expand.grid(letters))
>> str(type.convert(letters))
>> df <- read.fwf(textConnection(paste(letters,collapse="\n")), 1)
>> str(df)
>
> I get
>
>     str(df)
>     'data.frame': 26 obs. of 1 variable:
>      $ V1: chr "a" "b" "c" "d" ...
>
> so what is wrong with that? read.fwf just calls read.table, so the
> default options of read.table apply.

Ok, that's weird. I get factors.

>> I think type.convert and read.fwf can be fixed by giving them a
>> stringsAsFactors argument and then using as.is = !stringsAsFactors
>> (like read.table).
>
> Seems to me that there is nothing wrong with read.fwf. For
> type.convert() we could have the default
> as.is = !default.stringsAsFactors() but I think a strong case needs
> to be made to change the documented behaviour.

Well, my intuition was that type.convert should mirror the behaviour of read.table, since it is what does the conversion behind the scenes. I can of course change my own code.

>> The key lines in expand.grid would seem to be
>>
>>     if (!is.factor(x) && is.character(x))
>>         x <- factor(x, levels = unique(x))
>>
>> but I'm not sure why they are being converted to factors in the first
>> place.
>
> Nor I am, but it goes back to at least r2107, over 10 years ago. I
> don't see much problem with adding a 'stringsAsFactors' argument there.

Great, thanks.

Hadley

On Mon, Nov 17, 2008 at 11:06 AM, William Dunlap <wdunlap at tibco.com> wrote:
>> From: r-devel-bounces at r-project.org
>> [mailto:r-devel-bounces at r-project.org] On Behalf Of hadley wickham
>> Sent: Monday, November 17, 2008 5:10 AM
>> To: r-devel at r-project.org
>> Subject: [Rd] stringsAsFactors = FALSE
>> ...
>> The key lines in expand.grid would seem to be
>>
>>     if (!is.factor(x) && is.character(x))
>>         x <- factor(x, levels = unique(x))
>>
>> but I'm not sure why they are being converted to factors in
>> the first place.
>
> I think expand.grid converts input strings to factors so they retain
> the order they have in the input. (Note that the levels argument is
> unique(x), not the sort(unique(x)) that data.frame uses.) People
> generally give expand.grid sorted input and expect it to not alter
> the order (the order of the levels affects tables and some plots).

Ah, that makes sense. (Although the conversion to factors just seems to be a convenient way to achieve the desired effect in this case; there's no reason they have to be factors in the output.)

Hadley

William Dunlap wrote:
>> but I'm not sure why they are being converted to factors in the first place.
>
> I think expand.grid converts input strings to factors so they retain the order they have in the input.

Yep. These things do matter. Incidentally, I recently got burned by cooking an example using expand.grid, writing the data to a file with write.table and reading it back in during lecture with read.table. Odds ratio turned upside down...

   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
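
A sketch of the kind of round trip described above, assuming a single factor column. stringsAsFactors = TRUE is passed explicitly so that the result does not depend on the R version or on options(); the file and variable names are illustrative. The text file stores only the values, so the level order built by expand.grid is lost and read.table falls back to sorted levels, which presumably is what changed the reference category in the lecture example.

```r
grid <- expand.grid(Grade = c("Bad", "Good", "Better"), stringsAsFactors = TRUE)
original_levels <- levels(grid$Grade)   # order of appearance

# Write to a temporary file and read it back.
tmp <- tempfile(fileext = ".txt")
write.table(grid, tmp, row.names = FALSE)
back <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)
unlink(tmp)

reread_levels <- levels(back$Grade)     # sorted on re-reading

# One way to keep the order: restore the levels explicitly after reading.
back$Grade <- factor(back$Grade, levels = original_levels)

original_levels
reread_levels
```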
"WD" == William Dunlap <wdunlap at tibco.com>
on Mon, 17 Nov 2008 09:06:49 -0800 writes:
>> From: r-devel-bounces at r-project.org
>> [mailto:r-devel-bounces at r-project.org] On Behalf Of
>> hadley wickham Sent: Monday, November 17, 2008 5:10 AM
>> To: r-devel at r-project.org Subject: [Rd] stringsAsFactors
>> = FALSE ... The key lines in expand.grid would seem to
>> be
>>
>> if (!is.factor(x) && is.character(x)) x <- factor(x,
>> levels = unique(x))
>>
>> but I'm not sure why they are being converted to factors
>> in the first place.
WD> I think expand.grid converts input strings to factors so
WD> they retain the order they have in the input. (Note
WD> that the levels argument is unique(x), not the
WD> sort(unique(x)) that data.frame uses.) People generally
WD> give expand.grid sorted input and expect it to not alter
WD> the order (the order of the levels affects tables and
WD> some plots).
>>
WD> lapply(expand.grid(Grade=c("Bad","Good","Better"),
WD>                    Size=c("Small","Medium","Large")), levels)
WD> $Grade
WD> [1] "Bad"    "Good"   "Better"
WD> $Size
WD> [1] "Small"  "Medium" "Large"
>>
WD> lapply(data.frame(Grade=c("Bad","Good","Better"),
WD>                   Size=c("Small","Medium","Large")), levels)
WD> $Grade
WD> [1] "Bad"    "Better" "Good"
WD> $Size
WD> [1] "Large"  "Medium" "Small"
WD> I have nothing against adding the stringsAsFactors
WD> argument to expand.grid.
That's fine, but I am VERY MUCH against
making the default of that argument depend on the ominous
default.stringsAsFactors()
which is determined by getOption("stringsAsFactors").
Why would I hate such a change very much :
Note that we have here an option which would change the
result of a standard R (S) function expand.grid().
Whereas I already did not like that change when it happened for
read.table(), in that case, one could at least say, that
read.table() is in some way platform dependent
{(because it
typically depends on files of the local platform, but as we
know this is not true even there; even now, if I tell my
students, or a book author tells her readers to use
read.table("http://.....") I can no longer be sure that my
students get the same data frame, because they could have
different settings of getOption("stringsAsFactors")
.... horrible, really!! )}
Please, R should stay as much a functional language as possible
and sensible!
If we start having global options more and more influence
the result of standard R functions, we are going down a very
slippery slope, and one that is making R even more idiosyncratic
than it already needs to be.
Please, no !!
Rather revert the read.table() default of "stringsAsFactors" to
not depend on the option, and maybe provide another set of short
forms of the various
read.table(*, stringsAsFactors=FALSE)
incantations such that
all the factor-haters-string-lovers can use these short forms...
At the very first DSC, 1999, Joe Eaton, author of GNU octave,
told us how he regretted that he had started going down that bad
path, because users had started asking for it.
In the extreme case, we are ending up with a "language" that
depends on a whole huge status setting, and what a given
function computes can no longer be predicted by looking at the
function calls, unless you simultaneously know that whole status.
Please, No !!
Martin Maechler, ETH Zurich
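
The "short form" idea in the message above could look something like the following sketch. The name read_table_chr is hypothetical and is not an existing R function; the point is only that the behaviour is fixed by the call itself and cannot be changed by getOption("stringsAsFactors").

```r
# Hypothetical short form: strings always stay character,
# whatever options() says.
read_table_chr <- function(...) read.table(..., stringsAsFactors = FALSE)

# Small illustrative input file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id name", "1 a", "2 b"), tmp)

d <- read_table_chr(tmp, header = TRUE)
unlink(tmp)

str(d)
```

With a wrapper of this kind, a reader can tell from the call alone what type the string columns will have.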
WD> Bill Dunlap TIBCO Software Inc - Spotfire Division
WD> wdunlap tibco.com