should `data` respect default.stringsAsFactors()?

9 messages · Joshua Ulrich, Michael Nelson, Peter Dalgaard +1 more

#
Hiya,

Probably been debated elsewhere....

I note that R's `data` function does not respect default.stringsAsFactors().

By my lights, it should, especially as it is documented to call read.table, which DOES respect it.

Oh, but:  http://r.789695.n4.nabble.com/stringsAsFactors-FALSE-tp921891p921893.html  

Compelling.  I have to agree.

So, I change my mind.  

By my lights, `data` should then be documented to NOT respect default.stringsAsFactors.

Else?

~Malcolm Cook
#
What the <bleep> are you on about? data() does many things, only some of which call read.table() et al., and the ones that do have no special treatment of stringsAsFactors.

-pd

  
    
#
Hi Peter,

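Sorry if I was not clear.  Perhaps an example will make my point:

> data(iris)
> class(iris$Species)
[1] "factor"
> write.table(iris,'data/myiris.tab')
> data(myiris)
> class(myiris$Species)
[1] "factor"
> rm(myiris)
> options(stringsAsFactors = FALSE)
> data(myiris)
> class(myiris$Species)
[1] "factor"
> myiris<-read.table("data/myiris.tab",header=TRUE)
> class(myiris$Species)
[1] "character"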

I am surprised to find that, in the above,
	setting the global option stringsAsFactors = FALSE does NOT affect how Species is read in by the `data` function
whereas
	setting the global option stringsAsFactors = FALSE DOES affect how Species is read in by read.table

especially since data is documented as calling read.table.

In my opinion, one or the other should change (the behavior of data, or the documentation).

<bleep> <bleep>,

~ Malcolm


 > -----Original Message-----
 > From: peter dalgaard [mailto:pdalgd at gmail.com]
 > Sent: Thursday, February 18, 2016 3:32 PM
 > To: Cook, Malcolm <MEC at stowers.org>
 > Cc: r-devel at stat.math.ethz.ch
 > Subject: Re: [Rd] should `data` respect default.stringsAsFactors()?
 > 
 > What the <bleep> are you on about? data() does many things, only some of
 > which call read.table() et al., and the ones that do have no special treatment
 > of stringsAsFactors.
 > 
 > -pd
 >
> > On 18 Feb 2016, at 21:25 , Cook, Malcolm <MEC at stowers.org> wrote:
> >
 > > Hiya,
 > >
 > > Probably been debated elsewhere....
 > >
 > > I note that R's `data` function does not respect default.stringsAsFactors
 > >
 > > By my lights, it should, especially as it is documented to call read.table,
 > which DOES respect.
 > >
 > > Oh, but:  http://r.789695.n4.nabble.com/stringsAsFactors-FALSE-
 > tp921891p921893.html
 > >
 > > Compelling.  I have to agree.
 > >
 > > So, I change my mind.
 > >
 > > By my lights, `data` should then be documented to NOT respect
 > default.stringsAsFactors.
 > >
 > > Else?
 > >
 > > ~Malcolm Cook
 > >
 > > ______________________________________________
 > > R-devel at r-project.org mailing list
 > > https://stat.ethz.ch/mailman/listinfo/r-devel
 > 
 > --
 > Peter Dalgaard, Professor,
 > Center for Statistics, Copenhagen Business School
 > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
 > Phone: (+45)38153501
 > Office: A 4.23
 > Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
#
On Thu, Feb 18, 2016 at 6:03 PM, Cook, Malcolm <MEC at stowers.org> wrote:
To be explicit, it's documented as calling read.table(..., header =
TRUE) in this case, but it actually calls read.table(..., header =
TRUE, as.is = FALSE), which results in class(myiris$Species) of
"factor".

R> myiris<-read.table("data/myiris.tab",header=TRUE,as.is=FALSE)
R> class(myiris$Species)
[1] "factor"

So it seems like adding as.is = FALSE to the call in the documentation
would clear this up.
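
A minimal sketch of the point above, assuming only base R and a temporary file (the names `tmp`, `forced` and `kept` are illustrative, not from the thread). It reads the same file back with and without `as.is = FALSE`; because `as.is` is passed explicitly, the result should not depend on the `stringsAsFactors` option.

```r
# Write iris to a temporary tab-style file, as write.table() did in the example.
tmp <- tempfile(fileext = ".tab")
write.table(iris, tmp)

# The call data() is reported to make for .tab/.txt files:
# as.is = FALSE converts character columns to factors.
forced <- read.table(tmp, header = TRUE, as.is = FALSE)

# For contrast, as.is = TRUE leaves character columns as character.
kept <- read.table(tmp, header = TRUE, as.is = TRUE)

class(forced$Species)  # "factor"
class(kept$Species)    # "character"

unlink(tmp)
```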

  
    
#
As Peter pointed out.


data loads data from packages. Various formats are supported. The package author(s) will decide how best to ship (and load) any such data. 


When you call `data(iris)`, it loads iris as it is defined in the datasets package

The definition can be seen here:

https://github.com/wch/r-source/blob/trunk/src/library/datasets/data/iris.R

You will note that Species is explicitly a factor and it won't have been read in by read.table, but by being "source()d" because it is a .R file.
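
As a rough check of this, assuming a standard R installation with the datasets package available, the sketch below loads iris into a scratch environment (the name `e` is illustrative) with the option switched off. Species should still come back as a factor, since it is stored that way in the package.

```r
# Turn the option off and remember the previous setting.
old_opt <- options(stringsAsFactors = FALSE)

# Load the packaged dataset into a scratch environment
# so the global workspace is left untouched.
e <- new.env()
data(iris, envir = e)

species_class <- class(e$iris$Species)
species_class  # "factor"

# Restore the previous option value.
options(old_opt)
```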


Michael
#
Aha... Hadn't noticed that stringsAsFactors only works via as.is in read.table. 

Yes, the doc should probably be fixed. The code probably not -- packages loading different data sets depending on user options is an even worse idea than having the option in the first place... (I don't mean having the possibility, I mean the default.stringsAsFactors thing). 

In general, read.table() gets many things wrong, if you don't set switches and/or postprocess. E.g., even when you do intend to read factors, the alphabetical level order is often not desired. My favourite workaround for data() is to drop a corresponding foo.R file in the ./data directory. This will be run in preference to loading foo.txt (or foo.csv, etc) and can contain, like, 

dd <- read.table("foo.txt", .....) 
dd$cook <- factor(dd$cook, levels=c("rare","medium","well-done"))

etc.
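
A self-contained sketch of this workaround, under stated assumptions: the scratch directory, the dataset name `steak`, and the `price` column are hypothetical and only there to make the example runnable. It relies on data() preferring a .R file over a .txt file of the same name and sourcing it with the working directory set to the data directory.

```r
# Hypothetical scratch project with a data/ subdirectory.
proj <- tempfile("proj")
dir.create(file.path(proj, "data"), recursive = TRUE)

# A plain-text data file that data() would otherwise read via read.table().
writeLines(c("cook price",
             "medium 12",
             "rare 10",
             "well-done 11"),
           file.path(proj, "data", "steak.txt"))

# The companion .R file: read the text file, then set the level order explicitly
# instead of accepting the alphabetical default.
writeLines(c(
  'steak <- read.table("steak.txt", header = TRUE, as.is = TRUE)',
  'steak$cook <- factor(steak$cook, levels = c("rare", "medium", "well-done"))'
), file.path(proj, "data", "steak.R"))

# data() searches ./data of the working directory when no package is given.
old_wd <- setwd(proj)
e <- new.env()
data(steak, envir = e)
setwd(old_wd)

levels(e$steak$cook)  # "rare" "medium" "well-done"
```

Without steak.R, the same call would fall back to reading steak.txt directly and the levels would be in alphabetical order.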

-pd

  
    
#
Joshua,
> > Hi Peter,
> >
> > Sorry if I was not clear.  Perhaps an example will make my point:
> >
> >> data(iris)
> >> class(iris$Species)
> > [1] "factor"
> >> write.table(iris,'data/myiris.tab')
> >> data(myiris)
> >> class(myiris$Species)
> > [1] "factor"
> >> rm(myiris)
> >> options(stringsAsFactors = FALSE)
> >> data(myiris)
> >> class(myiris$Species)
> > [1] "factor"
> >> myiris<-read.table("data/myiris.tab",header=TRUE)
> >> class(myiris$Species)
> > [1] "character"
> >
> > I am surprised to find that in the above
> >           setting the global option stringsAsFactors = FALSE does NOT affect how Species is being read in by the `data` function
> > whereas
> >         setting the global option stringsAsFactors = FALSE DOES affect how Species is being read in by read.table
> >
> > especially since data is documented as calling read.table.
> >
> To be explicit, it's documented as calling read.table(..., header =
> TRUE) in this case, but it actually calls read.table(..., header =
> TRUE, as.is = FALSE), which results in class(myiris$Species) of
> "factor".

Aha - makes sense.

 > 
 > R> myiris<-read.table("data/myiris.tab",header=TRUE,as.is=FALSE)
 > R> class(myiris$Species)
 > [1] "factor"
 > 
 > So it seems like adding as.is = FALSE to the call in the documentation
 > would clear this up.

I agree - thanks for digging into the source - you have unearthed the root cause.

~Malcolm

#
Hi,

 > Aha... Hadn't noticed that stringsAsFactors only works via as.is in read.table.
 > 
 > Yes, the doc should probably be fixed. The code probably not 

Agreed.  

Is someone on-list authorized and willing to make the documentation change?  I suppose I could learn what it takes to be a "player", but for such a trivial fix, it probably is overkill.  Dissenting opinions?
 > -- packages loading different data sets depending on user options is an even worse idea
 > than having the option in the first place... (I don't mean having the possibility, I
 > mean the default.stringsAsFactors thing).
 > 
 > In general, read.table() gets many things wrong

I agree with you that "read.table() gets many things wrong" and I too have my favorite workarounds - but that was not my concern.  My concern is that data() does not work as documented.

~Malcolm
#
On 19 Feb 2016, at 16:02 , Cook, Malcolm <MEC at stowers.org> wrote:

            
I have fixed it for r-devel.

-pd