Regression stars
23 messages · Norm Matloff, Tim Triche, Jr., Ben Bolker +8 more
On 13-02-09 3:49 PM, Tim Triche, Jr. wrote:
To clarify, I favor changing the defaults for stringsAsFactors and show.signif.stars to FALSE in R-3.0.0, and view any attempt to remove either functionality as a seemingly simple but fundamentally misguided idea.
Both of these were discussed by R Core. I think it's unlikely the default for stringsAsFactors will be changed (some R Core members like the current behaviour), but it's fairly likely the show.signif.stars default will change. (That's if someone gets around to it: I personally don't care about that one. P-values are commonly used statistics, and the stars are just a simple graphical display of them. I find some p-values to be useful, and the display to be harmless.) I think it's really unlikely the more extreme changes (i.e. dropping show.signif.stars completely, or dropping p-values) will happen. Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it. What I will likely do is make a few changes so that character vectors are automatically changed to factors in modelling functions, so that operating with stringsAsFactors=FALSE doesn't trigger silly warnings. Duncan Murdoch
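Since the stars are described above as just a simple display of the p-values, it may help to see that mapping spelled out. The following is a minimal sketch, not a quote from the thread: the p-values are made up, the cutpoints and symbols are those of R's usual "Signif. codes" legend, and `cars` is only a convenient built-in data set.

```r
# Made-up p-values, purely for illustration
p <- c(0.0004, 0.02, 0.07, 0.5)

# Cutpoints and symbols as in the usual "Signif. codes" legend
stars <- symnum(p, corr = FALSE, na = FALSE,
                cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                symbols = c("***", "**", "*", ".", " "))
star_chr <- as.character(stars)
print(data.frame(p = p, stars = star_chr))

# Switch the stars off for this session only, capture a summary, then restore
old <- options(show.signif.stars = FALSE)
fit <- lm(dist ~ speed, data = cars)
out <- capture.output(print(summary(fit)))
options(old)
cat(out, sep = "\n")
```

With the option set to FALSE the coefficient table still shows the p-values; only the star column and its legend are dropped.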
This is just my opinion, of course. The change could easily be accompanied by a startup notice or release notes indicating that the changes have been made, and can be reverted to past behavior if the user so desires. Perhaps more users will investigate the various settings, as a happy side effect. My thanks to everyone who spends time supporting and working on R-core.
On Sat, Feb 9, 2013 at 12:44 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
Changing the default for show.signif.stars should be sufficient to ensure that, if people are going to get themselves into trouble, they will have to do it on purpose. It's just a visual cue; removing it will not remove the underlying issue, namely blind acceptance of unlikely null models and distributions. For any complex problem, there is a solution that is simple, elegant, and wrong.
As grants and careers can depend on these magic numbers, Upton Sinclair might save everyone some trouble... "It is difficult to get a man to understand something, when his salary depends upon his not understanding."
stringsAsFactors, however, is responsible for an endless stream of mildly irritating misunderstandings, and defaulting that to FALSE would be very nice. Just my $0.02. Defaults are one of the most powerful forces in the universe.
Also, I liked your book.
On Sat, Feb 9, 2013 at 10:48 AM, Norm Matloff <matloff at cs.ucdavis.edu> wrote:
Thanks for bringing this up, Frank. Since many of us are "educators," I'd like to suggest a bolder approach. Discontinue even offering the stars as an option. Sadly, we can't stop reporting p-values, as the world expects them, but does R need to cater to that attitude by offering star display? For that matter, why not have R report confidence intervals as a default? Many years ago, I wrote a short textbook on stat, and included a substantial section on the dangers of significance testing. All three internal reviewers liked it, but the funny part is that all three said, "I agree with this, but no one else will." :-) Norm
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- *A model is a lie that helps you see the truth.* Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
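Norm's question about reporting confidence intervals can be made concrete with functions that already exist in base R. This is only a sketch of what such a report might look like, assuming the built-in `cars` data and a 95% level; it is not a proposal for a changed default.

```r
# Fit a simple model on a built-in data set (chosen only for illustration)
fit <- lm(dist ~ speed, data = cars)

est <- coef(fit)
ci <- confint(fit, level = 0.95)   # 95% intervals for each coefficient

# Point estimates next to their interval limits, with no p-values or stars
res <- cbind(estimate = est, ci)
print(res)
```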
1 day later
Duncan Murdoch <murdoch.duncan <at> gmail.com> writes: [snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense of stringsAsFactors=TRUE -- even someone who doesn't personally like it but would like to play devil's advocate?
What I will likely do is make a few changes so that character vectors are automatically changed to factors in modelling functions, so that operating with stringsAsFactors=FALSE doesn't trigger silly warnings. Duncan Murdoch
[apologies for snipping context: "gmane made me do it"]
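The plan quoted above, having modelling functions treat character vectors as factors, can be illustrated with a small example. This is a sketch assuming a reasonably recent R release and made-up data; behaviour (and warnings) in the R versions current at the time of this thread may have differed.

```r
# Toy data (made up); g is deliberately kept as a character vector
d <- data.frame(y = c(1.1, 2.3, 2.9, 1.4, 2.0, 3.2),
                g = rep(c("a", "b", "c"), times = 2),
                stringsAsFactors = FALSE)
stopifnot(is.character(d$g))

# lm() treats the character predictor as categorical and builds dummy
# variables for the levels "b" and "c" relative to the baseline "a"
fit <- lm(y ~ g, data = d)
coef_names <- names(coef(fit))
print(coef_names)
```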
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch <murdoch.duncan <at> gmail.com> writes: [snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense of stringsAsFactors=TRUE -- even someone who doesn't personally like it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs. Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame().
And since you ask for the devil's advocate already, related to the subject line: removing stars is horrible for consulting, with all those people from biology, medicine and other fields who even ask us questions in terms of significance stars, which are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get them from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars.
There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values.
Best, Uwe
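The distinction between read.table() and data.frame() is easier to see side by side. The sketch below passes stringsAsFactors explicitly in every call, so the result does not depend on the default of whichever R version runs it; the tiny data set is made up.

```r
# data.frame(): the same character input, stored two ways
d_chr <- data.frame(g = c("a", "b", "a"), stringsAsFactors = FALSE)
d_fac <- data.frame(g = c("a", "b", "a"), stringsAsFactors = TRUE)

# read.table(): the same text input, read two ways
txt <- "g y\na 1\nb 2\na 3"
r_chr <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
r_fac <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

# Class of the g column in each of the four data frames
classes <- sapply(list(d_chr = d_chr$g, d_fac = d_fac$g,
                       r_chr = r_chr$g, r_fac = r_fac$g), class)
print(classes)
```

Scripts that spell out the argument like this keep working regardless of which default is in force.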
Uwe, I've been consulting for decades and have never once been asked for such stars. And when a clinical researcher puts a sentence in a study protocol that P<0.05 will be considered "significant", I get them to take it out.
Frank
Uwe Ligges-3 wrote
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch
<murdoch.duncan <at> gmail.com> writes:
[snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense of stringsAsFactors=TRUE -- even someone who doesn't personally like it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs. Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame(). And since you ask for the devil's advocate already, related to the subject line: Removing stars is horrible for consulting: With all those people from biology, medicine and other fields who even ask us questions in term of significance stars that are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get it from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars. There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values. Best, Uwe
----- Frank Harrell Department of Biostatistics, Vanderbilt University -- View this message in context: http://r.789695.n4.nabble.com/Regression-stars-tp4657795p4658268.html Sent from the R devel mailing list archive at Nabble.com.
On 12/02/2013 9:20 AM, Uwe Ligges wrote:
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch <murdoch.duncan <at> gmail.com> writes: [snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense of stringsAsFactors=TRUE -- even someone who doesn't personally like it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs.
Could you post an example of a non-trivial one? (By trivial, I mean one that says "data.frame() converts character vectors to factors". Obviously that would need to change. I mean one that just assumes current behaviour, and would be broken by the change.) Duncan Murdoch
Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame(). And since you ask for the devil's advocate already, related to the subject line: Removing stars is horrible for consulting: With all those people from biology, medicine and other fields who even ask us questions in term of significance stars that are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get it from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars. There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values. Best, Uwe
I think that we should use P < .03 (which approximates the probability of 5 consecutive heads) for assigning significance!
Ravi
-----Original Message-----
From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Frank Harrell
Sent: Tuesday, February 12, 2013 9:43 AM
To: r-devel at r-project.org
Subject: Re: [Rd] Regression stars
Uwe I've been consulting for decades and have never once been asked for such stars. And when a clinical researcher puts a sentence in a study protocol that P<0.05 will be considered "significant" I get them to take it out.
Frank
Uwe Ligges-3 wrote
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch
<murdoch.duncan <at> gmail.com> writes:
[snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense of stringsAsFactors=TRUE -- even someone who doesn't personally like it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs. Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame(). And since you ask for the devil's advocate already, related to the subject line: Removing stars is horrible for consulting: With all those people from biology, medicine and other fields who even ask us questions in term of significance stars that are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get it from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars. There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values. Best, Uwe
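The figure in Ravi's remark at the top of this message is easy to check. A one-line sketch, assuming a fair coin and independent tosses:

```r
# Probability of five heads in a row with a fair coin
p_five_heads <- 0.5^5
print(p_five_heads)
```

The result, 1/32, is slightly above .03, so the threshold is an approximation, as Ravi says.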
On 12.02.2013 15:42, Frank Harrell wrote:
Uwe I've been consulting for decades and have never once been asked for such stars.
Honestly: the last time I was asked was last week. And when I answered (in another case a few months ago), "OK, I can add another 5 stars for you for p-values smaller than 0.5", they did not find it too funny. Best, Uwe
And when a clinical researcher puts a sentence in a study protocol that P<0.05 will be considered "significant" I get them to take it out. Frank Uwe Ligges-3 wrote
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch
<murdoch.duncan <at> gmail.com> writes:
[snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense
of stringsAsFactors=TRUE -- even someone who doesn't personally like
it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs. Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame(). And since you ask for the devil's advocate already, related to the subject line: Removing stars is horrible for consulting: With all those people from biology, medicine and other fields who even ask us questions in term of significance stars that are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get it from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars. There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values. Best, Uwe
They are "reaching for the stars". Pardon my jest, but I couldn't resist.
Ravi
-----Original Message-----
From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Uwe Ligges
Sent: Tuesday, February 12, 2013 10:01 AM
To: Frank Harrell
Cc: r-devel at r-project.org
Subject: Re: [Rd] Regression stars
On 12.02.2013 15:42, Frank Harrell wrote:
Uwe I've been consulting for decades and have never once been asked for such stars.
Honestly: last time I have been asked last week. And when I answered (in another case few months ago) "OK, I can add you another 5 stars for p values smaller than 0.5" they did not find it too funny. Best, Uwe
And when a clinical researcher puts a sentence in a study protocol that P<0.05 will be considered "significant" I get them to take it out. Frank Uwe Ligges-3 wrote
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch
<murdoch.duncan <at> gmail.com> writes:
[snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a
defense of stringsAsFactors=TRUE -- even someone who doesn't
personally like it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs. Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame(). And since you ask for the devil's advocate already, related to the subject line: Removing stars is horrible for consulting: With all those people from biology, medicine and other fields who even ask us questions in term of significance stars that are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get it from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars. There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values. Best, Uwe
On 13-02-12 09:20 AM, Uwe Ligges wrote:
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch <murdoch.duncan <at> gmail.com> writes: [snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense of stringsAsFactors=TRUE -- even someone who doesn't personally like it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs. Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame(). And since you ask for the devil's advocate already, related to the subject line: Removing stars is horrible for consulting: With all those people from biology, medicine and other fields who even ask us questions in term of significance stars that are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get it from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars. There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values. Best, Uwe
Thanks, Uwe. Now let me go one step further. Can you (or anyone) give a good argument **other than backward compatibility** for keeping the stringsAsFactors=TRUE argument on data.frame()?
I appreciate your distinction between data.frame() and read.table()'s use of stringsAsFactors, and I can see that there is some point for quick-and-dirty interactive use in setting all non-numeric variables to factors (arguing that wanting non-numerics as factors is somewhat more common than wanting them as strings).
It might be nice to add an optional stringsAsFactors (and check.names) argument to transform(): I've had to write my own Transform() function to allow the defaults to be overridden, since transform() calls data.frame() with the defaults. (Setting the stringsAsFactors option globally would work, although not for check.names.)
Ben Bolker
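Ben's own Transform() is not shown in the thread. The following is a hypothetical reconstruction of the idea, named Transform2 here to make clear it is not his code: it mirrors what transform() does for data frames, but passes stringsAsFactors and check.names through to data.frame(). The argument defaults chosen below are assumptions of this sketch.

```r
# Hypothetical helper (not part of base R, not Ben's actual function):
# like transform(), but the caller controls stringsAsFactors and check.names.
Transform2 <- function(.data, ..., stringsAsFactors = FALSE,
                       check.names = FALSE) {
  # Evaluate the column expressions inside the data frame
  e <- eval(substitute(list(...)), .data, parent.frame())
  tags <- names(e)
  inx <- match(tags, names(.data))
  matched <- !is.na(inx)
  # Overwrite columns that already exist
  if (any(matched)) .data[inx[matched]] <- e[matched]
  # Append genuinely new columns, with explicit settings
  if (!all(matched)) {
    .data <- do.call(data.frame,
                     c(list(.data), e[!matched],
                       list(stringsAsFactors = stringsAsFactors,
                            check.names = check.names)))
  }
  .data
}

d <- data.frame(x = 1:3, stringsAsFactors = FALSE)
d2 <- Transform2(d, `my label` = c("a", "b", "c"))  # new character column
d3 <- Transform2(d, x = x * 2)                      # overwrite existing column
str(d2)
```

In d2 the new column stays a character vector and keeps its non-syntactic name, which are the two things the defaults of data.frame() would otherwise change.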
On 12.02.2013 16:40, Ben Bolker wrote:
On 13-02-12 09:20 AM, Uwe Ligges wrote:
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch <murdoch.duncan <at> gmail.com> writes:
[snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense
of stringsAsFactors=TRUE -- even someone who doesn't personally like
it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs. Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame(). And since you ask for the devil's advocate already, related to the subject line: Removing stars is horrible for consulting: With all those people from biology, medicine and other fields who even ask us questions in term of significance stars that are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get it from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars. There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values. Best, Uwe
Thanks, Uwe. Now let me go one step further. Can you (or anyone) give a good argument **other than backward compatibility** for keeping the stringsAsFactors=TRUE argument on data.frame()?
No, I cannot, Uwe
I appreciate your distinction between data.frame() and read.table()'s use of stringsAsFactors, and I can see that there is some point for quick-and-dirty interactive use in setting all non-numeric variables to factors (arguing that wanting non-numerics as factors is somewhat more common than wanting them as strings). It might be nice to add an optional stringsAsFactors (and check.names) argument to transform(): I've had to write my own Transform() function to allow the defaults to be overridden, since transform() calls data.frame() with the defaults. (Setting the stringsAsFactors option globally would work, although not for check.names.) Ben Bolker
On 12/02/2013 10:40 AM, Ben Bolker wrote:
On 13-02-12 09:20 AM, Uwe Ligges wrote:
On 12.02.2013 14:54, Ben Bolker wrote:
Duncan Murdoch <murdoch.duncan <at> gmail.com> writes: [snip]
Regarding stringsAsFactors: I'm not going to defend keeping it as is, I'll let the people who like it defend it.
Would someone (anyone) like to come forward and give us a defense of stringsAsFactors=TRUE -- even someone who doesn't personally like it but would like to play devil's advocate?
Sure: I will have to change all my scripts, my teaching examples, my book, and lots of code examples for research and particularly consulting jobs. Personally, I think having stringsAsFactors=TRUE is not too bad for read.table() but less useful for data.frame(). And since you ask for the devil's advocate already, related to the subject line: Removing stars is horrible for consulting: With all those people from biology, medicine and other fields who even ask us questions in term of significance stars that are obviously very common for them. Many of them will certainly ask us for the stars, and ask us to switch to another software product once they do not get it from R. They may not be interested in being taught about the advantages or disadvantages of p-values or stars. There are different use cases of R, and I want to keep stars for consulting tasks where things have to be delivered within minutes. I am happy with or without for teaching, where I have the time and can easily talk about the sense and nonsense of p-values. Best, Uwe
Thanks, Uwe. Now let me go one step further. Can you (or anyone) give a good argument **other than backward compatibility** for keeping the stringsAsFactors=TRUE argument on data.frame()?
I can, under two assumptions:
1. We keep stringsAsFactors=TRUE on read.table().
2. We keep the stringsAsFactors argument in data.frame().
Under those assumptions, it would just be confusing to have opposite defaults.
(Just in case someone hasn't read all of this thread: I'd be happier to have the default be FALSE in both cases, but not until 3.1.x. For 3.0.x I think I'd just change the default value of default.stringsAsFactors() to FALSE, so people could easily get the old behaviour.)
Duncan Murdoch
I appreciate your distinction between data.frame() and read.table()'s use of stringsAsFactors, and I can see that there is some point for quick-and-dirty interactive use in setting all non-numeric variables to factors (arguing that wanting non-numerics as factors is somewhat more common than wanting them as strings). It might be nice to add an optional stringsAsFactors (and check.names) argument to transform(): I've had to write my own Transform() function to allow the defaults to be overridden, since transform() calls data.frame() with the defaults. (Setting the stringsAsFactors option globally would work, although not for check.names.) Ben Bolker
I thought that the default was the way it was for performance reasons. For large data.frames or repeated applications, using factors should be faster for non-trivial strings.
fs <- c('apple','peach','watermelon','spinach','persimmon','potato','kale')
n <- 1000000
a1 <- data.frame(f=sample(fs,n,replace=TRUE), x1=rnorm(n), x2=rnorm(n), stringsAsFactors=TRUE)
a2 <- data.frame(f=sample(fs,n,replace=TRUE), x1=rnorm(n), x2=rnorm(n), stringsAsFactors=FALSE)
fn <- function(i,x) x[x$f %in% c('kale','spinach'),]
system.time(z <- sapply(1:100, fn, a1))
   user  system elapsed
 19.614   4.037  24.649
system.time(z <- sapply(1:100, fn, a2))
   user  system elapsed
 19.726   7.715  36.761
On Feb 12, 2013, at 10:40 AM, Ben Bolker <bbolker at gmail.com> wrote:
Thanks, Uwe. Now let me go one step further. Can you (or anyone) give a good argument **other than backward compatibility** for keeping the stringsAsFactors=TRUE argument on data.frame()? I appreciate your distinction between data.frame() and read.table()'s use of stringsAsFactors, and I can see that there is some point for quick-and-dirty interactive use in setting all non-numeric variables to factors (arguing that wanting non-numerics as factors is somewhat more common than wanting them as strings). It might be nice to add an optional stringsAsFactors (and check.names) argument to transform(): I've had to write my own Transform() function to allow the defaults to be overridden, since transform() calls data.frame() with the defaults. (Setting the stringsAsFactors option globally would work, although not for check.names.) Ben Bolker
On Feb 12, 2013, at 17:05 , Brian Lee Yung Rowe wrote:
I thought that the default was the way it was for performance reasons. For large data.frames or repeated applications, using factors should be faster for non-trivial strings.
I think not. Historically, it's more like "In statistics we have two kinds of variables, numerical and categorical. OK, so we have the occasional truly character-type variables like name and address, let's handle those as a special case".
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
On Feb 12, 2013, at 11:05 AM, Brian Lee Yung Rowe wrote:
I thought that the default was the way it was for performance reasons. For large data.frames or repeated applications, using factors should be faster for non-trivial strings.
fs <- c('apple','peach','watermelon','spinach','persimmon','potato','kale')
n <- 1000000
a1 <- data.frame(f=sample(fs,n,replace=TRUE), x1=rnorm(n), x2=rnorm(n), stringsAsFactors=TRUE)
a2 <- data.frame(f=sample(fs,n,replace=TRUE), x1=rnorm(n), x2=rnorm(n), stringsAsFactors=FALSE)
fn <- function(i,x) x[x$f %in% c('kale','spinach'),]
system.time(z <- sapply(1:100, fn, a1))
   user  system elapsed
 19.614   4.037  24.649
system.time(z <- sapply(1:100, fn, a2))
   user  system elapsed
 19.726   7.715  36.761
Not really:
system.time(z <- sapply(1:100, fn, a1))
   user  system elapsed
 13.780   0.444  14.229
rm(z)
gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  182113  9.8     407500   21.8    337655   18.1
Vcells 5789638 44.2  133982285 1022.3 163019778 1243.8
system.time(z <- sapply(1:100, fn, a2))
   user  system elapsed
 13.201   0.668  13.873
But your test is bogus: %in% uses match(), which converts factors to character vectors anyway, so in your case you're just measuring noise on your system. Character vectors are always faster in your example. The reason is that strings in R are hashed, so character vectors are technically very similar to factors, just with faster access (because they don't need to go through the integer indirection). On 32-bit, strings are in theory always faster than factors; on 64-bit they use double the size, so they may or may not be faster depending on how you hit the cache, etc. Anyway, in modern R versions you're much better off using character vectors than factors for any processing, so stringsAsFactors=FALSE is what I use exclusively.
Cheers, Simon
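A small sketch of the point about %in% follows. It only checks results, not timings: the factor and the character vector give identical answers because match() works on the character values either way. The last two lines show one way to stay on the integer codes instead; the variable names are made up for illustration.

```r
f_chr  <- c("kale", "potato", "spinach", "kale", "apple")
f_fac  <- factor(f_chr)
wanted <- c("kale", "spinach")

# %in% is defined in terms of match(), which converts factors to character.
via_chr <- f_chr %in% wanted
via_fac <- f_fac %in% wanted
same <- identical(via_chr, via_fac)

# To compare on the integer codes, look the wanted values up in the levels first.
wanted_codes <- match(wanted, levels(f_fac))
via_codes <- as.integer(f_fac) %in% wanted_codes
```

Whether the integer route is actually faster depends on the data and the platform, as discussed above, so it should be timed rather than assumed.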
On 02/12/2013 08:20 AM, peter dalgaard wrote:
On Feb 12, 2013, at 17:05 , Brian Lee Yung Rowe wrote:
I thought that the default was the way it was for performance reasons. For large data.frames or repeated applications, using factors should be faster for non-trivial strings.
I think not. Historically, it's more like "In statistics we have two kinds of variables, numerical and categorical. OK, so we have the occasional truly character-type variables like name and address, let's handle those as a special case".
<sarcasm> Since character vectors are sooooo bad and people use them where they should instead use a factor, I propose to go all the way by adding the stringsAsFactors arg to character() too. That way people are put on the right track from the very start. </sarcasm>
No, seriously: if my variable is categorical, it's already in a factor and that's how I pass it to data.frame(). But if I have it in a character vector, it's because that's how I want it. It's my choice. How could anybody ever think that having data.frame() alter his/her data is a good thing? Please *remove* the stringsAsFactors arg of data.frame() in R 3.0. You'll do a big favor to your user base.
Thanks, H.
Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
On 12/02/2013 1:47 PM, Hervé Pagès wrote:
On 02/12/2013 08:20 AM, peter dalgaard wrote:
On Feb 12, 2013, at 17:05 , Brian Lee Yung Rowe wrote:
I thought that the default was the way it was for performance reasons. For large data.frames or repeated applications, using factors should be faster for non-trivial strings.
I think not. Historically, it's more like "In statistics we have two kinds of variables, numerical and categorical. OK, so we have the occasional truly character-type variables like name and address, let's handle those as a special case".
<sarcasm> Since character vectors are sooooo bad and people use them where they should instead use a factor, I propose to go all the way by adding the stringsAsFactors arg to character() too. That way people are put on the right track from the very start. </sarcasm>
I think you are misreading what Peter wrote. He wasn't defending that point of view, he was describing it.
No seriously, if my variable is categorical, it's already in a factor and that's how I pass it to data.frame(). But if I have it in a character vector, it's because that's how I want it. It's my choice. How could anybody ever think that having data.frame() alter his/her data is a good thing? Please *remove* the stringsAsFactors arg of data.frame() in R 3.0. You'll do a big favor to your user base.
That's a really bad suggestion -- it would break code for people who set stringsAsFactors=FALSE as well as those who rely on the current default behaviour. We certainly won't do that. Duncan Murdoch
Hi Duncan,
On 02/12/2013 11:19 AM, Duncan Murdoch wrote:
On 12/02/2013 1:47 PM, Hervé Pagès wrote:
On 02/12/2013 08:20 AM, peter dalgaard wrote:
On Feb 12, 2013, at 17:05 , Brian Lee Yung Rowe wrote:
I thought that the default was the way it was for performance reasons. For large data.frames or repeated applications, using factors should be faster for non-trivial strings.
I think not. Historically, it's more like "In statistics we have two kinds of variables, numerical and categorical. OK, so we have the occasional truly character-type variables like name and address, let's handle those as a special case".
<sarcasm> Since character vectors are sooooo bad and people use them where they should instead use a factor, I propose to go all the way by adding the stringsAsFactors arg to character() too. That way people are put on the right track from the very start. </sarcasm>
I think you are misreading what Peter wrote. He wasn't defending that point of view, he was describing it.
I was replying to the thread, not to Peter in particular. Sorry if it sounded otherwise.
No seriously, if my variable is categorical, it's already in a factor and that's how I pass it to data.frame(). But if I have it in a character vector, it's because that's how I want it. It's my choice. How could anybody ever think that having data.frame() alter his/her data is a good thing? Please *remove* the stringsAsFactors arg of data.frame() in R 3.0. You'll do a big favor to your user base.
That's a really bad suggestion -- it would break code for people who set stringsAsFactors=FALSE as well as those who rely on the current default behaviour. We certainly won't do that.
But since there seems to be a discussion about making some changes to the stringsAsFactors "feature", I was hoping you would consider that one too. Doing the right thing sometimes requires breaking people's code, sadly! Cheers, H.
Duncan Murdoch
Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
On Feb 12, 2013, at 20:19 , Duncan Murdoch wrote:
I think you are misreading what Peter wrote. He wasn't defending that point of view, he was describing it.
Yes. However, that being said, there is the point that the whole thing has been designed to work within the paradigm that I described, and, for better or worse, things are reasonably coherent and consistent within that framework. The thing that always worries me, when people get bothered by some aspect of software design, is that, if you change only that aspect, you may find yourself with something that is incoherent and inconsistent. I have quite a few times found myself realizing that "Uncle John was right after all". For instance, if you change the paradigm to say that "character variables are character, unless explicitly turned into factors", and then ameliorate the inconvenience by changing code that relies on factors to convert character variables on the fly, then you will lose the otherwise automatic consistency of level sets between subsets of data. (So, the math department not only has zero female professors, the entire female gender ceases to exist for that subgroup.)
Peter Dalgaard, Professor Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
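The level-set point can be illustrated with a toy example. This is only a sketch with made-up data: a column converted to a factor once, up front, keeps its full level set in every subset, while a character column converted on the fly inside the subset only ever sees the values present there.

```r
# Made-up staff table; stringsAsFactors is set explicitly so the example
# behaves the same regardless of the default in the running R version.
staff <- data.frame(
  dept = c("math", "math", "stats", "stats"),
  sex  = c("M", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Convert once, on the full data: levels are "F" and "M".
staff$sex_factor <- factor(staff$sex)

math <- staff[staff$dept == "math", ]

kept <- table(math$sex_factor)   # F is still a level, with count 0
lost <- table(factor(math$sex))  # converted late: only M exists
print(kept)
print(lost)
```

In the first table the math department has zero women; in the second the category has disappeared altogether.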
On 13-02-13 7:25 AM, peter dalgaard wrote:
On Feb 12, 2013, at 20:19 , Duncan Murdoch wrote:
I think you are misreading what Peter wrote. He wasn't defending that point of view, he was describing it.
Yes. However, that being said, there is the point that the whole thing has been designed to work within the paradigm that I described, and, for better or worse, things are reasonably coherent and consistent within that framework. The thing that always worries me, when people get bothered by some aspect of software design, is that, if you change only that aspect, you may find yourself with something that is incoherent and inconsistent. I have quite a few times found myself realizing that "Uncle John was right after all". For instance, if you change the paradigm to say that "character variables are character, unless explicitly turned into factors", and then ameliorate the inconvenience by changing code that relies on factors to convert character variables on the fly, then you will lose the otherwise automatic consistency of level sets between subsets of data. (So, the math department not only has zero female professors, the entire female gender ceases to exist for that subgroup.)
Sure, if I have a file that contains a column named Sex and it is all M,
I can't expect R to automatically know that there is another
possibility. That's always been a problem. If we automatically convert
the data to factors when we read, then maybe we'll be lucky and some
other part of that file that we're planning to throw away will contain
an F, and we'll automatically construct the right factor.
(Except we don't: lm and glm will throw away the F level if there are
none in the subset we pass to them, factor or not, because they use
drop.unused.levels=TRUE in their call to model.frame().)
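To see the effect of that argument in isolation, here is a small sketch that calls model.frame() directly on a subset, with and without drop.unused.levels. The data are invented; lm() itself is not called because a one-level factor would make the fit fail on contrasts.

```r
d <- data.frame(
  y   = c(1, 2, 3, 4, 5, 6),
  sex = factor(c("M", "M", "M", "F", "F", "F"))
)
males <- d[d$sex == "M", ]

# The subset still carries both levels.
levels(males$sex)

# model.frame() keeps them by default ...
mf_keep <- model.frame(y ~ sex, data = males)
# ... but drops the unused F level when asked to, as lm() and glm() do.
mf_drop <- model.frame(y ~ sex, data = males, drop.unused.levels = TRUE)

levels(mf_keep$sex)
levels(mf_drop$sex)
```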
There's also the possibility that there will be m and f in there, and
we'll get it wrong.
In R 2.15.2, we do the automatic conversion with a warning, but we do it
wrong, which leads to the inconsistency that Bill Dunlap reported.
R-devel drops the warning and comes closer to getting it right, but it's
really an impossible problem: if we never see an F, we'll never set the
levels of the factor properly. If we see a typo like m or f and don't
realize it's a typo, we'll have more than two Sex values.
The current R-devel implementation delays the conversion as much as it
can, and maybe it delays it too far. It allows model.frame() to
continue to return character columns, as it does in 2.15.2. This was to
support xtabs(), which treats character columns differently from
factors, and other unforeseen uses. Another possibility would be to add
an argument ("stringsAsFactors"?) to model.frame() to let modelling
functions choose whether they want factors or not. xtabs() would say
no, lm() and glm() would say yes. I think the current implementation is
preferable because it won't require changes to well written existing
functions.
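A quick way to check this behaviour in a given R version is sketched below, using invented data. It assumes a version where the conversion described here is in place (R 3.0.0 or later): model.frame() hands back the character column unchanged, and lm() does the conversion later and records the levels it used.

```r
d <- data.frame(
  y = c(1, 2, 3, 4),
  g = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# The model frame still holds a character column.
mf <- model.frame(y ~ g, data = d)
class(mf$g)

# lm() treats g as categorical and stores the levels it derived.
fit <- lm(y ~ g, data = d)
fit$xlevels
```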
With the current R-devel implementation, it is easier than in 2.15.2 to
get errors thrown when the auto-conversion goes wrong. I don't know of
any examples where you get incorrect results. I think this is an
improvement.
I'd appreciate hearing of any bugs in it.
Duncan Murdoch