Regression stars

23 messages · Norm Matloff, Tim Triche, Jr., Ben Bolker +8 more

#
Thanks for bringing this up, Frank.

Since many of us are "educators," I'd like to suggest a bolder approach.
Discontinue even offering the stars as an option.  Sadly, we can't stop
reporting p-values, as the world expects them, but does R need to cater
to that attitude by offering star display?  For that matter, why not
have R report confidence intervals as a default?
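
For readers who want to try this today, here is a minimal sketch of both ideas: switching the stars off with the existing show.signif.stars option and asking for confidence intervals explicitly. The model and the built-in cars dataset are only illustrative assumptions, not part of the proposal.

```r
# Fit a small model on a dataset that ships with base R (illustrative only).
fit <- lm(dist ~ speed, data = cars)

# Turn the stars off for this session; options() returns the old setting.
old <- options(show.signif.stars = FALSE)
print(summary(fit))   # coefficient table printed without the stars column

# Report interval estimates alongside (or instead of) the p-values.
ci <- confint(fit, level = 0.95)
print(ci)

options(old)          # restore the previous setting
```

Restoring the old option at the end keeps the change local to the example.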

Many years ago, I wrote a short textbook on stat, and included a
substantial section on the dangers of significance testing.  All three
internal reviewers liked it, but the funny part is that all three said,
"I agree with this, but no one else will." :-)

Norm
#
On 13-02-09 3:49 PM, Tim Triche, Jr. wrote:
Both of these were discussed by R Core.  I think it's unlikely the 
default for stringsAsFactors will be changed (some R Core members like 
the current behaviour), but it's fairly likely the show.signif.stars 
default will change.  (That's if someone gets around to it:  I 
personally don't care about that one.  P-values are commonly used 
statistics, and the stars are just a simple graphical display of them. 
I find some p-values to be useful, and the display to be harmless.)

I think it's really unlikely the more extreme changes (i.e. dropping 
show.signif.stars completely, or dropping p-values) will happen.

Regarding stringsAsFactors:  I'm not going to defend keeping it as is, 
I'll let the people who like it defend it.  What I will likely do is 
make a few changes so that character vectors are automatically changed 
to factors in modelling functions, so that operating with 
stringsAsFactors=FALSE doesn't trigger silly warnings.
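
A small sketch of what that looks like from the user's side, assuming an R version in which the change described here is in place. The data and column names are made up for illustration.

```r
# A data frame whose grouping column is deliberately kept as character.
d <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  g = c("a", "a", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)
stopifnot(is.character(d$g))

# The modelling function treats the character column as categorical.
fit <- lm(y ~ g, data = d)
print(coef(fit))      # one intercept plus one coefficient per non-reference group
print(fit$xlevels)    # the levels the fit recorded for g
```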

Duncan Murdoch
1 day later
#
Duncan Murdoch <murdoch.duncan <at> gmail.com> writes:

  [snip]
Would someone (anyone) like to come forward and give us a defense
of stringsAsFactors=TRUE -- even someone who doesn't personally like
it but would like to play devil's advocate?
[apologies for snipping context: "gmane made me do it"]
#
On 12.02.2013 14:54, Ben Bolker wrote:
Sure:
I will have to change all my scripts, my teaching examples, my book, and 
lots of code examples for research and particularly consulting jobs.

Personally, I think having stringsAsFactors=TRUE is not too bad for 
read.table() but less useful for data.frame().
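
To make the distinction concrete, here is a hedged sketch that passes stringsAsFactors explicitly to both functions, so it does not depend on whatever the defaults are in a given R version (they were TRUE at the time of this thread; R 4.0.0 later changed both to FALSE). The toy data is an assumption for illustration.

```r
txt <- "id group
1 a
2 b
3 a"

# read.table(): the same text read both ways.
rt_factor <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
rt_char   <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# data.frame(): the same character vector stored both ways.
grp <- c("a", "b", "a")
df_factor <- data.frame(group = grp, stringsAsFactors = TRUE)
df_char   <- data.frame(group = grp, stringsAsFactors = FALSE)

# Show the class of the group column in each case.
print(sapply(
  list(rt_factor = rt_factor, rt_char = rt_char,
       df_factor = df_factor, df_char = df_char),
  function(d) class(d$group)
))
```

Scripts that spell out the argument like this keep working regardless of which default is chosen.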

And since you ask for the devil's advocate already, related to the 
subject line: Removing stars is horrible for consulting: With all those 
people from biology, medicine and other fields who even ask us questions 
in terms of significance stars that are obviously very common for them. 
Many of them will certainly ask us for the stars, and ask us to switch 
to another software product once they do not get it from R. They may not 
be interested in being taught about the advantages or disadvantages of 
p-values or stars.

There are different use cases of R, and I want to keep stars for 
consulting tasks where things have to be delivered within minutes. I am 
happy with or without for teaching, where I have the time and can easily 
talk about the sense and nonsense of p-values.


Best,
Uwe
#
Uwe, I've been consulting for decades and have never once been asked for such
stars.  And when a clinical researcher puts a sentence in a study protocol
that P<0.05 will be considered "significant" I get them to take it out.
Frank

-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: http://r.789695.n4.nabble.com/Regression-stars-tp4657795p4658268.html
Sent from the R devel mailing list archive at Nabble.com.
#
On 12/02/2013 9:20 AM, Uwe Ligges wrote:
Could you post an example of a non-trivial one?  (By trivial, I mean one 
that says "data.frame() converts character vectors to factors". 
Obviously that would need to change.  I mean one that just assumes 
current behaviour, and would be broken by the change.)

Duncan Murdoch
#
I think that we should use P < .03 (which approximates the probability of 5 consecutive heads) for assigning significance!

Ravi

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
#
On 12.02.2013 15:42, Frank Harrell wrote:
Honestly: the last time I was asked was last week.

And when I answered (in another case a few months ago) "OK, I can add 
another 5 stars for you for p values smaller than 0.5" they did not find it too 
funny.

Best,
Uwe
#
They are "reaching for the stars".  Pardon my jest, but I couldn't resist. 

Ravi

#
On 13-02-12 09:20 AM, Uwe Ligges wrote:
Thanks, Uwe.
  Now let me go one step farther.

  Can you (or anyone) give a good argument **other than backward
compatibility** for keeping the stringsAsFactors=TRUE argument on
data.frame()?

  I appreciate your distinction between data.frame() and read.table()'s
use of stringsAsFactors, and I can see that there is some point for
quick-and-dirty interactive use in setting all non-numeric variables to
factors (arguing that wanting non-numerics as factors is somewhat more
common than wanting them as strings).

  It might be nice to add an optional stringsAsFactors (and check.names)
argument to transform(): I've had to write my own Transform() function
to allow the defaults to be overridden, since transform() calls
data.frame() with the defaults.  (Setting the stringsAsFactors option
globally would work, although not for check.names.)
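
A rough sketch of what such a wrapper could look like. This is not the Transform() function mentioned above (which is not shown in the thread); the name, argument names, and defaults here are hypothetical, and the sketch simply forwards the two arguments to data.frame().

```r
# Hypothetical wrapper: like transform(), but lets the caller control
# stringsAsFactors and check.names instead of inheriting the defaults.
Transform <- function(.data, ..., stringsAsFactors = FALSE, check.names = FALSE) {
  # Evaluate the new-column expressions with the data frame's columns in scope.
  new_cols <- eval(substitute(list(...)), .data, parent.frame())
  # Keep the untouched columns; replaced columns move to the end in this sketch.
  kept <- .data[setdiff(names(.data), names(new_cols))]
  do.call(data.frame, c(kept, new_cols,
                        list(stringsAsFactors = stringsAsFactors,
                             check.names = check.names)))
}

df  <- data.frame(x = 1:3)
out <- Transform(df, `my label` = letters[x])
print(names(out))                 # the non-syntactic name is left alone
print(class(out[["my label"]]))   # the new column stays character
```

Note that if stringsAsFactors = TRUE is passed, this sketch would also convert character columns that were already in the data frame, since everything goes through one data.frame() call.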

  Ben Bolker
#
On 12.02.2013 16:40, Ben Bolker wrote:
No, I cannot,
Uwe
#
On 12/02/2013 10:40 AM, Ben Bolker wrote:
I can, under two assumptions:

   1.  We keep stringsAsFactors=TRUE on read.table().
   2.  We keep the stringsAsFactors argument in data.frame().

Under those assumptions, it would just be confusing to have opposite 
defaults.  (Just in case someone hasn't read all of this thread: I'd be 
happier to have the default be FALSE in both cases, but not until 
3.1.x.  For 3.0.x I think I'd just change the default value of 
default.stringsAsFactors() to FALSE, so people could easily get the old 
behaviour.)

Duncan Murdoch
#
I thought that the default was the way it was for performance reasons. For large data.frames or repeated applications, using factors should be faster for non-trivial strings.
   user  system elapsed 
 19.614   4.037  24.649
   user  system elapsed 
 19.726   7.715  36.761
On Feb 12, 2013, at 10:40 AM, Ben Bolker <bbolker at gmail.com> wrote:
#
On Feb 12, 2013, at 17:05 , Brian Lee Yung Rowe wrote:

            
I think not. Historically, it's more like "In statistics we have two kinds of variables, numerical and categorical. OK, so we have the occasional truly character-type variables like name and address, let's handle those as a special case".
#
On Feb 12, 2013, at 11:05 AM, Brian Lee Yung Rowe wrote:

            
Not really:
   user  system elapsed 
 13.780   0.444  14.229
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  182113  9.8     407500   21.8    337655   18.1
Vcells 5789638 44.2  133982285 1022.3 163019778 1243.8
   user  system elapsed 
 13.201   0.668  13.873 


But your test is bogus, because %in% uses match(), which converts factors to character vectors anyway. So in your case you're just measuring noise in your system; character vectors are always faster in your example.

The reason is that in R strings are hashed, so character vectors are technically very similar to factors, just with faster access (because they don't need to go through the integer indirection). On 32-bit, strings are in theory always faster than factors; on 64-bit they use double the size, so they may or may not be faster depending on how you hit the cache etc. Anyway, in modern R versions you're much better off using character vectors than factors for any processing, so stringsAsFactors=FALSE is what I use exclusively.
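
A small sketch of the point about %in%: a factor is stored as integer codes plus a levels attribute, and %in% gives the same answer for the factor as for the equivalent character vector. The vectors and the lookup key are made up for illustration, and the timings at the end are only a way to measure on your own machine, not a claim about which is faster.

```r
f <- factor(c("b", "a", "b", "c"))
print(typeof(f))     # the codes are stored as integers
print(unclass(f))    # integer codes plus a "levels" attribute

# %in% (via match()) yields the same result for the factor and its strings.
s <- as.character(f)
same <- identical(f %in% c("b", "c"), s %in% c("b", "c"))
print(same)

# Timing sketch on a larger vector; results will vary by system.
set.seed(1)
big_s <- sample(sprintf("id%05d", 1:5000), 1e5, replace = TRUE)
big_f <- factor(big_s)
print(system.time(for (i in 1:20) big_s %in% "id00042"))
print(system.time(for (i in 1:20) big_f %in% "id00042"))
```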

Cheers,
Simon
#
On 02/12/2013 08:20 AM, peter dalgaard wrote:
<sarcasm>

Since character vectors are sooooo bad and people use them where
they should instead use a factor, I propose to go all the way and
add the stringsAsFactors arg to character() too. That way
people are put on the right track from the very start.

</sarcasm>

No seriously, if my variable is categorical, it's already in a factor
and that's how I pass it to data.frame(). But if I have it in a
character vector, it's because that's how I want it. It's my choice.
How could anybody ever think that having data.frame() alter his/her
data is a good thing?

Please *remove* the stringsAsFactors arg of data.frame() in R 3.0.
You'll do a big favor to your user base.

Thanks,
H.

  
    
#
On 12/02/2013 1:47 PM, Hervé Pagès wrote:
I think you are misreading what Peter wrote.  He wasn't defending that 
point of view, he was describing it.
That's a really bad suggestion -- it would break code for people who set 
stringsAsFactors=FALSE as well as those who rely on the current default 
behaviour.   We certainly won't do that.

Duncan Murdoch
#
Hi Duncan,
On 02/12/2013 11:19 AM, Duncan Murdoch wrote:
I was replying to the thread, not to Peter in particular. Sorry if it
sounded otherwise.
But since there seems to be a discussion about making some changes to
the stringsAsFactors "feature", I was hoping you would consider that
one too.  Doing the right thing sometimes requires breaking people's
code, sadly!

Cheers,
H.

  
    
#
On Feb 12, 2013, at 20:19 , Duncan Murdoch wrote:

            
Yes. However, that being said, there is the point that the whole thing has been designed to work within the paradigm that I described, and, for better or worse, things are reasonably coherent and consistent within that framework.

The thing that always worries me, when people get bothered by some aspect of software design, is that, if you change only that aspect, you may find yourself with something that is incoherent and inconsistent. I have quite a few times found myself realizing that "Uncle John was right after all".  

For instance, if you change the paradigm to say that "character variables are character, unless explicitly turned into factors", and then ameliorate the inconvenience by changing code that relies on factors to convert character variables on the fly, then you will lose the otherwise automatic consistency of level sets between subsets of data. (So, the math department not only has zero female professors, the entire female gender ceases to exist for that subgroup.)
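
The level-set point can be sketched as follows, with a made-up staff table: a factor created once on the full data carries its levels into every subset, whereas converting a character column on the fly inside a subset only sees the values present there.

```r
staff <- data.frame(
  dept = c("math", "math", "bio", "bio"),
  sex  = c("M", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Factor built once on the full data: levels are F and M everywhere.
staff$sex_f <- factor(staff$sex)

math <- staff[staff$dept == "math", ]

tab_factor <- table(math$sex_f)        # keeps a zero count for F
tab_char   <- table(factor(math$sex))  # on-the-fly conversion: F is gone
print(tab_factor)
print(tab_char)
```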
#
On 13-02-13 7:25 AM, peter dalgaard wrote:
Sure, if I have a file that contains a column named Sex and it is all M,
I can't expect R to automatically know that there is another
possibility.  That's always been a problem.  If we automatically convert
the data to factors when we read, then maybe we'll be lucky and some
other part of that file that we're planning to throw away will contain
an F, and we'll automatically construct the right factor.
(Except we don't:  lm and glm will throw away the F level if there are
none in the subset we pass to them, factor or not, because they use
drop.unused.levels=TRUE in their call to model.frame().)
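
A minimal sketch of that behaviour, calling model.frame() directly on made-up data: the subset still carries the F level, and it disappears only when drop.unused.levels = TRUE is passed.

```r
d <- data.frame(
  y   = c(1, 2, 3, 4, 5, 6),
  sex = factor(c("M", "M", "M", "F", "F", "F"))
)

# Subsetting rows keeps both levels on the factor.
males <- d[d$sex == "M", ]
print(levels(males$sex))

# model.frame() keeps unused levels by default ...
mf_keep <- model.frame(y ~ sex, data = males)
# ... and drops them when asked to.
mf_drop <- model.frame(y ~ sex, data = males, drop.unused.levels = TRUE)

print(levels(mf_keep$sex))
print(levels(mf_drop$sex))
```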

There's also the possibility that there will be m and f in there, and
we'll get it wrong.

In R 2.15.2, we do the automatic conversion with a warning, but we do it
wrong, which leads to the inconsistency that Bill Dunlap reported.
R-devel drops the warning and comes closer to getting it right, but it's
really an impossible problem:  if we never see an F, we'll never set the
levels of the factor properly.  If we see a typo like m or f and don't
realize it's a typo, we'll have more than two Sex values.
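
Both failure modes can be sketched with made-up values. The remedy shown (normalising case and stating the levels explicitly) is one possible approach and assumes the analyst knows the intended level set, which automatic conversion cannot.

```r
# Mixed-case entries turn two categories into four levels.
sex_raw <- c("M", "F", "m", "M", "f")
print(nlevels(factor(sex_raw)))

# Normalise the case and state the intended levels explicitly.
sex_clean <- factor(toupper(sex_raw), levels = c("F", "M"))
print(table(sex_clean))

# Explicit levels also cover the case where F never appears in the data.
only_m <- factor(c("M", "M", "M"), levels = c("F", "M"))
print(table(only_m))
```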

The current R-devel implementation delays the conversion as much as it
can, and maybe it delays it too far.  It allows model.frame() to
continue to return character columns, as it does in 2.15.2.  This was to
support xtabs(), which treats character columns differently from
factors, and other unforeseen uses.  Another possibility would be to add
an argument ("stringsAsFactors"?) to model.frame() to let modelling
functions choose whether they want factors or not.  xtabs() would say
no, lm() and glm() would say yes.  I think the current implementation is
preferable because it won't require changes to well written existing
functions.

With the current R-devel implementation, it is easier than in 2.15.2 to
get errors thrown when the auto-conversion goes wrong.  I don't know of
any examples where you get incorrect results.  I think this is an
improvement.

I'd appreciate hearing of any bugs in it.

Duncan Murdoch