
Incorrect p value for binom.test?

4 messages · Michael Grant, Peter Dalgaard, Albyn Jones +1 more

#
Michael Grant wrote:
Yes. Or maybe, it is a matter of definition. The problem is that

 > (0:25)[dbinom(0:25,25,.061) <= dbinom(10,25,.061)]
  [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

so with R's definition of "more extreme", all such values are in the 
upper tail.

Actually, if you look at the actual distribution, I think you'll agree 
that it is rather difficult to define a lower tail with positive 
probability that corresponds to X >= 10.

 > round(dbinom(0:25,25,.061),6)
  [1] 0.207319 0.336701 0.262476 0.130726 0.046708 0.012744
  [7] 0.002760 0.000487 0.000071 0.000009 0.000001 0.000000
[13] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
[19] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
[25] 0.000000 0.000000

In any case, you would be hard pressed to find a subset of 0:25 that has 
the probability that SAS and your textbook claim as the p value.
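Peter's two computations can be reproduced outside R; the sketch below is in Python, with the binomial pmf hand-rolled via math.comb to stand in for R's dbinom (the constants 25, 0.061, and 10 are from the thread):

```python
from math import comb

def dbinom(x, n, p):
    # binomial pmf, mirroring R's dbinom(x, n, p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p0, obs = 25, 0.061, 10

# "More extreme" in binom.test's sense: outcomes at most as probable
# under H0 as the observed count of 10
extreme = [x for x in range(n + 1) if dbinom(x, n, p0) <= dbinom(obs, n, p0)]
print(extreme)  # 10 through 25 -- every such outcome is in the upper tail

# Total probability of that set: the p-value binom.test reports
p_value = sum(dbinom(x, n, p0) for x in extreme)
print(p_value)
```

With the mode of the distribution at 1 and the pmf falling off steeply, every count from 10 upward is less probable than every count below 10, which is why the "more extreme" set is purely an upper tail.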

#
The computation 2*sum(dbinom(c(10:25),25,0.061)) does not correspond
to any reasonable definition of p-value.  For a symmetric
distribution, it is fine to use 2 times the tail area of one tail.
For an asymmetric distribution, this is silly.

The standard definition given in elementary texts is usually something like

     the probability of observing a test statistic at least as 
      extreme as the observed value

or more formally as 

      the smallest significance level at which the observed result would
      lead to rejection of the null hypothesis

Either definition requires further decisions (what does "at least as
extreme" mean?).  In an asymmetric distribution, "at least as far from
E(X|H0)"  is not a good interpretation, since deviations in one direction 
may be much less probable than deviations in the other direction.  

Peter's interpretation corresponds both to the interpretation of "at
least as extreme" as "at least as improbable", and also to the
"smallest significance level" interpretation for the test implemented
in binom.test, i.e. the Clopper-Pearson "exact" test.  Twice the upper
tail area corresponds to neither.  The fact that it is implemented in
SAS and appears in a textbook does not rescue it from that fundamental
failure to make sense.
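The contrast can be made concrete. Below is a Python sketch (pmf hand-rolled with math.comb in place of R's dbinom) computing both quantities: the "at least as improbable" p-value that binom.test uses, and the doubled upper tail:

```python
from math import comb

def dbinom(x, n, p):
    # binomial pmf, mirroring R's dbinom(x, n, p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p0, obs = 25, 0.061, 10

# binom.test's p-value: total probability of outcomes at least as
# improbable under H0 as the observed count
p_minlike = sum(dbinom(x, n, p0) for x in range(n + 1)
                if dbinom(x, n, p0) <= dbinom(obs, n, p0))

# The SAS/textbook value: twice the upper tail area
p_doubled = 2 * sum(dbinom(x, n, p0) for x in range(obs, n + 1))

print(p_minlike, p_doubled)
```

In this particular example the "at least as improbable" set is exactly the upper tail 10:25, so the doubled figure is literally twice the exact p-value; in a distribution with a heavier opposite tail the two definitions would include different outcomes.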

albyn
On Thu, Feb 05, 2009 at 09:48:11PM +0100, Peter Dalgaard wrote:
#
On Thu, 5 Feb 2009, Albyn Jones wrote:

            
"Silly" is much too strong. There is a perfectly good reason to compare 2*sum(dbinom(c(10:25),25,0.061)) to a two-sided test threshold.

The argument is that what we are really doing in usual two-sided location tests is two one-sided tests at alpha/2 rather than one two-sided test at alpha. The null hypothesis is being compared to two different alternatives (better or worse vs same) and the decisions about the future would be different depending on which tail we ended up using.

This argument says that we should compare a one-sided tail area such as sum(dbinom(c(10:25),25,0.061)) to alpha/2; equivalently, that we should compare 2*sum(dbinom(c(10:25),25,0.061)) to alpha [or to informal standards for strength of evidence, or whatever you typically do with p-values]. I'm not saying that this is the only sensible way to handle and interpret p-values in two-sided tests, but I really don't think it can be dismissed as 'silly'.
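The decision-rule equivalence Thomas describes can be written out; a Python sketch (pmf hand-rolled via math.comb, alpha = 0.05 chosen purely for illustration):

```python
from math import comb

def dbinom(x, n, p):
    # binomial pmf, mirroring R's dbinom(x, n, p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

alpha = 0.05  # illustrative two-sided level, not from the thread

# One-sided upper tail area for the observed count of 10
upper_tail = sum(dbinom(x, 25, 0.061) for x in range(10, 26))

# One-sided test against the upward alternative at level alpha/2 ...
reject_split = upper_tail <= alpha / 2

# ... is the same decision as comparing the doubled tail to alpha
reject_doubled = 2 * upper_tail <= alpha

print(reject_split, reject_doubled)  # identical by construction
```

So the doubled tail area is not itself a probability of any event; it is a monotone transformation of the one-sided p-value, rescaled so it can be compared directly to the usual two-sided alpha.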


      -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle


PS: Daniel Dennett has described as an occupational hazard for philosophers the tendency to go from "I can't imagine X" to "No one can imagine X" to "X is inconceivable".  The transition from "I can't imagine how X would be used" to "X is useless" is somewhat similar, as is the Extreme Bayesian transition from "X wasn't derived by a formal consideration of posterior expected loss" to "X can't be derived by a formal consideration of posterior expected loss" to "X is incoherent". Why, yes, I am grumpy about a reviewer. How did you guess?