
normality test

15 messages · Romain Francois, Pieter Provoost, Frank E Harrell Jr +6 more

#
On 28.04.2005 13:16, Pieter Provoost wrote:
Hello,

You seem to have misunderstood. The null hypothesis in shapiro.test is 
**normality**; if your p-value is very small, then the data are **not** 
normal.

Look carefully at ?shapiro.test and try again. Furthermore, normality 
tests are not very powerful. Consider using ?qqnorm and ?qqline instead.
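For concreteness, here is a small sketch of the suggestion, using made-up data in place of the original poster's series:

```r
# Hypothetical data standing in for Pieter's series: lognormal,
# so clearly non-normal on the raw scale.
set.seed(1)
x <- rlnorm(50, meanlog = -2, sdlog = 1)

shapiro.test(x)       # very small p-value: reject the null of normality
shapiro.test(log(x))  # compare: log(x) is normal by construction

# The graphical check, often more informative than the test itself:
qqnorm(log(x)); qqline(log(x))
```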

Romain
#
----- Original Message -----
From: "Romain Francois" <francoisromain at free.fr>
To: "Pieter Provoost" <pieterprovoost at gmail.com>; "RHELP"
<R-help at stat.math.ethz.ch>
Sent: Thursday, April 28, 2005 2:03 PM
Subject: Re: [R] normality test
When I make a histogram of the data the distribution doesn't seem to be
normal at all (rather lognormal), but still no matter what test I use
(Shapiro, Anderson-Darling,...) it returns a very small p value (which as
far as I know means that the distribution is normal).
Thanks, I thought null hypothesis for these tests was "no normality"...

Pieter
#
Romain Francois wrote:
Usually (but not always) doing tests of normality reflects a lack of 
understanding of the power of rank tests, and an assumption of high 
power for the normality tests (qq plots don't always help with that 
because of their subjectivity).  When possible it's good to choose a 
robust method.  Also, doing pre-testing for normality can affect the 
type I error of the overall analysis.
#
For my money, Frank's comment should go into fortunes.  It seems a
rather Sisyphean battle to keep the lessons of robustness on the
statistical table, but it is nevertheless well worthwhile.

url:    www.econ.uiuc.edu/~roger                Roger Koenker
email   rkoenker at uiuc.edu                       Department of Economics
vox:    217-333-4558                            University of Illinois
fax:    217-244-6678                            Champaign, IL 61820
On Apr 28, 2005, at 7:46 AM, Frank E Harrell Jr wrote:

#
On Thu, 28 Apr 2005 08:52:33 -0500 roger koenker wrote:

Added.

One more comment: maybe it's also worth noting that you don't necessarily
have to rank-transform the data. Instead you can also use a permutation
test based on the original observations.
<advertisement>
This approach is implemented in the coin package for conditional
inference.
</advertisement>
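To make the idea concrete without assuming coin is installed, here is a hand-rolled sketch of a permutation test on the raw (untransformed) observations; coin's oneway_test() / independence_test() automate and greatly generalize this, with exact and approximate reference distributions:

```r
# A bare-bones two-sample permutation test on the original
# observations (no rank transform), with made-up data.
perm_test <- function(y, g, nperm = 2000) {
  obs <- diff(tapply(y, g, mean))      # observed difference in group means
  perms <- replicate(nperm, {
    diff(tapply(y, sample(g), mean))   # difference under permuted labels
  })
  mean(abs(perms) >= abs(obs))         # two-sided permutation p-value
}

set.seed(1)
y <- c(rlnorm(20), rlnorm(20, meanlog = 1))   # two hypothetical groups
g <- factor(rep(c("a", "b"), each = 20))
perm_test(y, g)
```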

Z
#
Achim Zeileis wrote:
That deals with type I error but not necessarily type II error.  -Frank

#
Thanks all for your comments and hints. I will try to keep them in mind.
Since a number of people asked me what I'm trying to do: I want to apply
Bayesian inference to a simple ecological model I wrote, and therefore I
need to fit (uniform, normal or lognormal) distributions to sets of observed
data (to derive mean and sd). You probably have noticed that I'm quite new
to statistics, but I'm working on that...

Pieter

----- Original Message -----
From: "Achim Zeileis" <Achim.Zeileis at R-project.org>
To: "roger koenker" <roger at ysidro.econ.uiuc.edu>
Cc: <R-help at stat.math.ethz.ch>
Sent: Thursday, April 28, 2005 4:20 PM
Subject: Re: [R] normality test
http://www.R-project.org/posting-guide.html
#
Below.
Also, qqplots or any other kind of screening for normality can affect the
type I error.

Indeed, one might ask what type I error means in such circumstances. :-)

Indeed, one might ask what hypothesis testing means in such circumstances.

Cheers,
Bert
#
This is false. You do not need to fit anything to "derive mean and sd."
Perhaps you have not clearly stated what you mean.
And you want to use Bayesian methods?! 
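A sketch of the point with made-up data: for a normal or lognormal model, "fitting" the distribution amounts to computing a mean and an sd (of the data, or of its logs):

```r
set.seed(1)
X <- rlnorm(100, meanlog = -2.5, sdlog = 1.2)   # hypothetical observations

c(mean = mean(X), sd = sd(X))                   # normal-model parameters
c(meanlog = mean(log(X)), sdlog = sd(log(X)))   # lognormal-model parameters

# MASS::fitdistr(X, "lognormal") gives essentially the same numbers
# (its sdlog is the ML estimate, dividing by n rather than n - 1).
```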

I would strongly recommend that you seek a competent statistician to work
with. To paraphrase Frank Harrell (with appropriate apologies for
misattribution, if necessary), correspondence courses in brain surgery are
not a good idea.



-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
#
----- Original Message -----
From: "Berton Gunter" <gunter.berton at gene.com>
To: "'Pieter Provoost'" <pieterprovoost at gmail.com>;
<R-help at stat.math.ethz.ch>
Sent: Thursday, April 28, 2005 6:26 PM
Subject: RE: [R] normality test
The Bayesian methods I (will) use are implemented in the modelling
environment I'm using (FEMME). I'm supervised by the person that developed
the environment, and she asked me to fit a normal or lognormal distribution
to the observed data. The parameters of that distribution will then be used
for the Bayesian analysis. So I suppose my supervisor knows very well
what she's doing, even though I don't (well... not yet).

http://www.nioo.knaw.nl/CEMO/FEMME/Index.htm (the Bayesian inference is a
recent addition and therefore not discussed in the manual)

Pieter
#
Bert wrote:
I was always under the impression that it's mostly a matter of mindset
whether you go Bayesian or frequentist, not of your statistical skills.

[...]
And I find this quote a bit disturbing, because a catalyst emerges from
the process unchanged, yet as a statistician I might at least sometimes
have learnt a bit of the subject-matter problem.

And no, I have nothing substantial to contribute any more tonight.

Greetings


Johannes
#
On 28-Apr-05 Pieter Provoost wrote:
And please continue to do so!

Let me try to be constructive. It is clearly established that
the data you posted are far from Normally distributed. The
simple qqnorm plot shows that immediately, and if you need it
the shapiro.test() with "p-value = 8.499e-11" settles it!

Going, however, a bit further, and looking at qqnorm(log(X))
(X being what I call your data series) suggests that it
departs systematically from a pure logNormal at least at the
6 highest values of X. And again, shapiro.test(log(X)) gives

  p-value = 0.00965

which is again a fairly strong indication.

Now, going back to your statement above, that you wrote a
"simple ecological model", I would like to know more about
that before proceeding further.

The rather clear break in slope in qqnorm(log(X)) suggests
to me the possibility that your data may represent a mixture
of two distinct, possibly though not necessarily logNormal,
distributions, one having a much longer upper tail than the
other but being a relatively small proportion (say 1/3).

For example, with X denoting your data, compare

  qqnorm(log(X))

with

  set.seed(52341); Y1 <- exp(rnorm(22, -3.26, 0.69))
  Y2 <- exp(rnorm(10, -1.75, 2.35))
  qqnorm(log(c(Y1, Y2)))

They are not dissimilar (and I have not been trying very hard).

Another thing to look at is simply

  hist(log(X), breaks = 0.5*(-12:4))

This also shows some interesting features: the very high peak
between -3.0 and -2.5 (and possibly an unduly high value between
-3.5 and -3.0), together with a rather thin and widely spread
upper tail above -2.0.

This could be quite consistent with the kind of mixture described
above, or could be due to observer error/bias in measurement.

In any case, it is clear that there is more than a simple
"(uniform, normal or lognormal)" distribution at play here.

In a real investigation, I would at this stage be concerned
to develop a realistic model of how the data are generated.

You do not say what these data represent.

The above was mostly written before you posted your second
email, explaining that

  "The Bayesian methods I (will) use are implemented in the
   modelling environment I'm using (FEMME). I'm supervised
   by the person that developed the environment, and she
   asked me to fit a normal or lognormal distribution to
   the observed data. The parameters of that distribution
   will then be used for the Bayesian analysis. So I suppose
   my supervisor knows very well what she's doing, even
   though I don't (well... not yet)."

One may wonder whether your supervisor has herself
seriously questioned the structure of these data, since what
she is asking you to do seems to presume that the above is
not relevant!

However, a mixture model would fit nicely into a Bayesian
framework, since (from the above) I suspect a simulation
or MCMC procedure will depend on the parameters to be
estimated for the distribution. For the mixture (e.g.
log(X) is a mixture of two normal distributions), you can
estimate the two parameters for each normal distribution
and the proportions p:(1-p) of each. Then, in sampling
from the mixture you first decide on component 1 with
probability p or component 2 with probability q = (1-p),
then sample from the corresponding lognormal distribution.
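The sampling recipe just described can be sketched in R, reusing the illustrative parameter values from earlier in this message (the mixing proportion p and the component parameters are assumptions, not estimates):

```r
# Draw n values from a two-component lognormal mixture:
# component 1 with probability p, component 2 otherwise.
rmix <- function(n, p, meanlog1, sdlog1, meanlog2, sdlog2) {
  comp <- rbinom(n, 1, p)              # 1 = component 1, 0 = component 2
  ifelse(comp == 1,
         rlnorm(n, meanlog1, sdlog1),  # both vectors are drawn; ifelse
         rlnorm(n, meanlog2, sdlog2))  # picks elementwise per component
}

set.seed(52341)
Y <- rmix(32, p = 22/32, -3.26, 0.69, -1.75, 2.35)
qqnorm(log(Y))   # compare with qqnorm(log(X))
```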

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 29-Apr-05                                       Time: 15:41:36
------------------------------ XFMail ------------------------------
#
I looked carefully at ?shapiro.test and I did not see it state
anywhere what the null hypothesis is or what a low p-value means.  I
understand that I can run the example "shapiro.test(rnorm(100, mean =
5, sd = 3))" and deduce from its p-value of 0.0988 that the
null hypothesis must be normality, but why can't the help page
explicitly state what the null hypothesis is?

I also understand that the help pages are not meant to "teach"
statistics, but stating the null hypothesis doesn't seem very
difficult given the already considerable amount of time that probably
went into creating these otherwise very good help pages.  Many people
who use this software took stats classes 10 or more years ago and this
stuff is easily forgotten.  Students frequently have trouble keeping
the null and alternative hypotheses straight.

Just my $0.02.

Thanks,

Roger
On 4/28/05, Romain Francois <francoisromain at free.fr> wrote:
#
On 29-Apr-05 roger bos wrote:
Hi Roger,

Well, the opening line is

  Description:
       Performs the Shapiro-Wilk test for normality.

which does pretty strongly suggest that the hypothesis being
tested by shapiro.test(X) is normality of the distribution of X.

It might be just a shade more unambiguous if it were worded

       Performs the Shapiro-Wilk test of normality

or

       Performs the Shapiro-Wilk test for non-normality.

since testing "for" something, like testing "for" contamination
tends to suggest testing for something exceptional, and testing
"for" contamination could equally be seen as a test "of" purity.
("Excuse me, sir. I just need to test your data for normality.
 And you're in trouble if they are.")

But all that is on the very margin of semantic finesse!
I think the general approach in the help pages assumes that users
understand the basics of what the function is about; the page is
there to specify what is necessary in order to get it to work
correctly.

One can take your point that stating explicitly what the null
hypothesis of a test is would be useful for people who are not
sure about that sort of thing, and would advance their
statistical understanding at the same time as their proficiency
in R.

However, while this might be feasible for simple matters like
the null hypothesis being tested by a simple function like
shapiro.test or t.test (which, by the way, does not even hint
at what the null hypothesis might be: you have to infer it
from the options available for the alternative hypothesis),
it could get out of hand for tests applicable to more complex
situations like ANOVA, mixed models, and so on. There is a
danger, if the hypothesis were to be spelled out, that the
help page might become a small (or not so small) book on that
aspect of statistics.

A better place for such things is in documents like "Introductory
Statistics with R" and so on.

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 29-Apr-05                                       Time: 17:54:19
------------------------------ XFMail ------------------------------