
normality test

15 messages · Romain Francois, Pieter Provoost, Frank E Harrell Jr +6 more

#
On 28.04.2005 13:16, Pieter Provoost wrote:
Hello,

You seem to have misunderstood. The null hypothesis in shapiro.test is 
**normality**; if your p-value is very small, then the data are **not** 
normal.

Look carefully at ?shapiro.test and try again. Furthermore, normality 
tests are not very powerful. Consider using ?qqnorm and ?qqline instead.
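For concreteness, here is a small sketch of the suggestion, using made-up data in place of the original poster's series:

```r
# Hypothetical data standing in for Pieter's series: lognormal,
# so clearly non-normal on the raw scale.
set.seed(1)
x <- rlnorm(50, meanlog = -2, sdlog = 1)

shapiro.test(x)       # very small p-value: reject the null of normality
shapiro.test(log(x))  # compare: log(x) is normal by construction

# The graphical check, often more informative than the test itself:
qqnorm(log(x)); qqline(log(x))
```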

Romain
#
----- Original Message -----
From: "Romain Francois" <francoisromain at free.fr>
To: "Pieter Provoost" <pieterprovoost at gmail.com>; "RHELP"
<R-help at stat.math.ethz.ch>
Sent: Thursday, April 28, 2005 2:03 PM
Subject: Re: [R] normality test
When I make a histogram of the data the distribution doesn't seem to be
normal at all (rather lognormal), but still no matter what test I use
(Shapiro, Anderson-Darling,...) it returns a very small p value (which as
far as I know means that the distribution is normal).
Thanks, I thought null hypothesis for these tests was "no normality"...

Pieter
#
Romain Francois wrote:
Usually (but not always) doing tests of normality reflects a lack of 
understanding of the power of rank tests, and an assumption of high 
power for the normality tests (qq plots don't always help with that 
because of their subjectivity).  When possible it's good to choose a 
robust method.  Also, doing pre-testing for normality can affect the 
type I error of the overall analysis.
#
For my money, Frank's comment should go into fortunes.  It seems a
rather Sisyphean battle to keep the lessons of robustness on the
statistical table, but it is nevertheless well worthwhile.

url:    www.econ.uiuc.edu/~roger                Roger Koenker
email   rkoenker at uiuc.edu                       Department of Economics
vox:    217-333-4558                            University of Illinois
fax:    217-244-6678                            Champaign, IL 61820
On Apr 28, 2005, at 7:46 AM, Frank E Harrell Jr wrote:

#
On Thu, 28 Apr 2005 08:52:33 -0500 roger koenker wrote:

Added.

One more comment: maybe it's also worth noting that you don't necessarily
have to rank-transform the data. Instead you can also use a permutation
test based on the original observations.
<advertisement>
This approach is implemented in the coin package for conditional
inference.
</advertisement>
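To make the idea concrete without assuming coin is installed, here is a hand-rolled sketch of a permutation test on the raw (untransformed) observations; coin's oneway_test() / independence_test() automate and greatly generalize this, with exact and approximate reference distributions:

```r
# A bare-bones two-sample permutation test on the original
# observations (no rank transform), with made-up data.
perm_test <- function(y, g, nperm = 2000) {
  obs <- diff(tapply(y, g, mean))      # observed difference in group means
  perms <- replicate(nperm, {
    diff(tapply(y, sample(g), mean))   # difference under permuted labels
  })
  mean(abs(perms) >= abs(obs))         # two-sided permutation p-value
}

set.seed(1)
y <- c(rlnorm(20), rlnorm(20, meanlog = 1))   # two hypothetical groups
g <- factor(rep(c("a", "b"), each = 20))
perm_test(y, g)
```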

Z
#
Achim Zeileis wrote:
That deals with type I error but not necessarily type II error.  -Frank

#
Thanks all for your comments and hints. I will try to keep them in mind.
Since a number of people asked me what I'm trying to do: I want to apply
Bayesian inference to a simple ecological model I wrote, and therefore I
need to fit (uniform, normal or lognormal) distributions to sets of observed
data (to derive mean and sd). You probably have noticed that I'm quite new
to statistics, but I'm working on that...

Pieter

----- Original Message -----
From: "Achim Zeileis" <Achim.Zeileis at R-project.org>
To: "roger koenker" <roger at ysidro.econ.uiuc.edu>
Cc: <R-help at stat.math.ethz.ch>
Sent: Thursday, April 28, 2005 4:20 PM
Subject: Re: [R] normality test
http://www.R-project.org/posting-guide.html
#
Below.
Also, qqplots or any other kind of screening for normality can affect the
type I error.

Indeed, one might ask what type I error means in such circumstances. :-)

Indeed, one might ask what hypothesis testing means in such circumstances.

Cheers,
Bert
#
This is false. You do not need to fit anything to "derive mean and sd."
Perhaps you have not clearly stated what you mean.
And you want to use Bayesian methods?! 
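A sketch of the point with made-up data: for a normal or lognormal model, "fitting" the distribution amounts to computing a mean and an sd (of the data, or of its logs):

```r
set.seed(1)
X <- rlnorm(100, meanlog = -2.5, sdlog = 1.2)   # hypothetical observations

c(mean = mean(X), sd = sd(X))                   # normal-model parameters
c(meanlog = mean(log(X)), sdlog = sd(log(X)))   # lognormal-model parameters

# MASS::fitdistr(X, "lognormal") gives essentially the same numbers
# (its sdlog is the ML estimate, dividing by n rather than n - 1).
```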

I would strongly recommend that you seek a competent statistician to work
with. To paraphrase Frank Harrell (with appropriate apologies for
misattribution, if necessary), correspondence courses in brain surgery are
not a good idea.



-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
#
----- Original Message -----
From: "Berton Gunter" <gunter.berton at gene.com>
To: "'Pieter Provoost'" <pieterprovoost at gmail.com>;
<R-help at stat.math.ethz.ch>
Sent: Thursday, April 28, 2005 6:26 PM
Subject: RE: [R] normality test
The Bayesian methods I (will) use are implemented in the modelling
environment I'm using (FEMME). I'm supervised by the person that developed
the environment, and she asked me to fit a normal or lognormal distribution
to the observed data. The parameters of that distribution will then be used
for the Bayesian analysis. So I suppose my supervisor knows very well
what she's doing, even though I don't (well... not yet).

http://www.nioo.knaw.nl/CEMO/FEMME/Index.htm (the Bayesian inference is a
recent addition and therefore not discussed in the manual)

Pieter
#
Bert wrote:
I was always under the impression that it's mostly a matter of mindset
whether you go Bayesian or frequentist, not of your statistical skills.

[...]
And I find this quote a bit disturbing, because a catalyst emerges from
the process unchanged, yet as a statistician I might at least sometimes
have learnt a bit of the subject-matter problem.

And no, I have nothing substantial to contribute any more tonight.

Greetings


Johannes
#
On 28-Apr-05 Pieter Provoost wrote:
And please continue to do so!

Let me try to be constructive. It is clearly established that
the data you posted are far from Normally distributed. The
simple qqnorm plot shows that immediately, and if you need it
the shapiro.test() with "p-value = 8.499e-11" settles it!

Going, however, a bit further, and looking at qqnorm(log(X))
(X being what I call your data series) suggests that it
departs systematically from a pure logNormal at least at the
6 highest values of X. And again, shapiro.test(log(X)) gives

  p-value = 0.00965

which is again a fairly strong indication.

Now, going back to your statement above, that you wrote a
"simple ecological model", I would like to know more about
that before proceeding further.

The rather clear break in slope in qqnorm(log(X)) suggests
to me the possibility that your data may represent a mixture
of two distinct, possibly though not necessarily logNormal,
distributions, one having a much longer upper tail than the
other but being a relatively small proportion (say 1/3).

For example, with X denoting your data, compare

  qqnorm(log(X))

with

  set.seed(52341); Y1 <- exp(rnorm(22, -3.26, 0.69))
  Y2 <- exp(rnorm(10, -1.75, 2.35))
  qqnorm(log(c(Y1, Y2)))

They are not dissimilar (and I have not been trying very hard).

Another thing to look at is simply

  hist(log(X), breaks = 0.5*(-12:4))

This also shows some interesting features: the very high peak
between -3.0 and -2.5 (and possibly an unduly high value between
-3.5 and -3.0), together with a rather thin and widely spread
upper tail above -2.0.

This could be quite consistent with the kind of mixture described
above, or could be due to observer error/bias in measurement.

In any case, it is clear that there is more than a simple
"(uniform, normal or lognormal)" distribution at play here.

In a real investigation, I would at this stage be concerned
to develop a realistic model of how the data are generated.

You do not say what these data represent.

The above was mostly written before you posted your second
email, explaining that

  "The Bayesian methods I (will) use are implemented in the
   modelling environment I'm using (FEMME). I'm supervised
   by the person that developed the environment, and she
   asked me to fit a normal or lognormal distribution to
   the observed data. The parameters of that distribution
   will then be used for the Bayesian analysis. So I suppose
   my supervisor knows very well what she's doing, even
   though I don't (well... not yet)."

One may wonder whether your supervisor has herself
seriously questioned the structure of these data, since what
she is asking you to do seems to presume that the above is
not relevant!

However, a mixture model would fit nicely into a Bayesian
framework, since (from the above) I suspect a simulation
or MCMC procedure will depend on the parameters to be
estimated for the distribution. For the mixture (e.g.
log(X) is a mixture of two normal distributions), you can
estimate the two parameters for each normal distribution
and the proportions p:(1-p) of each. Then, in sampling
from the mixture you first decide on component 1 with
probability p or component 2 with probability q = (1-p),
then sample from the corresponding lognormal distribution.
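The sampling recipe just described can be sketched in R, reusing the illustrative parameter values from earlier in this message (the mixing proportion p and the component parameters are assumptions, not estimates):

```r
# Draw n values from a two-component lognormal mixture:
# component 1 with probability p, component 2 otherwise.
rmix <- function(n, p, meanlog1, sdlog1, meanlog2, sdlog2) {
  comp <- rbinom(n, 1, p)              # 1 = component 1, 0 = component 2
  ifelse(comp == 1,
         rlnorm(n, meanlog1, sdlog1),  # both vectors are drawn; ifelse
         rlnorm(n, meanlog2, sdlog2))  # picks elementwise per component
}

set.seed(52341)
Y <- rmix(32, p = 22/32, -3.26, 0.69, -1.75, 2.35)
qqnorm(log(Y))   # compare with qqnorm(log(X))
```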

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 29-Apr-05                                       Time: 15:41:36
------------------------------ XFMail ------------------------------
#
I looked carefully at ?shapiro.test and I did not see it state
anywhere what the null hypothesis is or what a low p-value means.  I
understand that I can run the example "shapiro.test(rnorm(100, mean =
5, sd = 3))" and deduce from its p-value of 0.0988 that the
null hypothesis must be normality, but why can't the help page
explicitly state what the null hypothesis is?

I also understand that the help pages are not meant to "teach"
statistics, but stating the null hypothesis doesn't seem very
difficult given the already considerable amount of time that probably
went into creating these otherwise very good help pages.  Many people
who use this software took stats classes 10 or more years ago and this
stuff is easily forgotten.  Students frequently have trouble keeping
the null and alternative hypotheses straight.

Just my $0.02.

Thanks,

Roger
On 4/28/05, Romain Francois <francoisromain at free.fr> wrote:
#
On 29-Apr-05 roger bos wrote:
Hi Roger,

Well, the opening line is

  Description:
       Performs the Shapiro-Wilk test for normality.

which does pretty strongly suggest that the hypothesis being
tested by shapiro.test(X) is normality of the distribution of X.

It might be just a shade more unambiguous if it were worded

       Performs the Shapiro-Wilk test of normality

or

       Performs the Shapiro-Wilk test for non-normality.

since testing "for" something, like testing "for" contamination
tends to suggest testing for something exceptional, and testing
"for" contamination could equally be seen as a test "of" purity.
("Excuse me, sir. I just need to test your data for normality.
 And you're in trouble if they are.")

But all that is on the very margin of semantic finesse!
I think the general approach in the help pages assumes that users
understand the basics of what the function is about; the page is
there to specify what is necessary in order to get it to work
correctly.

One can take your point that stating explicitly what the null
hypothesis of a test is would be useful for people who are not
sure about that sort of thing, and would advance their
statistical understanding at the same time as their proficiency
in R.

However, while this might be feasible for simple matters like
the null hypothesis being tested by a simple function like
shapiro.test or t.test (which, by the way, does not even hint
at what the null hypothesis might be: you have to infer it
from the options available for the alternative hypothesis),
it could get out of hand for tests applicable to more complex
situations like ANOVA, mixed models, and so on. There is a
danger, if the hypothesis were to be spelled out, that the
help page might become a small (or not so small) book on that
aspect of statistics.

A better place for such things is in documents like "Introductory
Statistics with R" and so on.

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 29-Apr-05                                       Time: 17:54:19
------------------------------ XFMail ------------------------------