normality test
15 messages · Romain Francois, Pieter Provoost, Frank E Harrell Jr +6 more
Le 28.04.2005 13:16, Pieter Provoost a écrit :
Hi, I have a small set of data on which I have tried some normality tests. When I make a histogram of the data the distribution doesn't seem to be normal at all (rather lognormal), but still no matter what test I use (Shapiro, Anderson-Darling,...) it returns a very small p value (which as far as I know means that the distribution is normal). Am I doing something wrong here? Thanks Pieter
Hello, it seems you haven't read far enough. The null hypothesis in shapiro.test is **normality**; if your p-value is very small, then the data are **not** normal. Look carefully at ?shapiro.test and try again. Furthermore, normality tests are not very powerful. Consider using ?qqnorm and ?qqline. Romain
~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~ ~~~~~~ Romain FRANCOIS - http://addictedtor.free.fr ~~~~~~ ~~~~ Etudiant ISUP - CS3 - Industrie et Services ~~~~ ~~ http://www.isup.cicrp.jussieu.fr/ ~~ ~~~~ Stagiaire INRIA Futurs - Equipe SELECT ~~~~ ~~~~~~ http://www.inria.fr/recherche/equipes/select.fr.html ~~~~~~ ~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~
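To make Romain's point concrete, here is a minimal sketch (not from the thread; the simulated data are illustrative) showing how shapiro.test and a qqnorm/qqline plot behave on skewed data:

```r
# Simulated lognormal data: skewed, so the Shapiro-Wilk test should
# reject its null hypothesis of normality with a small p-value.
set.seed(1)
x <- rlnorm(50, meanlog = 0, sdlog = 1)

shapiro.test(x)        # small p-value: evidence AGAINST normality
shapiro.test(log(x))   # log(x) is normal, so expect a large p-value

# Graphical check: points should fall near the reference line if normal
qqnorm(x);      qqline(x)
qqnorm(log(x)); qqline(log(x))
```

The first plot should show a strong curve away from the line; the second should track it closely.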
----- Original Message ----- From: "Romain Francois" <francoisromain at free.fr> To: "Pieter Provoost" <pieterprovoost at gmail.com>; "RHELP" <R-help at stat.math.ethz.ch> Sent: Thursday, April 28, 2005 2:03 PM Subject: Re: [R] normality test
Thanks, I thought null hypothesis for these tests was "no normality"... Pieter
Usually (but not always), doing tests of normality reflects a lack of understanding of the power of rank tests, and an assumption of high power for the normality tests (qq plots don't always help with that because of their subjectivity). When possible it's good to choose a robust method. Also, pre-testing for normality can affect the type I error of the overall analysis.
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University
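A small sketch of the rank-test alternative Frank alludes to (the data are simulated for illustration): rather than testing normality and then choosing a test, one can use a rank-based test directly.

```r
# Two skewed (lognormal) samples differing by a shift.
# The t-test leans on approximate normality of the sample means;
# the rank-based Wilcoxon test makes no normality assumption.
set.seed(42)
a <- rlnorm(30)
b <- rlnorm(30) + 1

t.test(a, b)        # relies on distributional assumptions
wilcox.test(a, b)   # rank-based, robust to the skewness
```

With heavily skewed data like these, the Wilcoxon test typically retains good power where the t-test's assumptions are strained.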
For my money, Frank's comment should go into fortunes. It seems a rather Sisyphean battle to keep the lessons of robustness on the statistical table, but nevertheless well worthwhile.

Roger Koenker, Department of Economics, University of Illinois, Champaign, IL 61820
url: www.econ.uiuc.edu/~roger  email: rkoenker at uiuc.edu  vox: 217-333-4558  fax: 217-244-6678
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
On Thu, 28 Apr 2005 08:52:33 -0500 roger koenker wrote:
Added. One more comment: maybe it's also worth noting that you don't necessarily have to rank-transform the data. Instead you can also use a permutation test based on the original observations. <advertisement> This approach is implemented in the coin package for conditional inference. </advertisement> Z
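The coin package provides this properly; as a hand-rolled illustration of the idea (a sketch, not the coin implementation), a two-sample permutation test on the original observations looks like this:

```r
# Two-sample permutation test on the raw (untransformed) observations.
# Test statistic: difference in group means. Its null distribution is
# built by repeatedly reshuffling the group labels.
perm.test <- function(x, y, n.perm = 9999) {
  observed <- mean(x) - mean(y)
  pooled   <- c(x, y)
  n        <- length(x)
  perm.stats <- replicate(n.perm, {
    idx <- sample(length(pooled), n)       # random relabelling
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # two-sided p-value; the observed statistic counts as one permutation
  mean(c(abs(perm.stats), abs(observed)) >= abs(observed))
}

set.seed(1)
perm.test(rnorm(20), rnorm(20, mean = 1))  # small p-value expected
```

The validity of the p-value comes from the exchangeability of the labels under the null, not from any normality assumption.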
Achim Zeileis wrote:
Added. One more comment: maybe it's also worth noting that you don't necessarily have to rank-transform the data. Instead you can also use a permutation test based on the original observations.
That deals with type I error but not necessarily type II error. -Frank
Thanks all for your comments and hints. I will try to keep them in mind. Since a number of people asked me what I'm trying to do: I want to apply Bayesian inference to a simple ecological model I wrote, and therefore I need to fit (uniform, normal or lognormal) distributions to sets of observed data (to derive mean and sd). You probably have noticed that I'm quite new to statistics, but I'm working on that... Pieter
Below.
Usually (but not always) doing tests of normality reflect a lack of understanding of the power of rank tests, and an assumption of high power for the tests (qq plots don't always help with that because of their subjectivity). When possible it's good to choose a robust method. Also, doing pre-testing for normality can affect the type I error of the overall analysis. -- Frank E Harrell Jr
Also, qqplots or any other kind of screening for normality can affect the type I error. Indeed, one might ask what type I error means in such circumstances. :-) Indeed, one might ask what hypothesis testing means in such circumstances. Cheers, Bert
Pieter Provoost wrote: Thanks all for your comments and hints. I will try to keep them in mind. Since a number of people asked me what I'm trying to do: I want to apply Bayesian inference to a simple ecological model I wrote, and therefore I need to fit (uniform, normal or lognormal) distributions to sets of observed data (to derive mean and sd).
This is false. You do not need to fit anything to "derive mean and sd." Perhaps you have not clearly stated what you mean.
You probably have noticed that I'm quite new to statistics, but I'm working on that... Pieter
And you want to use Bayesian methods?! I would strongly recommend that you seek a competent statistician to work with. To paraphrase Frank Harrell (with appropriate apologies for misattribution, if necessary), correspondence courses in brain surgery are not a good idea. -- Bert Gunter Genentech Non-Clinical Statistics South San Francisco, CA "The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
The Bayesian methods I (will) use are implemented in the modelling environment I'm using (FEMME). I'm supervised by the person who developed the environment, and she asked me to fit a normal or lognormal distribution to the observed data. The parameters of that distribution will then be used for the Bayesian analysis. So I suppose my supervisor knows very well what she's doing, even though I don't (well... not yet). http://www.nioo.knaw.nl/CEMO/FEMME/Index.htm (the Bayesian inference is a recent addition and therefore not discussed in the manual) Pieter
Bert wrote:
You probably have noticed that I'm quite new to statistics, but I'm working on that...
And you want to use Bayesian methods?!
I was always under the impression that it's mostly a matter of mindset if you go Bayesian or frequentist, not of your statistical skills. [...]
"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
And I find this quote a bit disturbing because a catalyst leaves the process unchanged, yet as a statistician I might at least sometimes have learnt a bit of the subject matter problem. And no, I have nothing substantial to contribute any more tonight. Greetings Johannes
On 28-Apr-05 Pieter Provoost wrote:
Thanks all for your comments and hints. I will try to keep them in mind. Since a number of people asked me what I'm trying to do: I want to apply Bayesian inference to a simple ecological model I wrote, and therefore I need to fit (uniform, normal or lognormal) distributions to sets of observed data (to derive mean and sd). You probably have noticed that I'm quite new to statistics, but I'm working on that... Pieter
And please continue to do so! Let me try to be constructive.

It is clearly established that the data you posted are far from normally distributed. The simple qqnorm plot shows that immediately, and if you need it, shapiro.test() with "p-value = 8.499e-11" settles it!

Going a bit further, however, and looking at qqnorm(log(X)) (X being what I call your data series) suggests that it departs systematically from a pure logNormal, at least at the 6 highest values of X. And again, shapiro.test(log(X)) gives p-value = 0.00965, which is again a fairly strong indication.

Now, going back to your statement above, that you wrote a "simple ecological model", I would like to know more about that before proceeding further. The rather clear break in slope in qqnorm(log(X)) suggests to me the possibility that your data may represent a mixture of two distinct, possibly though not necessarily logNormal, distributions, one having a much longer upper tail than the other but being a relatively small proportion (say 1/3). For example, with X denoting your data, compare qqnorm(log(X)) with

set.seed(52341)
Y1 <- exp(rnorm(22, -3.26, 0.69))
Y2 <- exp(rnorm(10, -1.75, 2.35))
qqnorm(log(c(Y1, Y2)))

They are not dissimilar (and I have not been trying very hard). Another thing to look at is simply

hist(log(X), breaks = 0.5*(-12:4))

This also shows some interesting features: the very high peak between -3.0 and -2.5 (and possibly an unduly high value between -3.5 and -3.0), together with a rather thin and widely spread upper tail above -2.0. This could be quite consistent with the kind of mixture described above, or could be due to observer error/bias in measurement.

In any case, it is clear that there is more than a simple "(uniform, normal or lognormal)" distribution at play here. In a real investigation, I would at this stage be concerned to develop a realistic model of how the data are generated. You do not say what these data represent.
The above was mostly written before you posted your second email, explaining that "The Bayesian methods I (will) use are implemented in the modelling environment I'm using (FEMME). I'm supervised by the person who developed the environment, and she asked me to fit a normal or lognormal distribution to the observed data. The parameters of that distribution will then be used for the Bayesian analysis. So I suppose my supervisor knows very well what she's doing, even though I don't (well... not yet)."

One may wonder whether your supervisor has herself seriously questioned the structure of these data, since what she is asking you to do seems to presume that the above is not relevant!

However, a mixture model would fit nicely into a Bayesian framework, since (from the above) I suspect a simulation or MCMC procedure will depend on the parameters to be estimated for the distribution. For the mixture (e.g. log(X) is a mixture of two normal distributions), you can estimate the two parameters for each normal distribution and the proportions p:(1-p) of each. Then, in sampling from the mixture, you first decide on component 1 with probability p or component 2 with probability q = (1-p), then sample from the corresponding lognormal distribution.

Best wishes, Ted.

E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 29-Apr-05 Time: 15:41:36
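Ted's sampling recipe can be sketched directly in R. The helper function `rlnorm.mix` below is hypothetical (not from the thread), and the parameter values are the illustrative ones from his example:

```r
# Sample from a two-component lognormal mixture: with probability p
# draw from component 1, otherwise from component 2.
rlnorm.mix <- function(n, p, meanlog1, sdlog1, meanlog2, sdlog2) {
  from1 <- runif(n) < p            # component indicator for each draw
  # ifelse() selects elementwise; both candidate vectors have length n
  ifelse(from1,
         rlnorm(n, meanlog1, sdlog1),
         rlnorm(n, meanlog2, sdlog2))
}

# Roughly 22 of 32 points from the narrow component, 10 from the
# wide-tailed one, as in Ted's Y1/Y2 construction.
set.seed(52341)
Y <- rlnorm.mix(32, p = 22/32, -3.26, 0.69, -1.75, 2.35)
qqnorm(log(Y)); qqline(log(Y))
```

The resulting qqnorm plot of log(Y) should show the same kind of break in slope that Ted describes in the posted data.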
I looked carefully at ?shapiro.test and I did not see it state anywhere what the null hypothesis is or what a low p-value means. I understand that I can run the example "shapiro.test(rnorm(100, mean = 5, sd = 3))" and deduce from its p-value of 0.0988 that the null hypothesis must be normality, but why can't the help page explicitly state what the null hypothesis is? I also understand that the help pages are not meant to "teach" statistics, but stating the null hypothesis doesn't seem very difficult given the already considerable amount of time that probably went into creating these otherwise very good help pages. Many people who use this software took stats classes 10 or more years ago and this stuff is easily forgotten. Students frequently have trouble keeping the null and alternative hypotheses straight. Just my $0.02. Thanks, Roger
On 29-Apr-05 roger bos wrote:
I looked carefully at ?shapiro.test and I did not see it state anywhere what the null hypothesis is or what a low p-value means. I understand that I can run the example "shapiro.test(rnorm(100, mean = 5, sd = 3))" and deduce from its p-value of 0.0988 that the null-hypothesis must be normality, but why can't the help page explicitly state what the null hypothesis is.
Hi Roger,
Well, the opening line is
Description:
Performs the Shapiro-Wilk test for normality.
which does pretty strongly suggest that the hypothesis being
tested by shapiro.test(X) is normality of the distribution of X.
It might be just a shade more unambiguous if it were worded
Performs the Shapiro-Wilk test of normality
or
Performs the Shapiro-Wilk test for non-normality.
since testing "for" something, like testing "for" contamination
tends to suggest testing for something exceptional, and testing
"for" contamination could equally be seen as a test "of" purity.
("Excuse me, sir. I just need to test your data for normality.
And you're in trouble if they are.")
But all that is on the very margin of semantic finesse!
I also understand that the help pages are not meant to "teach" statistics, but stating the null hypothesis doesn't seem very difficult given the already considerable amount of time that probably went into creating these otherwise very good help pages. Many people who use this software took stats classes 10 or more years ago and this stuff is easily forgotten. Students frequently have trouble keeping the null and alternative hypothesis straight. Just my $0.02.
I think there's a general approach in the help pages that assumes users understand the basics of what the function is about; the page is there to specify what is necessary in order to get it to work correctly. One can take your point about stating explicitly what the null hypothesis of a test is: it would be useful for people who are not sure about that sort of thing, and would advance their statistical understanding at the same time as their proficiency in R.

However, while this might be feasible for simple matters like the null hypothesis tested by a simple function like shapiro.test or t.test (which, by the way, does not even hint at what the null hypothesis might be: you have to infer it from the options available for the alternative hypothesis), it could get out of hand for tests applicable to more complex situations like ANOVA, mixed models, and so on. There is a danger, if the hypothesis were to be spelled out, that the help page might become a small (or not so small) book on that aspect of statistics. A better place for such things is in documents like "Introductory Statistics with R" and so on.

Best wishes, Ted.
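As a postscript to the null-hypothesis discussion, a quick simulation (not from the thread; the sample sizes are illustrative) makes the direction of the test concrete: when the null of normality is true, the p-values are roughly uniform on [0, 1]; under a skewed alternative they pile up near 0.

```r
# Shapiro-Wilk p-values under the null (normal data) and under a
# lognormal alternative, over 1000 simulated samples of size 50.
set.seed(1)
p.null <- replicate(1000, shapiro.test(rnorm(50))$p.value)
p.alt  <- replicate(1000, shapiro.test(rlnorm(50))$p.value)

mean(p.null < 0.05)   # should be close to the nominal level 0.05
mean(p.alt  < 0.05)   # should be close to 1: the test rejects skewed data

hist(p.null)          # roughly flat under the null
hist(p.alt)           # piled up near zero under the alternative
```

Small p-value, reject normality; large p-value, no evidence against it — exactly the reading Romain gave at the top of the thread.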