shapiro.test

Hello,

This does not answer your question directly, but if the sample size limit
is a documented restriction of shapiro.test and you want a normality test,
why don't you use ks.test (see ?ks.test)?

m <- mean(HP_TrinityK25$V2)
s <- sd(HP_TrinityK25$V2)

ks.test(HP_TrinityK25$V2, "pnorm", m, s)

Hope this helps,

Rui Barradas

On 21-02-2014 15:59, Gonzalo Villarino Pizarro wrote:
Dear R users,
Please help with this maybe basic question. I am trying to see if my
data is normal, but it is a large file and the test does not work.
I keep getting the message : "Error in shapiro.test(x = HP_TrinityK25$V2)
:  sample size must be between 3 and 5000"
thanks!

  shapiro.test(x=HP_TrinityK25$V2)
Error in shapiro.test(x = HP_TrinityK25$V2) : sample size must be between 3
and 5000

##Note:
HP_TrinityK25= my file
HP_TrinityK25$V2= data in my file
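
The error can be reproduced without the original file. The sketch below is only an illustration: the vector `x` and its length of 6000 are made-up stand-ins for HP_TrinityK25$V2, which is not available here.

```r
# Simulated stand-in for HP_TrinityK25$V2 (assumption: the real data are numeric).
set.seed(1)
x <- rnorm(6000)

# shapiro.test() stops for samples with more than 5000 observations,
# so catch the error and keep its message instead of aborting the script.
msg <- tryCatch(
  {
    shapiro.test(x)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# The same call runs once the sample is within the documented limits.
ok <- shapiro.test(x[1:5000])
print(ok$p.value)
```

Truncating the vector is shown only to confirm where the limit lies, not as a recommended way to analyse the full data set.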


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Rui,

Note this quote from the last paragraph of the Details section of ?ks.test:

"If a single-sample test is used, the parameters specified in '...'
     must be pre-specified and not estimated from the data."

Which is the exact opposite of your example.

Gonzalo,

Why are you testing your data for normality?  For large sample sizes
the normality tests often give a meaningful answer to a meaningless
question (for small samples they give a meaningless answer to a
meaningful question).

If you really feel the need for a p-value then
SnowsPenultimateNormalityTest in the TeachingDemos package will work
for large sample sizes.  But note that the documentation for that
function is considered more useful than the function itself.

Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com
On 22/02/14 11:04, Rui Barradas wrote:

<SNIP>

m <- mean(HP_TrinityK25$V2)
s <- sd(HP_TrinityK25$V2)

ks.test(HP_TrinityK25$V2, "pnorm", m, s)

Strictly speaking this is not a valid test.  The KS test is used for 
testing against a *completely specified* distribution.  If there are 
parameters to be estimated, the null distribution is no longer 
applicable.  This may not be a "real" problem if the parameters are 
*well* estimated, as they would be in this instance (given that the 
sample size is over-large).  I'm not sure about this.
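
The effect of estimating the parameters can be checked by simulation. The following is a rough sketch, not a definitive study: the number of replicates (`n_rep`) and the sample size (`n_obs`) are arbitrary choices, and the data are simulated standard normal values, so the null hypothesis is true in both runs.

```r
set.seed(42)
n_rep <- 2000   # number of simulated samples (arbitrary choice)
n_obs <- 200    # size of each sample (arbitrary choice)

# p-values when the null parameters are estimated from the same sample
p_estimated <- replicate(n_rep, {
  x <- rnorm(n_obs)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# p-values when the null distribution is fully specified in advance
p_specified <- replicate(n_rep, {
  x <- rnorm(n_obs)
  ks.test(x, "pnorm", 0, 1)$p.value
})

# A valid test rejects a true null about 5% of the time at the 0.05 level.
reject_estimated <- mean(p_estimated < 0.05)
reject_specified <- mean(p_specified < 0.05)
print(c(estimated = reject_estimated, specified = reject_specified))
```

With estimated parameters the fitted curve is pulled toward the data, so the rejection rate should fall well below the nominal level, which is why the standard KS null distribution does not apply.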

The "Lilliefors" test is theoretically available in this context when
mu and sigma are estimated, but according to the Wikipedia article, the 
Lilliefors distribution is not known analytically and the critical 
values must be determined by Monte Carlo methods.  There is a 
"LillieTest" function in the "DescTools" package which makes use of some 
approximations to get p-values.
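
The Monte Carlo idea can be sketched in base R. This is not the DescTools implementation; `ks_stat_fitted` and `lilliefors_mc` are hypothetical helper names, and `n_sim = 500` is an arbitrary choice.

```r
# KS distance between a sample and the normal distribution fitted to that same sample
ks_stat_fitted <- function(x) {
  n <- length(x)
  z <- pnorm(sort(x), mean = mean(x), sd = sd(x))
  max(seq_len(n) / n - z, z - (seq_len(n) - 1) / n)
}

# Hypothetical helper: Monte Carlo p-value for the statistic above.
lilliefors_mc <- function(x, n_sim = 500) {
  observed <- ks_stat_fitted(x)
  # The statistic does not change under shifts or rescaling of the data,
  # so standard normal samples are enough to simulate its null distribution.
  simulated <- replicate(n_sim, ks_stat_fitted(rnorm(length(x))))
  p_value <- (1 + sum(simulated >= observed)) / (n_sim + 1)
  list(statistic = observed, p.value = p_value)
}

set.seed(123)
res_normal <- lilliefors_mc(rnorm(500, mean = 10, sd = 3))
res_skewed <- lilliefors_mc(rexp(500))
print(res_normal$p.value)
print(res_skewed$p.value)
```

The smallest p-value this sketch can return is 1 / (n_sim + 1), so a larger `n_sim` is needed if finer resolution matters.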

However I think that a better approach would be to use a chi-squared 
goodness of fit test whereby you can adjust for estimated parameters 
simply by reducing the degrees of freedom.  I believe that the 
chi-squared test is somewhat low in power, but with a very large sample 
this should not be a problem.

The difficulty with the chi-squared test is that the choice of "bins" is 
somewhat arbitrary.  I believe the best approach is to take the bin 
boundaries to be the quantiles of the normal distribution (with 
parameters "m" and "s") corresponding to equispaced probabilities on 
[0,1], with the number of such probabilities being k+1 where
k = floor(n/5), n being the sample size.  This makes the expected counts 
all equal to n/k >= 5 so that the chi-squared test is "valid".  The 
degrees of freedom are then k-3 (k - 1 - #estimated parameters).
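
The recipe above could be coded roughly as follows. This is a sketch under the assumptions stated in the paragraph (equiprobable bins, k = floor(n/5), k - 3 degrees of freedom); `chisq_normal_gof` is a hypothetical helper name, and the test data are simulated.

```r
# Hypothetical helper implementing the binning recipe described above.
chisq_normal_gof <- function(x, k = floor(length(x) / 5)) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  # k + 1 boundaries at equispaced probabilities 0, 1/k, ..., 1;
  # the two outer boundaries are -Inf and Inf.
  breaks <- c(-Inf, qnorm(seq_len(k - 1) / k, mean = m, sd = s), Inf)
  observed <- as.vector(table(cut(x, breaks = breaks)))
  expected <- n / k                      # identical for every bin
  statistic <- sum((observed - expected)^2 / expected)
  df <- k - 3                            # k - 1 - 2 estimated parameters
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p.value = p_value,
       n_binned = sum(observed))
}

set.seed(2014)
res_norm <- chisq_normal_gof(rnorm(10000, mean = 5, sd = 2))
res_exp  <- chisq_normal_gof(rexp(10000))
print(res_norm[c("df", "p.value")])
print(res_exp[c("df", "p.value")])
```

Passing a smaller `k` explicitly gives fewer, larger bins; the default simply follows the k = floor(n/5) suggestion.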

One last comment:  I believe that it is generally considered that 
testing for normality is a waste of time and a pseudo-intellectual 
exercise of academic interest at best.

cheers,

Rolf Turner

On 22/02/14 11:53, Greg Snow wrote:

<SNIP>
Why are you testing your data for normality?  For large sample sizes
the normality tests often give a meaningful answer to a meaningless
question (for small samples they give a meaningless answer to a
meaningful question).
<SNIP>

Fortune!!!

cheers,

Rolf
Second!!

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
H. Gilbert Welch

Greg,

I really like that TeachingDemos::SnowsPenultimateNormalityTest()... even the tortuous way to always return a p-value == 0:

# the following function works for current implementations of R
# to my knowledge, eventually it may need to be expanded
is.rational <- function(x){
    rep( TRUE, length(x) )
}

tmp.p <- if( any(is.rational(x))) {
     0
} else {
     # current implementation will not get here if length
     # of x is positive.  This part is reserved for the
     # ultimate test
     1
}

(p.value is then returned as tmp.p). Also, the nice and sexy printing of that p-value in R as:

p-value < 2.2e-16

which looks much more serious than 'p-value = 0'... Here you have nothing to do. The stats::format.pval() function called from stats:::print.htest() already does the job for you!
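
The formatting step can be seen in isolation with a short sketch. The `digits = 4` argument is an assumption chosen for illustration, not necessarily the exact value print.htest() passes.

```r
# A p-value of exactly zero is below the default threshold .Machine$double.eps,
# so format.pval() reports it as "less than eps" instead of printing 0.
p <- 0
fp <- format.pval(p, digits = 4)
print(fp)
is_below_eps <- grepl("^<", fp)

# Values above the threshold are formatted as ordinary numbers.
fp_ordinary <- format.pval(0.0312, digits = 4)
print(fp_ordinary)
```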

I am just curious... Are there teachers out there pointing to that test? If yes, what fraction of the students realise what happens? I guess, it is closer to zero than to one, unfortunately. Wait... I need another SnowsPenultimateXxxxTest() here to check the null hypothesis that all my students are doing what they are supposed to do when discovering a new statistical tool!

Best,

Philippe Grosjean


Hello,

Inline

On 21-02-2014 23:13, Rolf Turner wrote:

<SNIP>

Strictly speaking this is not a valid test.  The KS test is used for
testing against a *completely specified* distribution.  If there are
parameters to be estimated, the null distribution is no longer
applicable.  This may not be a "real" problem if the parameters are
*well* estimated, as they would be in this instance (given that the
sample size is over-large).  I'm not sure about this.

Yes, you're right. I hesitated before posting my answer precisely
because of this: the parameters must be pre-determined constants, not
computed from the data. As Greg pointed out in his reply, the help
page ?ks.test also says so explicitly (which I had missed).

The chi-squared gof test seems to be a good choice, given the sample size.

Rui Barradas

Second.

Rui Barradas

Philippe,

replies inline

On Sat, Feb 22, 2014 at 12:29 AM, Philippe Grosjean
<phgrosjean at sciviews.org> wrote:

Greg,

I really like that TeachingDemos::SnowsPenultimateNormalityTest()...

If you like that function then you may appreciate
TeachingDemos::SnowsCorrectlySizedButOtherwiseUselessTestOfAnything,
which I suspect (but have been too lazy to check) may be the longest
exported function name in a CRAN package.  I justify the names of
these 2 functions using the same logic that suggests short and simple
names for functions that you would expect to be used often.

even the tortuous way to always return a p-value == 0:

It turns out (discovered by accident and then brought to my attention)
that if you run SnowsPenultimateNormalityTest on a vector of length 0
then it does return a p-value of 1.  I have not yet decided if this is
a bug or a feature.  On one hand it makes sense that a sample of size
0 is perfectly consistent with the assumption that you chose 0
observations from a normal distribution, on the other hand, if it is
an integer or double vector of length 0 that would still be
information that the numbers (or lack thereof) are rational.
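
The length-0 behaviour follows from how any() treats an empty logical vector. The sketch below reuses the is.rational() excerpt quoted earlier in the thread; `penultimate_p` is a hypothetical wrapper written for illustration, not the package function itself.

```r
# Copied from the excerpt quoted in the thread.
is.rational <- function(x) {
  rep(TRUE, length(x))
}

# Hypothetical wrapper around the quoted if/else logic.
penultimate_p <- function(x) {
  if (any(is.rational(x))) 0 else 1
}

# any() of an empty logical vector is FALSE, so the else branch is taken.
empty_any  <- any(logical(0))
p_nonempty <- penultimate_p(rnorm(10))
p_empty    <- penultimate_p(numeric(0))
print(c(empty_any = empty_any, p_nonempty = p_nonempty, p_empty = p_empty))
```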

[snip]
I am just curious... Are there teachers out there pointing to that test? If yes, what fraction of the students realise what happens? I guess, it is closer to zero than to one, unfortunately. Wait... I need another SnowsPenultimateXxxxTest() here to check the null hypothesis that all my students are doing what they are supposed to do when discovering a new statistical tool!

I don't know of any teachers pointing to the test; I would want to be
careful which class to bring it up in.  For some students it could
result in an epiphany, others may just blindly use it, and still
others may have their heads explode if they have to think too hard
about it.

I was originally considering naming the test SnowsAntepenultimateTest
to give a little more room for follow-up tests, but at the time I
could not remember if it was Ante (before) or Anti (opposite).  I
learned the word Antepenultimate in terms of pages in a book, where
the 3rd to last page (the Antepenultimate page) is directly opposite
(Anti-) the Penultimate page.

Just in case that is not confusing enough, the ultimate page of a
cheap detective novel is the last page where the hero realizes that
since the motive for the murder was to cover up the murderer's
embezzlement of the family fortune to pay off his bookie, the hero
will not be paid after all and will still need to continue avoiding
his loan shark.  The penultimate page is the second to last page where
in response to the hero's listing of circumstantial evidence the
murderer conveniently confesses and fills in all the missing details
saving the embarrassment to the hero if he had just lawyer-ed up and
been acquitted due to lack of hard evidence.  And the antepenultimate
page is the 3rd to last where the hero utters the cliche phrase "You
are probably wondering why I gathered you all here".  I don't know
what the 4th to last page would be called (could add another ante-, or
in R just use tail(book,4)).
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com
Greg,

For some authors the 4th page from the back should be the first page.

Not so for you, however.

Clint

Clint Bowman			INTERNET:	clint at ecy.wa.gov
Air Quality Modeler		INTERNET:	clint at math.utah.edu
Department of Ecology		VOICE:		(360) 407-6815
PO Box 47600			FAX:		(360) 407-7534
Olympia, WA 98504-7600

         USPS:           PO Box 47600, Olympia, WA 98504-7600
         Parcels:        300 Desmond Drive, Lacey, WA 98503-1274
