shapiro.test
11 messages · Gonzalo Villarino Pizarro, Rui Barradas, Rolf Turner +4 more
Hello,

Not answering directly to your question, if the sample size is a documented problem with shapiro.test and you want a normality test, why don't you use ?ks.test?

    m <- mean(HP_TrinityK25$V2)
    s <- sd(HP_TrinityK25$V2)
    ks.test(HP_TrinityK25$V2, "pnorm", m, s)

Hope this helps,

Rui Barradas

On 21-02-2014 15:59, Gonzalo Villarino Pizarro wrote:

Dear R users,

Please help with this maybe basic question. I am trying to see if my data is normal, but it is a large file and the test does not work. I keep getting the message: "Error in shapiro.test(x = HP_TrinityK25$V2) : sample size must be between 3 and 5000"

thanks!

    shapiro.test(x=HP_TrinityK25$V2)
    Error in shapiro.test(x = HP_TrinityK25$V2) : sample size must be between 3 and 5000

##Note:
HP_TrinityK25 = my file
HP_TrinityK25$V2 = data in my file
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
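The error in the original question can be reproduced without the poster's file. The sketch below uses simulated data as a stand-in for HP_TrinityK25$V2 (the vector `x` and its length of 6000 are assumptions for illustration only) and shows that shapiro.test() runs once the input has at most 5000 values. Whether testing a subsample is a sensible thing to do is exactly what the rest of the thread debates.

```r
# Simulated stand-in for HP_TrinityK25$V2 (hypothetical data, not the poster's).
set.seed(1)
x <- rnorm(6000)

# shapiro.test() refuses samples outside 3..5000; capture the error message.
msg <- tryCatch(
  {
    shapiro.test(x)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# The test does run on at most 5000 values, e.g. a random subsample.
sub <- sample(x, 5000)
res <- shapiro.test(sub)
print(res$statistic)
```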
Rui,
Note this quote from the last paragraph of the Details section of ?ks.test:
"If a single-sample test is used, the parameters specified in '...'
must be pre-specified and not estimated from the data."
Which is the exact opposite of your example.
Gonzalo,
Why are you testing your data for normality? For large sample sizes
the normality tests often give a meaningful answer to a meaningless
question (for small samples they give a meaningless answer to a
meaningful question).
If you really feel the need for a p-value then
SnowsPenultimateNormalityTest in the TeachingDemos package will work
for large sample sizes. But note that the documentation for that
function is considered more useful than the function itself.
Gregory (Greg) L. Snow Ph.D. 538280 at gmail.com
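The point about ?ks.test can be checked by simulation. The sketch below is only an illustration: the sample size `n`, the number of replicates `n_rep`, and the seed are arbitrary choices. It draws data that really are normal and compares how often ks.test() rejects at the 5% level when the mean and standard deviation are estimated from the same data versus fixed in advance.

```r
set.seed(42)
n_rep <- 2000   # number of simulated data sets (arbitrary choice)
n     <- 50     # sample size per data set (arbitrary choice)

# p-values when mean and sd are estimated from the same data
# (the usage the help page warns against)
p_estimated <- replicate(n_rep, {
  x <- rnorm(n)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# p-values when the parameters are fixed in advance (the documented usage)
p_fixed <- replicate(n_rep, {
  x <- rnorm(n)
  ks.test(x, "pnorm", 0, 1)$p.value
})

# A valid 5% test rejects a true null about 5% of the time.
reject_estimated <- mean(p_estimated < 0.05)
reject_fixed     <- mean(p_fixed < 0.05)
print(c(estimated = reject_estimated, fixed = reject_fixed))
```

With estimated parameters the fitted curve hugs the data too closely, so the rejection rate falls well below the nominal level and the reported p-values cannot be taken at face value.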
On 22/02/14 11:04, Rui Barradas wrote:
Hello,

Not answering directly to your question, if the sample size is a documented problem with shapiro.test and you want a normality test, why don't you use ?ks.test?

    m <- mean(HP_TrinityK25$V2)
    s <- sd(HP_TrinityK25$V2)
    ks.test(HP_TrinityK25$V2, "pnorm", m, s)

Strictly speaking this is not a valid test. The KS test is used for testing against a *completely specified* distribution. If there are parameters to be estimated, the null distribution is no longer applicable. This may not be a "real" problem if the parameters are *well* estimated, as they would be in this instance (given that the sample size is over-large). I'm not sure about this.

The "Lilliefors" test is theoretically available in this context when mu and sigma are estimated, but according to the Wikipedia article, the Lilliefors distribution is not known analytically and the critical values must be determined by Monte Carlo methods. There is a "LillieTest" function in the "DescTools" package which makes use of some approximations to get p-values.

However I think that a better approach would be to use a chi-squared goodness of fit test whereby you can adjust for estimated parameters simply by reducing the degrees of freedom. I believe that the chi-squared test is somewhat low in power, but with a very large sample this should not be a problem.

The difficulty with the chi-squared test is that the choice of "bins" is somewhat arbitrary. I believe the best approach is to take the bin boundaries to be the quantiles of the normal distribution (with parameters "m" and "s") corresponding to equispaced probabilities on [0,1], with the number of such probabilities being k+1 where k = floor(n/5), n being the sample size. This makes the expected counts all equal to n/k >= 5 so that the chi-squared test is "valid". The degrees of freedom are then k-3 (k - 1 - #estimated parameters).

One last comment: I believe that it is generally considered that testing for normality is a waste of time and a pseudo-intellectual exercise of academic interest at best.

cheers,

Rolf Turner
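The Monte Carlo idea behind the Lilliefors test can be sketched in base R. This is a minimal illustration under stated assumptions, not the DescTools implementation: the helper names `lillie_stat` and `lillie_mc`, the default `n_sim = 1000`, and the test data are all hypothetical. Because the statistic does not change when the data are shifted or rescaled, simulating from a standard normal is enough to approximate its null distribution.

```r
# KS distance between the data and a normal whose parameters are
# estimated from the same data.
lillie_stat <- function(x) {
  unname(ks.test(x, "pnorm", mean(x), sd(x))$statistic)
}

# Monte Carlo p-value: compare the observed statistic with statistics
# from normal samples of the same size, each with re-estimated parameters.
lillie_mc <- function(x, n_sim = 1000) {
  d_obs <- lillie_stat(x)
  d_sim <- replicate(n_sim, lillie_stat(rnorm(length(x))))
  # the add-one correction keeps the estimate strictly positive
  (1 + sum(d_sim >= d_obs)) / (n_sim + 1)
}

set.seed(123)
p_normal <- lillie_mc(rnorm(200))   # data that really are normal
p_skewed <- lillie_mc(rexp(200))    # clearly non-normal data
print(c(normal = p_normal, skewed = p_skewed))
```

The cost grows with both the sample size and `n_sim`, so for a very large data set the simulation may take a while.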
On 22/02/14 11:53, Greg Snow wrote:
<SNIP>
Why are you testing your data for normality? For large sample sizes the normality tests often give a meaningful answer to a meaningless question (for small samples they give a meaningless answer to a meaningful question).
<SNIP>

Fortune!!!

cheers,

Rolf
Second!!

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
H. Gilbert Welch
Greg,
I really like that TeachingDemos::SnowsPenultimateNormalityTest()... even the tortuous way to always return a p-value == 0:

    # the following function works for current implementations of R
    # to my knowledge, eventually it may need to be expanded
    is.rational <- function(x){
        rep( TRUE, length(x) )
    }

    tmp.p <- if( any(is.rational(x))) {
        0
    } else {
        # current implementation will not get here if length
        # of x is positive. This part is reserved for the
        # ultimate test
        1
    }

(p.value is then returned as tmp.p). Also, the nice and sexy printing of that p-value in R as:

    p-value < 2.2e-16

which looks much more serious than 'p-value = 0'... Here you have nothing to do. The base::format.pval() function called from stats:::print.htest() already does the job for you!

I am just curious... Are there teachers out there pointing to that test? If yes, what fraction of the students realise what happens? I guess, it is closer to zero than to one, unfortunately. Wait... I need another SnowsPenultimateXxxxTest() here to check the null hypothesis that all my students are doing what they are supposed to do when discovering a new statistical tool!
Best,
Philippe Grosjean
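The printing behaviour described in this message can be seen directly with format.pval(). The sketch below is only an illustration; the `digits = 4` argument is an assumption chosen to mimic what print methods for test results typically pass under default options.

```r
# A p-value of exactly zero, as in the function discussed above.
p <- 0

# format.pval() reports values below eps (default .Machine$double.eps)
# as "< eps" instead of printing them literally.
shown <- format.pval(p, digits = 4)
print(shown)   # "< 2.2e-16"

# .Machine$double.eps is the threshold behind that display.
print(.Machine$double.eps)
```

So the familiar "p-value < 2.2e-16" says only that the value fell below machine epsilon, not how far below.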
Hello,

Replies inline.

On 21-02-2014 23:13, Rolf Turner wrote:
Strictly speaking this is not a valid test. The KS test is used for testing against a *completely specified* distribution. If there are parameters to be estimated, the null distribution is no longer applicable. This may not be a "real" problem if the parameters are *well* estimated, as they would be in this instance (given that the sample size is over-large). I'm not sure about this.
Yes, you're right. I hesitated before posting my answer precisely because of this: the parameters must be pre-determined constants, not computed from the data. As Greg pointed out in his reply, the help page ?ks.test also says so explicitly (which I had missed).

The chi-squared gof test seems to be a good choice, given the sample size.

Rui Barradas
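Rolf's chi-squared recipe (bin boundaries at normal quantiles for equispaced probabilities, k = floor(n/5) bins, k - 3 degrees of freedom) can be sketched as follows. The function name `chisq_normal_gof` and the simulated data are assumptions made for illustration. One caveat worth keeping in mind: when the mean and standard deviation are estimated from the raw rather than the binned data, the k - 3 reference distribution is an approximation.

```r
chisq_normal_gof <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  k <- floor(n / 5)                      # number of bins
  # k + 1 equispaced probabilities on [0, 1] give the bin boundaries,
  # running from -Inf to Inf.
  breaks <- qnorm(seq(0, 1, length.out = k + 1), mean = m, sd = s)
  observed <- as.vector(table(cut(x, breaks = breaks)))
  expected <- n / k                      # same expected count in every bin
  stat <- sum((observed - expected)^2 / expected)
  df <- k - 3                            # k - 1 - 2 estimated parameters
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

set.seed(2014)
res_normal <- chisq_normal_gof(rnorm(6000))   # data that really are normal
res_skewed <- chisq_normal_gof(rexp(6000))    # clearly non-normal data
print(c(normal = res_normal$p.value, skewed = res_skewed$p.value))
```

The function needs at least 20 observations so that the degrees of freedom are positive.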
Second.

Rui Barradas

On 21-02-2014 23:44, Rolf Turner wrote:
On 22/02/14 11:53, Greg Snow wrote:
<SNIP>
Why are you testing your data for normality? For large sample sizes the normality tests often give a meaningful answer to a meaningless question (for small samples they give a meaningless answer to a meaningful question).
<SNIP>
Fortune!!!
cheers, Rolf
2 days later
Philippe, replies inline.

On Sat, Feb 22, 2014 at 12:29 AM, Philippe Grosjean <phgrosjean at sciviews.org> wrote:

Greg, I really like that TeachingDemos::SnowsPenultimateNormalityTest()...

If you like that function then you may appreciate TeachingDemos::SnowsCorrectlySizedButOtherwiseUselessTestOfAnything, which I suspect (but have been too lazy to check) may be the longest exported function name in a CRAN package. I justify the names of these 2 functions using the same logic that suggests short and simple names for functions that you would expect to be used often.

even the tortuous way to always return a p-value == 0:

It turns out (discovered by accident and then brought to my attention) that if you run SnowsPenultimateNormalityTest on a vector of length 0 then it does return a p-value of 1. I have not yet decided if this is a bug or a feature. On one hand it makes sense that a sample of size 0 is perfectly consistent with the assumption that you chose 0 observations from a normal distribution; on the other hand, if it is an integer or double vector of length 0 that would still be information that the numbers (or lack thereof) are rational.

[snip]

I am just curious... Are there teachers out there pointing to that test? If yes, what fraction of the students realise what happens? I guess, it is closer to zero than to one, unfortunately. Wait... I need another SnowsPenultimateXxxxTest() here to check the null hypothesis that all my students are doing what they are supposed to do when discovering a new statistical tool!

I don't know of any teachers pointing to the test; I would want to be careful which class to bring it up in. For some students it could result in an epiphany, others may just blindly use it, and still others may have their heads explode if they have to think too hard about it.

I was originally considering naming the test SnowsAntepenultimateTest to give a little more room for follow-up tests, but at the time I could not remember if it was Ante (before) or Anti (opposite). I learned the word Antepenultimate in terms of pages in a book, where the 3rd to last page (the Antepenultimate page) is directly opposite (Anti-) the Penultimate page.

Just in case that is not confusing enough, the ultimate page of a cheap detective novel is the last page where the hero realizes that since the motive for the murder was to cover up the murderer's embezzlement of the family fortune to pay off his bookie, the hero will not be paid after all and will still need to continue avoiding his loan shark. The penultimate page is the second to last page where, in response to the hero's listing of circumstantial evidence, the murderer conveniently confesses and fills in all the missing details, saving the embarrassment to the hero if he had just lawyer-ed up and been acquitted due to lack of hard evidence. And the antepenultimate page is the 3rd to last where the hero utters the cliche phrase "You are probably wondering why I gathered you all here". I don't know what the 4th to last page would be called (could add another ante-, or in R just use tail(book,4)).
Gregory (Greg) L. Snow Ph.D. 538280 at gmail.com
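The length-0 behaviour follows from how any() treats an empty logical vector. The sketch below reproduces only the excerpt quoted earlier in the thread; the wrapper `p_from_excerpt` is a hypothetical helper for illustration, not the TeachingDemos function itself.

```r
# Copied from the excerpt quoted earlier in the thread.
is.rational <- function(x) {
  rep(TRUE, length(x))
}

# Hypothetical wrapper around the quoted if/else logic.
p_from_excerpt <- function(x) {
  if (any(is.rational(x))) 0 else 1
}

# any() of an empty logical vector is FALSE, so the else branch is taken.
print(any(logical(0)))               # FALSE
print(p_from_excerpt(c(1.2, 3.4)))   # 0
print(p_from_excerpt(numeric(0)))    # 1
```

Swapping any() for all() would return 0 for the empty vector as well, since all(logical(0)) is TRUE; which of the two counts as the bug is the question left open above.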
Greg,
For some authors the 4th page from the back should be the first page.
Not so for you, however.
Clint
Clint Bowman INTERNET: clint at ecy.wa.gov
Air Quality Modeler INTERNET: clint at math.utah.edu
Department of Ecology VOICE: (360) 407-6815
PO Box 47600 FAX: (360) 407-7534
Olympia, WA 98504-7600
USPS: PO Box 47600, Olympia, WA 98504-7600
Parcels: 300 Desmond Drive, Lacey, WA 98503-1274