I've been attempting to understand the one-sample run test for
randomness. I've found run.test{tseries} and run.test{lawstat}. Both
use a large sample approximation for distribution of the total number
of runs in a sample of n1 observations of one type and n2 observations
of another type.
I've been unable to find R code to generate the exact distribution and
would like to see how this could be done (not homework).
For example, given the data:
dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -12, -9,
6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
The Monte Carlo permutation approach seems to get me part way.
# calculate the number of runs in the data vector
nruns <- function(x) {
signs <- sign(x)
runs <- rle(signs)
r <- length(runs$lengths)
return(r)
}
MC.runs <- function(x, nperm) {
RUNS <- numeric(nperm)
for (i in 1:nperm) {
RUNS[i] <- nruns(sample(x))
}
cdf <- cumsum(table(RUNS))/nperm
return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
}
MC.runs(dtemp, 100000)
Thanks. --Dale
Code find exact distribution for runs test?
8 messages · Greg Snow, Dale Steele, Peter Ehlers
I am not an expert in this area, but here are some thoughts that may get you started towards an answer. First, there are 2 ways (possibly more) that can lead to the data for a runs test that lead to different theoretical distributions: 1. We have a true or hypothesized value of the median that we subtracted from the data, therefore each value has 50% probability of being positive/negative and all are independent of each other (assuming being exactly equal to the median is impossible or discarded). 2. We have subtracted the sample median from each sample value (and discarded any 0's) leaving us with exactly half positive and half negative and not having independence. In case 1, the 1st observation will always start a run. The second observation has a 50% chance of being the same sign (F) or different sign (S), with the probability being 0.5 for each new observation and them all being independent (under assumption of random selections from population with known/hypothesized median) and the number of runs equaling the number of S's, this looks like a binomial to me (with some '-1's inserted in appropriate places. In case 2, this looks like a hypergeometric distribution, there would be n!/((n/2)!(n/2)!) possible permutations, just need to compute how many of those permutations result in x runs to get the probability. There is probably a way to think about this in terms of balls and urns, but I have not worked that out yet. Hope this helps,
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Dale Steele
> Sent: Wednesday, February 10, 2010 6:16 PM
> To: R-help at r-project.org
> Subject: [R] Code find exact distribution for runs test?
>
> I've been attempting to understand the one-sample run test for
> randomness. I've found run.test{tseries} and run.test{lawstat}. Both
> use a large sample approximation for distribution of the total number
> of runs in a sample of n1 observations of one type and n2 observations
> of another type.
>
> I've been unable to find R code to generate the exact distribution and
> would like to see how this could be done (not homework).
>
> For example, given the data:
>
> dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -12, -9,
> 6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
>
> The Monte Carlo permutation approach seems to get me part way.
>
>
> # calculate the number of runs in the data vector
> nruns <- function(x) {
> signs <- sign(x)
> runs <- rle(signs)
> r <- length(runs$lengths)
> return(r)
> }
>
> MC.runs <- function(x, nperm) {
> RUNS <- numeric(nperm)
> for (i in 1:nperm) {
> RUNS[i] <- nruns(sample(x))
> }
> cdf <- cumsum(table(RUNS))/nperm
> return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
> }
>
> MC.runs(dtemp, 100000)
>
> Thanks. --Dale
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.
OK, I think I have it worked out for both cases:
library(vcd)
druns <- function(x, n) { # x runs in n data points, not vectorized
# based on sample median
if( n%%2 ) stop('n must be even')
if( x %% 2 ) { # odd x
choose( n/2 - 1, n/2-(x+1)/2 ) * choose( n/2 - 1, n/2-(x-1)/2 )/
choose(n,n/2) * 2
} else { # even x
choose( n/2 - 1, n/2-x/2 ) * choose( n/2 - 1, n/2-x/2 )/
choose(n,n/2) * 2
}
}
## population median
out1 <- replicate( 100000, {x <- rnorm(10); length(rle(sign(x))$lengths) } )
rootogram( table(out1), dbinom(0:9, size=9, prob=0.5)*100000 )
chisq.test( table(out1), p=dbinom(0:9, size=9, prob=0.5) )
## sample median
out2 <- replicate( 100000, {x <- rnorm(10); x <- x - median(x); length(rle(sign(x))$lengths) } )
fit <- sapply( 2:10, druns, n=10 )
rootogram( table(out2), fit * 100000 )
chisq.test( table(out2), p=fit )
The tests only fail to prove me wrong, not a firm proof that I am correct. But given the simulation size I am at least pretty close. I can see why people worked out approximations before we had good computers, I would not want to calculate the above by hand.
Enjoy,
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Greg Snow
> Sent: Thursday, February 11, 2010 12:19 PM
> To: Dale Steele; R-help at r-project.org
> Subject: Re: [R] Code find exact distribution for runs test?
>
> I am not an expert in this area, but here are some thoughts that may
> get you started towards an answer.
>
> First, there are 2 ways (possibly more) that can lead to the data for a
> runs test that lead to different theoretical distributions:
>
> 1. We have a true or hypothesized value of the median that we
> subtracted from the data, therefore each value has 50% probability of
> being positive/negative and all are independent of each other (assuming
> being exactly equal to the median is impossible or discarded).
>
> 2. We have subtracted the sample median from each sample value (and
> discarded any 0's) leaving us with exactly half positive and half
> negative and not having independence.
>
> In case 1, the 1st observation will always start a run. The second
> observation has a 50% chance of being the same sign (F) or different
> sign (S), with the probability being 0.5 for each new observation and
> them all being independent (under assumption of random selections from
> population with known/hypothesized median) and the number of runs
> equaling the number of S's, this looks like a binomial to me (with some
> '-1's inserted in appropriate places.
>
> In case 2, this looks like a hypergeometric distribution, there would
> be n!/((n/2)!(n/2)!) possible permutations, just need to compute how
> many of those permutations result in x runs to get the probability.
> There is probably a way to think about this in terms of balls and urns,
> but I have not worked that out yet.
>
> Hope this helps,
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> > project.org] On Behalf Of Dale Steele
> > Sent: Wednesday, February 10, 2010 6:16 PM
> > To: R-help at r-project.org
> > Subject: [R] Code find exact distribution for runs test?
> >
> > I've been attempting to understand the one-sample run test for
> > randomness. I've found run.test{tseries} and run.test{lawstat}.
> Both
> > use a large sample approximation for distribution of the total number
> > of runs in a sample of n1 observations of one type and n2
> observations
> > of another type.
> >
> > I've been unable to find R code to generate the exact distribution
> and
> > would like to see how this could be done (not homework).
> >
> > For example, given the data:
> >
> > dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -12, -
> 9,
> > 6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
> >
> > The Monte Carlo permutation approach seems to get me part way.
> >
> >
> > # calculate the number of runs in the data vector
> > nruns <- function(x) {
> > signs <- sign(x)
> > runs <- rle(signs)
> > r <- length(runs$lengths)
> > return(r)
> > }
> >
> > MC.runs <- function(x, nperm) {
> > RUNS <- numeric(nperm)
> > for (i in 1:nperm) {
> > RUNS[i] <- nruns(sample(x))
> > }
> > cdf <- cumsum(table(RUNS))/nperm
> > return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
> > }
> >
> > MC.runs(dtemp, 100000)
> >
> > Thanks. --Dale
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-
> > guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.
Thanks for this! My original query was probably unclear. I think you
have answered a different, possibly more interesting question. My
goal was to find an exact distribution of a total number of runs R in
samples of two types of size (n1, n2) under the null hypothesis of
randomness.
The horribly inefficient code below generates results which match
Table 10 in Wackerly, Mendehall and Scheaffer for the distribution of
the total number of runs R in samples of size (n1, n2); P(R <= r),
under the null hypothesis of randomness. Horribly inefficient because
couldn't figure out how to generate only the distinguishable
permutations of my sample vector. Hope this brute force approach
illustrates what I am after.
nruns <- function(x) {
signs <- sign(x)
runs <- rle(signs)
r <- length(runs$lengths)
return(r)
}
library(combinat)
# for example (n1=3, n2=4)
n1 <- 3; n2 <- 4; n <- n1+n2
vect <- rep( c(-1,1), c(3,4))
vect
# ! 'nruns(vect)' generates factorial(7) permutations, only
choose(7,3) are distinguishable
exp.r <- table(unlist(permn(vect, nruns )))
cumsum(dist/factorial(7))
# Result agrees with Table 10, pg 814 (n1=3, n2=4)
#in Wackerly, Mendenhall and Scheaffer, 2002, presumably done using
# exact calculations.
Thanks. --Dale
On Thu, Feb 11, 2010 at 6:08 PM, Greg Snow <Greg.Snow at imail.org> wrote:
OK, I think I have it worked out for both cases:
library(vcd)
druns <- function(x, n) { # x runs in n data points, not vectorized
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# based on sample median
? ? ? ?if( n%%2 ) stop('n must be even')
? ? ? ?if( x %% 2 ) { # odd x
? ? ? ? ? ? ? ?choose( n/2 - 1, n/2-(x+1)/2 ) * choose( n/2 - 1, n/2-(x-1)/2 )/
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?choose(n,n/2) * 2
? ? ? ?} else { ? ? ? ? ? ? ? ?# even x
? ? ? ? ? ? ? ?choose( n/2 - 1, n/2-x/2 ) * choose( n/2 - 1, n/2-x/2 )/
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?choose(n,n/2) * 2
? ? ? ?}
}
## population median
out1 <- replicate( 100000, {x <- rnorm(10); length(rle(sign(x))$lengths) } )
rootogram( table(out1), dbinom(0:9, size=9, prob=0.5)*100000 )
chisq.test( table(out1), p=dbinom(0:9, size=9, prob=0.5) )
## sample median
out2 <- replicate( 100000, {x <- rnorm(10); x <- x - median(x); length(rle(sign(x))$lengths) } )
fit <- sapply( 2:10, druns, n=10 )
rootogram( table(out2), fit * 100000 )
chisq.test( table(out2), p=fit )
The tests only fail to prove me wrong, not a firm proof that I am correct. ?But given the simulation size I am at least pretty close. ?I can see why people worked out approximations before we had good computers, I would not want to calculate the above by hand.
Enjoy,
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Greg Snow Sent: Thursday, February 11, 2010 12:19 PM To: Dale Steele; R-help at r-project.org Subject: Re: [R] Code find exact distribution for runs test? I am not an expert in this area, but here are some thoughts that may get you started towards an answer. First, there are 2 ways (possibly more) that can lead to the data for a runs test that lead to different theoretical distributions: 1. We have a true or hypothesized value of the median that we subtracted from the data, therefore each value has 50% probability of being positive/negative and all are independent of each other (assuming being exactly equal to the median is impossible or discarded). 2. We have subtracted the sample median from each sample value (and discarded any 0's) leaving us with exactly half positive and half negative and not having independence. In case 1, the 1st observation will always start a run. ?The second observation has a 50% chance of being the same sign (F) or different sign (S), with the probability being 0.5 for each new observation and them all being independent (under assumption of random selections from population with known/hypothesized median) and the number of runs equaling the number of S's, this looks like a binomial to me (with some '-1's inserted in appropriate places. In case 2, this looks like a hypergeometric distribution, there would be n!/((n/2)!(n/2)!) possible permutations, just need to compute how many of those permutations result in x runs to get the probability. There is probably a way to think about this in terms of balls and urns, but I have not worked that out yet. Hope this helps, -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.snow at imail.org 801.408.8111
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Dale Steele
Sent: Wednesday, February 10, 2010 6:16 PM
To: R-help at r-project.org
Subject: [R] Code find exact distribution for runs test?
I've been attempting to understand the one-sample run test for
randomness. ?I've found run.test{tseries} and run.test{lawstat}.
Both
use a large sample approximation for distribution of the total number of runs in a sample of n1 observations of one type and n2
observations
of another type. I've been unable to find R code to generate the exact distribution
and
would like to see how this could be done (not homework). For example, given the data: dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -12, -
9,
6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
The Monte Carlo permutation approach seems to get me part way.
# calculate the number of runs in the data vector
nruns <- function(x) {
? ? signs <- sign(x)
? ? runs <- rle(signs)
? ? r <- length(runs$lengths)
? ? return(r)
}
MC.runs <- function(x, nperm) {
RUNS <- numeric(nperm)
for (i in ?1:nperm) {
? ? RUNS[i] <- nruns(sample(x))
}
? ? cdf <- cumsum(table(RUNS))/nperm
? ? return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
}
MC.runs(dtemp, 100000)
Thanks. ?--Dale
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Try this:
druns2 <- function(x, n1, n2) {
if( x %% 2 ) { # odd x
choose( n1 - 1, n1 - (x+1)/2 ) * choose( n2 - 1, n2 - (x-1)/2 ) / choose( n1+n2, n1 ) +
choose( n2 - 1, n2 - (x+1)/2 ) * choose( n1 - 1, n1 - (x-1)/2 ) / choose( n1+n2, n2 )
} else { # even x
choose( n1 - 1, n1 - x/2 ) * choose( n2 - 1, n2 - x/2 ) / choose( n1 + n2, n1 ) +
choose( n2 - 1, n2 - x/2 ) * choose( n1 - 1, n1 - x/2 ) / choose( n1 + n2, n2 )
}
}
exp.2 <- sapply( 2:7, druns2, n1=3, n2=4 )
exp.2 - exp.r/factorial(7)
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
> -----Original Message-----
> From: Dale Steele [mailto:dale.w.steele at gmail.com]
> Sent: Thursday, February 11, 2010 5:28 PM
> To: Greg Snow
> Cc: R-help at r-project.org
> Subject: Re: [R] Code find exact distribution for runs test?
>
> Thanks for this! My original query was probably unclear. I think you
> have answered a different, possibly more interesting question. My
> goal was to find an exact distribution of a total number of runs R in
> samples of two types of size (n1, n2) under the null hypothesis of
> randomness.
>
> The horribly inefficient code below generates results which match
> Table 10 in Wackerly, Mendehall and Scheaffer for the distribution of
> the total number of runs R in samples of size (n1, n2); P(R <= r),
> under the null hypothesis of randomness. Horribly inefficient because
> couldn't figure out how to generate only the distinguishable
> permutations of my sample vector. Hope this brute force approach
> illustrates what I am after.
>
>
> nruns <- function(x) {
> signs <- sign(x)
> runs <- rle(signs)
> r <- length(runs$lengths)
> return(r)
> }
>
> library(combinat)
> # for example (n1=3, n2=4)
> n1 <- 3; n2 <- 4; n <- n1+n2
> vect <- rep( c(-1,1), c(3,4))
> vect
>
> # ! 'nruns(vect)' generates factorial(7) permutations, only
> choose(7,3) are distinguishable
>
> exp.r <- table(unlist(permn(vect, nruns )))
> cumsum(dist/factorial(7))
>
> # Result agrees with Table 10, pg 814 (n1=3, n2=4)
> #in Wackerly, Mendenhall and Scheaffer, 2002, presumably done using
> # exact calculations.
>
> Thanks. --Dale
>
> On Thu, Feb 11, 2010 at 6:08 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> > OK, I think I have it worked out for both cases:
> >
> > library(vcd)
> >
> > druns <- function(x, n) { # x runs in n data points, not vectorized
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# based on sample
> median
> >
> > ? ? ? ?if( n%%2 ) stop('n must be even')
> >
> > ? ? ? ?if( x %% 2 ) { # odd x
> > ? ? ? ? ? ? ? ?choose( n/2 - 1, n/2-(x+1)/2 ) * choose( n/2 - 1, n/2-
> (x-1)/2 )/
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?choose(n,n/2) * 2
> > ? ? ? ?} else { ? ? ? ? ? ? ? ?# even x
> > ? ? ? ? ? ? ? ?choose( n/2 - 1, n/2-x/2 ) * choose( n/2 - 1, n/2-x/2
> )/
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?choose(n,n/2) * 2
> > ? ? ? ?}
> > }
> >
> > ## population median
> > out1 <- replicate( 100000, {x <- rnorm(10);
> length(rle(sign(x))$lengths) } )
> >
> > rootogram( table(out1), dbinom(0:9, size=9, prob=0.5)*100000 )
> > chisq.test( table(out1), p=dbinom(0:9, size=9, prob=0.5) )
> >
> >
> > ## sample median
> > out2 <- replicate( 100000, {x <- rnorm(10); x <- x - median(x);
> length(rle(sign(x))$lengths) } )
> >
> > fit <- sapply( 2:10, druns, n=10 )
> >
> > rootogram( table(out2), fit * 100000 )
> > chisq.test( table(out2), p=fit )
> >
> >
> > The tests only fail to prove me wrong, not a firm proof that I am
> correct. ?But given the simulation size I am at least pretty close. ?I
> can see why people worked out approximations before we had good
> computers, I would not want to calculate the above by hand.
> >
> > Enjoy,
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at imail.org
> > 801.408.8111
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> >> project.org] On Behalf Of Greg Snow
> >> Sent: Thursday, February 11, 2010 12:19 PM
> >> To: Dale Steele; R-help at r-project.org
> >> Subject: Re: [R] Code find exact distribution for runs test?
> >>
> >> I am not an expert in this area, but here are some thoughts that may
> >> get you started towards an answer.
> >>
> >> First, there are 2 ways (possibly more) that can lead to the data
> for a
> >> runs test that lead to different theoretical distributions:
> >>
> >> 1. We have a true or hypothesized value of the median that we
> >> subtracted from the data, therefore each value has 50% probability
> of
> >> being positive/negative and all are independent of each other
> (assuming
> >> being exactly equal to the median is impossible or discarded).
> >>
> >> 2. We have subtracted the sample median from each sample value (and
> >> discarded any 0's) leaving us with exactly half positive and half
> >> negative and not having independence.
> >>
> >> In case 1, the 1st observation will always start a run. ?The second
> >> observation has a 50% chance of being the same sign (F) or different
> >> sign (S), with the probability being 0.5 for each new observation
> and
> >> them all being independent (under assumption of random selections
> from
> >> population with known/hypothesized median) and the number of runs
> >> equaling the number of S's, this looks like a binomial to me (with
> some
> >> '-1's inserted in appropriate places.
> >>
> >> In case 2, this looks like a hypergeometric distribution, there
> would
> >> be n!/((n/2)!(n/2)!) possible permutations, just need to compute how
> >> many of those permutations result in x runs to get the probability.
> >> There is probably a way to think about this in terms of balls and
> urns,
> >> but I have not worked that out yet.
> >>
> >> Hope this helps,
> >>
> >> --
> >> Gregory (Greg) L. Snow Ph.D.
> >> Statistical Data Center
> >> Intermountain Healthcare
> >> greg.snow at imail.org
> >> 801.408.8111
> >>
> >>
> >> > -----Original Message-----
> >> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> >> > project.org] On Behalf Of Dale Steele
> >> > Sent: Wednesday, February 10, 2010 6:16 PM
> >> > To: R-help at r-project.org
> >> > Subject: [R] Code find exact distribution for runs test?
> >> >
> >> > I've been attempting to understand the one-sample run test for
> >> > randomness. ?I've found run.test{tseries} and run.test{lawstat}.
> >> Both
> >> > use a large sample approximation for distribution of the total
> number
> >> > of runs in a sample of n1 observations of one type and n2
> >> observations
> >> > of another type.
> >> >
> >> > I've been unable to find R code to generate the exact distribution
> >> and
> >> > would like to see how this could be done (not homework).
> >> >
> >> > For example, given the data:
> >> >
> >> > dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -12,
> -
> >> 9,
> >> > 6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
> >> >
> >> > The Monte Carlo permutation approach seems to get me part way.
> >> >
> >> >
> >> > # calculate the number of runs in the data vector
> >> > nruns <- function(x) {
> >> > ? ? signs <- sign(x)
> >> > ? ? runs <- rle(signs)
> >> > ? ? r <- length(runs$lengths)
> >> > ? ? return(r)
> >> > }
> >> >
> >> > MC.runs <- function(x, nperm) {
> >> > RUNS <- numeric(nperm)
> >> > for (i in ?1:nperm) {
> >> > ? ? RUNS[i] <- nruns(sample(x))
> >> > }
> >> > ? ? cdf <- cumsum(table(RUNS))/nperm
> >> > ? ? return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
> >> > }
> >> >
> >> > MC.runs(dtemp, 100000)
> >> >
> >> > Thanks. ?--Dale
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide http://www.R-project.org/posting-
> >> > guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-
> >> guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
Wow! Your reply amply demonstrates the power of understanding the math (or finding someone who does) before turning on the computer. One last R question...how could I efficiently enumerate all distinguishable permutations of a vector of signs? vect <- c( -1, -1, 1, 1, 1) permn(vect) #length: factorial(length(vect)) ???? #length: choose(n, n1) Best Regards. --Dale
On Thu, Feb 11, 2010 at 10:04 PM, Greg Snow <Greg.Snow at imail.org> wrote:
Try this:
druns2 <- function(x, n1, n2) {
? ? ? ?if( x %% 2 ) { # odd x
? ? ? ? ? ? ? ?choose( n1 - 1, n1 - (x+1)/2 ) * choose( n2 - 1, n2 - (x-1)/2 ) / choose( n1+n2, n1 ) +
? ? ? ? ? ? ? ?choose( n2 - 1, n2 - (x+1)/2 ) * choose( n1 - 1, n1 - (x-1)/2 ) / choose( n1+n2, n2 )
? ? ? ?} else { # even x
? ? ? ? ? ? ? ?choose( n1 - 1, n1 - x/2 ) * choose( n2 - 1, n2 - x/2 ) / choose( n1 + n2, n1 ) +
? ? ? ? ? ? ? ?choose( n2 - 1, n2 - x/2 ) * choose( n1 - 1, n1 - x/2 ) / choose( n1 + n2, n2 )
? ? ? ?}
}
exp.2 <- sapply( 2:7, druns2, n1=3, n2=4 )
exp.2 - exp.r/factorial(7)
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
-----Original Message-----
From: Dale Steele [mailto:dale.w.steele at gmail.com]
Sent: Thursday, February 11, 2010 5:28 PM
To: Greg Snow
Cc: R-help at r-project.org
Subject: Re: [R] Code find exact distribution for runs test?
Thanks for this! ?My original query was probably unclear. ?I think you
have answered a different, possibly more interesting question. ?My
goal was to find an exact distribution of a total number of runs R in
samples of two types of size (n1, n2) under the null hypothesis of
randomness.
The horribly inefficient code below generates results which match
Table 10 in Wackerly, Mendehall and Scheaffer for the distribution of
the total number of runs R in samples of ?size (n1, n2); P(R <= r),
under the null hypothesis of randomness. ?Horribly inefficient because
couldn't figure out how to generate only the distinguishable
permutations of my sample vector. Hope this brute force approach
illustrates what I am after.
nruns <- function(x) {
? ? signs <- sign(x)
? ? runs <- rle(signs)
? ? r <- length(runs$lengths)
? ? return(r)
}
library(combinat)
# for example (n1=3, n2=4)
n1 <- 3; ?n2 <- 4; n <- n1+n2
vect <- rep( c(-1,1), c(3,4))
vect
# ! 'nruns(vect)' generates factorial(7) permutations, only
choose(7,3) are distinguishable
exp.r <- table(unlist(permn(vect, nruns )))
cumsum(dist/factorial(7))
# Result agrees with Table 10, pg 814 (n1=3, n2=4)
#in Wackerly, Mendenhall and Scheaffer, 2002, presumably done using
# exact calculations.
Thanks. ?--Dale
On Thu, Feb 11, 2010 at 6:08 PM, Greg Snow <Greg.Snow at imail.org> wrote:
OK, I think I have it worked out for both cases:
library(vcd)
druns <- function(x, n) { # x runs in n data points, not vectorized
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# based on sample
median
? ? ? ?if( n%%2 ) stop('n must be even')
? ? ? ?if( x %% 2 ) { # odd x
? ? ? ? ? ? ? ?choose( n/2 - 1, n/2-(x+1)/2 ) * choose( n/2 - 1, n/2-
(x-1)/2 )/
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?choose(n,n/2) * 2
? ? ? ?} else { ? ? ? ? ? ? ? ?# even x
? ? ? ? ? ? ? ?choose( n/2 - 1, n/2-x/2 ) * choose( n/2 - 1, n/2-x/2
)/
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?choose(n,n/2) * 2
? ? ? ?}
}
## population median
out1 <- replicate( 100000, {x <- rnorm(10);
length(rle(sign(x))$lengths) } )
rootogram( table(out1), dbinom(0:9, size=9, prob=0.5)*100000 )
chisq.test( table(out1), p=dbinom(0:9, size=9, prob=0.5) )
## sample median
out2 <- replicate( 100000, {x <- rnorm(10); x <- x - median(x);
length(rle(sign(x))$lengths) } )
fit <- sapply( 2:10, druns, n=10 ) rootogram( table(out2), fit * 100000 ) chisq.test( table(out2), p=fit ) The tests only fail to prove me wrong, not a firm proof that I am
correct. ?But given the simulation size I am at least pretty close. ?I can see why people worked out approximations before we had good computers, I would not want to calculate the above by hand.
Enjoy, -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.snow at imail.org 801.408.8111
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Greg Snow Sent: Thursday, February 11, 2010 12:19 PM To: Dale Steele; R-help at r-project.org Subject: Re: [R] Code find exact distribution for runs test? I am not an expert in this area, but here are some thoughts that may get you started towards an answer. First, there are 2 ways (possibly more) that can lead to the data
for a
runs test that lead to different theoretical distributions: 1. We have a true or hypothesized value of the median that we subtracted from the data, therefore each value has 50% probability
of
being positive/negative and all are independent of each other
(assuming
being exactly equal to the median is impossible or discarded). 2. We have subtracted the sample median from each sample value (and discarded any 0's) leaving us with exactly half positive and half negative and not having independence. In case 1, the 1st observation will always start a run. ?The second observation has a 50% chance of being the same sign (F) or different sign (S), with the probability being 0.5 for each new observation
and
them all being independent (under assumption of random selections
from
population with known/hypothesized median) and the number of runs equaling the number of S's, this looks like a binomial to me (with
some
'-1's inserted in appropriate places. In case 2, this looks like a hypergeometric distribution, there
would
be n!/((n/2)!(n/2)!) possible permutations, just need to compute how many of those permutations result in x runs to get the probability. There is probably a way to think about this in terms of balls and
urns,
but I have not worked that out yet. Hope this helps, -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.snow at imail.org 801.408.8111
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Dale Steele
Sent: Wednesday, February 10, 2010 6:16 PM
To: R-help at r-project.org
Subject: [R] Code find exact distribution for runs test?
I've been attempting to understand the one-sample run test for
randomness. ?I've found run.test{tseries} and run.test{lawstat}.
Both
use a large sample approximation for distribution of the total
number
of runs in a sample of n1 observations of one type and n2
observations
of another type. I've been unable to find R code to generate the exact distribution
and
would like to see how this could be done (not homework). For example, given the data: dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -12,
-
9,
6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
The Monte Carlo permutation approach seems to get me part way.
# calculate the number of runs in the data vector
nruns <- function(x) {
? ? signs <- sign(x)
? ? runs <- rle(signs)
? ? r <- length(runs$lengths)
? ? return(r)
}
MC.runs <- function(x, nperm) {
RUNS <- numeric(nperm)
for (i in ?1:nperm) {
? ? RUNS[i] <- nruns(sample(x))
}
? ? cdf <- cumsum(table(RUNS))/nperm
? ? return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
}
MC.runs(dtemp, 100000)
Thanks. ?--Dale
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Here is one quick way using the combinat package:
library(combinat)
tmpfun <- function(x) {
+ tmp <- rep(1,5) + tmp[x] <- -1 + tmp + }
combn(5,2, tmpfun)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] -1 -1 -1 -1 1 1 1 1 1 1 [2,] -1 1 1 1 -1 -1 -1 1 1 1 [3,] 1 -1 1 1 -1 1 1 -1 -1 1 [4,] 1 1 -1 1 1 -1 1 -1 1 -1 [5,] 1 1 1 -1 1 1 -1 1 -1 -1
Of course in this case the tmpfun function needs to be rewritten for each vector size, so is not generalizable.
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
> -----Original Message-----
> From: Dale Steele [mailto:dale.w.steele at gmail.com]
> Sent: Friday, February 12, 2010 6:34 AM
> To: Greg Snow
> Cc: R-help at r-project.org
> Subject: Re: [R] Code find exact distribution for runs test?
>
> Wow! Your reply amply demonstrates the power of understanding the
> math (or finding someone who does) before turning on the computer.
>
> One last R question...how could I efficiently enumerate all
> distinguishable permutations of a vector of signs?
>
> vect <- c( -1, -1, 1, 1, 1)
>
> permn(vect) #length: factorial(length(vect))
> ???? #length: choose(n, n1)
>
> Best Regards. --Dale
>
>
> On Thu, Feb 11, 2010 at 10:04 PM, Greg Snow <Greg.Snow at imail.org>
> wrote:
> > Try this:
> >
> > druns2 <- function(x, n1, n2) {
> >
> > ? ? ? ?if( x %% 2 ) { # odd x
> > ? ? ? ? ? ? ? ?choose( n1 - 1, n1 - (x+1)/2 ) * choose( n2 - 1, n2 -
> (x-1)/2 ) / choose( n1+n2, n1 ) +
> > ? ? ? ? ? ? ? ?choose( n2 - 1, n2 - (x+1)/2 ) * choose( n1 - 1, n1 -
> (x-1)/2 ) / choose( n1+n2, n2 )
> > ? ? ? ?} else { # even x
> > ? ? ? ? ? ? ? ?choose( n1 - 1, n1 - x/2 ) * choose( n2 - 1, n2 - x/2
> ) / choose( n1 + n2, n1 ) +
> > ? ? ? ? ? ? ? ?choose( n2 - 1, n2 - x/2 ) * choose( n1 - 1, n1 - x/2
> ) / choose( n1 + n2, n2 )
> > ? ? ? ?}
> > }
> >
> > exp.2 <- sapply( 2:7, druns2, n1=3, n2=4 )
> > exp.2 - exp.r/factorial(7)
> >
> >
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at imail.org
> > 801.408.8111
> >
> >
> >> -----Original Message-----
> >> From: Dale Steele [mailto:dale.w.steele at gmail.com]
> >> Sent: Thursday, February 11, 2010 5:28 PM
> >> To: Greg Snow
> >> Cc: R-help at r-project.org
> >> Subject: Re: [R] Code find exact distribution for runs test?
> >>
> >> Thanks for this! ?My original query was probably unclear. ?I think
> you
> >> have answered a different, possibly more interesting question. ?My
> >> goal was to find an exact distribution of a total number of runs R
> in
> >> samples of two types of size (n1, n2) under the null hypothesis of
> >> randomness.
> >>
> >> The horribly inefficient code below generates results which match
> >> Table 10 in Wackerly, Mendehall and Scheaffer for the distribution
> of
> >> the total number of runs R in samples of ?size (n1, n2); P(R <= r),
> >> under the null hypothesis of randomness. ?Horribly inefficient
> because
> >> couldn't figure out how to generate only the distinguishable
> >> permutations of my sample vector. Hope this brute force approach
> >> illustrates what I am after.
> >>
> >>
> >> nruns <- function(x) {
> >> ? ? signs <- sign(x)
> >> ? ? runs <- rle(signs)
> >> ? ? r <- length(runs$lengths)
> >> ? ? return(r)
> >> }
> >>
> >> library(combinat)
> >> # for example (n1=3, n2=4)
> >> n1 <- 3; ?n2 <- 4; n <- n1+n2
> >> vect <- rep( c(-1,1), c(3,4))
> >> vect
> >>
> >> # ! 'nruns(vect)' generates factorial(7) permutations, only
> >> choose(7,3) are distinguishable
> >>
> >> exp.r <- table(unlist(permn(vect, nruns )))
> >> cumsum(dist/factorial(7))
> >>
> >> # Result agrees with Table 10, pg 814 (n1=3, n2=4)
> >> #in Wackerly, Mendenhall and Scheaffer, 2002, presumably done using
> >> # exact calculations.
> >>
> >> Thanks. ?--Dale
> >>
> >> On Thu, Feb 11, 2010 at 6:08 PM, Greg Snow <Greg.Snow at imail.org>
> wrote:
> >> > OK, I think I have it worked out for both cases:
> >> >
> >> > library(vcd)
> >> >
> >> > druns <- function(x, n) { # x runs in n data points, not
> vectorized
> >> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?# based on sample
> >> median
> >> >
> >> > ? ? ? ?if( n%%2 ) stop('n must be even')
> >> >
> >> > ? ? ? ?if( x %% 2 ) { # odd x
> >> > ? ? ? ? ? ? ? ?choose( n/2 - 1, n/2-(x+1)/2 ) * choose( n/2 - 1,
> n/2-
> >> (x-1)/2 )/
> >> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?choose(n,n/2) * 2
> >> > ? ? ? ?} else { ? ? ? ? ? ? ? ?# even x
> >> > ? ? ? ? ? ? ? ?choose( n/2 - 1, n/2-x/2 ) * choose( n/2 - 1, n/2-
> x/2
> >> )/
> >> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?choose(n,n/2) * 2
> >> > ? ? ? ?}
> >> > }
> >> >
> >> > ## population median
> >> > out1 <- replicate( 100000, {x <- rnorm(10);
> >> length(rle(sign(x))$lengths) } )
> >> >
> >> > rootogram( table(out1), dbinom(0:9, size=9, prob=0.5)*100000 )
> >> > chisq.test( table(out1), p=dbinom(0:9, size=9, prob=0.5) )
> >> >
> >> >
> >> > ## sample median
> >> > out2 <- replicate( 100000, {x <- rnorm(10); x <- x - median(x);
> >> length(rle(sign(x))$lengths) } )
> >> >
> >> > fit <- sapply( 2:10, druns, n=10 )
> >> >
> >> > rootogram( table(out2), fit * 100000 )
> >> > chisq.test( table(out2), p=fit )
> >> >
> >> >
> >> > The tests only fail to prove me wrong, not a firm proof that I am
> >> correct. ?But given the simulation size I am at least pretty close.
> ?I
> >> can see why people worked out approximations before we had good
> >> computers, I would not want to calculate the above by hand.
> >> >
> >> > Enjoy,
> >> >
> >> > --
> >> > Gregory (Greg) L. Snow Ph.D.
> >> > Statistical Data Center
> >> > Intermountain Healthcare
> >> > greg.snow at imail.org
> >> > 801.408.8111
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> >> >> project.org] On Behalf Of Greg Snow
> >> >> Sent: Thursday, February 11, 2010 12:19 PM
> >> >> To: Dale Steele; R-help at r-project.org
> >> >> Subject: Re: [R] Code find exact distribution for runs test?
> >> >>
> >> >> I am not an expert in this area, but here are some thoughts that
> may
> >> >> get you started towards an answer.
> >> >>
> >> >> First, there are 2 ways (possibly more) that can lead to the data
> >> for a
> >> >> runs test that lead to different theoretical distributions:
> >> >>
> >> >> 1. We have a true or hypothesized value of the median that we
> >> >> subtracted from the data, therefore each value has 50%
> probability
> >> of
> >> >> being positive/negative and all are independent of each other
> >> (assuming
> >> >> being exactly equal to the median is impossible or discarded).
> >> >>
> >> >> 2. We have subtracted the sample median from each sample value
> (and
> >> >> discarded any 0's) leaving us with exactly half positive and half
> >> >> negative and not having independence.
> >> >>
> >> >> In case 1, the 1st observation will always start a run. ?The
> second
> >> >> observation has a 50% chance of being the same sign (F) or
> different
> >> >> sign (S), with the probability being 0.5 for each new observation
> >> and
> >> >> them all being independent (under assumption of random selections
> >> from
> >> >> population with known/hypothesized median) and the number of runs
> >> >> equaling the number of S's, this looks like a binomial to me
> (with
> >> some
> >> >> '-1's inserted in appropriate places.
> >> >>
> >> >> In case 2, this looks like a hypergeometric distribution, there
> >> would
> >> >> be n!/((n/2)!(n/2)!) possible permutations, just need to compute
> how
> >> >> many of those permutations result in x runs to get the
> probability.
> >> >> There is probably a way to think about this in terms of balls and
> >> urns,
> >> >> but I have not worked that out yet.
> >> >>
> >> >> Hope this helps,
> >> >>
> >> >> --
> >> >> Gregory (Greg) L. Snow Ph.D.
> >> >> Statistical Data Center
> >> >> Intermountain Healthcare
> >> >> greg.snow at imail.org
> >> >> 801.408.8111
> >> >>
> >> >>
> >> >> > -----Original Message-----
> >> >> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> >> >> > project.org] On Behalf Of Dale Steele
> >> >> > Sent: Wednesday, February 10, 2010 6:16 PM
> >> >> > To: R-help at r-project.org
> >> >> > Subject: [R] Code find exact distribution for runs test?
> >> >> >
> >> >> > I've been attempting to understand the one-sample run test for
> >> >> > randomness. ?I've found run.test{tseries} and
> run.test{lawstat}.
> >> >> Both
> >> >> > use a large sample approximation for distribution of the total
> >> number
> >> >> > of runs in a sample of n1 observations of one type and n2
> >> >> observations
> >> >> > of another type.
> >> >> >
> >> >> > I've been unable to find R code to generate the exact
> distribution
> >> >> and
> >> >> > would like to see how this could be done (not homework).
> >> >> >
> >> >> > For example, given the data:
> >> >> >
> >> >> > dtemp <- c(12, 13, 12, 11, 5, 2, -1, 2, -1, 3, 2, -6, -7, -7, -
> 12,
> >> -
> >> >> 9,
> >> >> > 6, 7, 10, 6, 1, 1, 3, 7, -2, -6, -6, -5, -2, -1)
> >> >> >
> >> >> > The Monte Carlo permutation approach seems to get me part way.
> >> >> >
> >> >> >
> >> >> > # calculate the number of runs in the data vector
> >> >> > nruns <- function(x) {
> >> >> > ? ? signs <- sign(x)
> >> >> > ? ? runs <- rle(signs)
> >> >> > ? ? r <- length(runs$lengths)
> >> >> > ? ? return(r)
> >> >> > }
> >> >> >
> >> >> > MC.runs <- function(x, nperm) {
> >> >> > RUNS <- numeric(nperm)
> >> >> > for (i in ?1:nperm) {
> >> >> > ? ? RUNS[i] <- nruns(sample(x))
> >> >> > }
> >> >> > ? ? cdf <- cumsum(table(RUNS))/nperm
> >> >> > ? ? return(list(RUNS=RUNS, cdf=cdf, nperm=nperm))
> >> >> > }
> >> >> >
> >> >> > MC.runs(dtemp, 100000)
> >> >> >
> >> >> > Thanks. ?--Dale
> >> >> >
> >> >> > ______________________________________________
> >> >> > R-help at r-project.org mailing list
> >> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> > PLEASE do read the posting guide http://www.R-
> project.org/posting-
> >> >> > guide.html
> >> >> > and provide commented, minimal, self-contained, reproducible
> code.
> >> >>
> >> >> ______________________________________________
> >> >> R-help at r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> PLEASE do read the posting guide http://www.R-
> project.org/posting-
> >> >> guide.html
> >> >> and provide commented, minimal, self-contained, reproducible
> code.
> >> >
> >
Here's a simple function (same idea):
f <- function(x){
lenx <- length(x)
negx <- sum(x < 0)
mat <- matrix(1, lenx, choose(lenx,negx))
for(i in seq_len(choose(lenx, negx))){
mat[combn(lenx,negx)[, i], i] <- -1
}
mat
}
x <- c(-1,-1,-1,1,1,1,1)
f(x)
[combn() is now in utils]
-Peter Ehlers
Greg Snow wrote:
Here is one quick way using the combinat package:
library(combinat)
tmpfun <- function(x) {
+ tmp <- rep(1,5) + tmp[x] <- -1 + tmp + }
combn(5,2, tmpfun)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] -1 -1 -1 -1 1 1 1 1 1 1 [2,] -1 1 1 1 -1 -1 -1 1 1 1 [3,] 1 -1 1 1 -1 1 1 -1 -1 1 [4,] 1 1 -1 1 1 -1 1 -1 1 -1 [5,] 1 1 1 -1 1 1 -1 1 -1 -1 Of course in this case the tmpfun function needs to be rewritten for each vector size, so is not generalizable.