Permutation or Bootstrap to obtain p-value for one sample
On Oct 9, 2011, at 12:00 , francesca casalino wrote:
Thank you very much to both Ken and Peter for the very helpful explanations.
Just to understand this better (sorry for repeating but I am also new in statistics?so please correct me where I am wrong):
Ken' method:
Random sampling of the mean, and then using these means to construct a distribution of means (the 'null' distribution), and I can then use this normal distribution and compare the population mean to my mean using, for example, z-score.
Of note: The initial distributions are not normal, so I thought I needed to base my calculations on the median, but I can use the mean to construct a normal distribution.
This would be defined a bootstrap test.
Peter's method:
Random sampling of the mean, and then comparing each sampled mean with the population mean and see if it is higher or equal to the difference between my sample and the population mean. This is a permutation test, but to actually get CI and a p-value I would need bootstrap method.
Did I understand this correctly?
I tried to start with Ken's approach for now, and followed his steps, but:
1) I get a lot of NaN in the sampling distribution, is this normal?
2) I think I am doing again something wrong when I try to find a p-value?This is what I did:
nreps=10000
mean.dist=rep(NA,nreps)
for(replication in 1:nreps)
{
my.sample=sample(population$Y, 250, replace=F)
#Peter mentioned that this sampling should be without replacement, so I went for that---
mean.for.rep=mean(my.sample) #mean for this replication
mean.dist[replication]=mean.for.rep #store the mean
}
hist(mean.dist,main="Null Dist of Means", col="chartreuse")
#Show the means in a nifty color
mean_dist= mean(mean.dist, na.rm=TRUE)
sd_pop= sd(mean.dist, na.rm=TRUE)
mean_sample= mean(population$Y, na.rm=TRUE)
You mean sample$Y, I hope.
z_stat= (mean_sample - mean_dist)/(sd_pop/sqrt(2089)) p_value= 2*pnorm(-abs(z_stat)) Is this correct?
No. First, notice that replicate() does all the "red tape" of the for loop for you. It's not a grave error to do what you're doing, but I did give you code, so maybe you should try it... Worse: The whole point of doing a permutation test is to compare the observed value directly to the actual permutation distribution. So you take your observed mean and see whether it falls in the middle or in the tails of the histogram. The p value can be estimated as the relative frequency of simulated means falling as far or further out in the tails as the observed one. You do not want to approximate the simulated distribution with a normal distribution. If that's what you wanted, you could do it with no simulation at all -- there are simple formulas relating the mean and variance of the sample mean to the mean and variance of the population, and you'd more or less wind up reinventing the two-sample t-test.
THANK YOU SO MUCH FOR ALL YOUR HELP!!
2011/10/9 Ken Hutchison <vicvoncastle at gmail.com>
Hi Francy,
A bootstrap test would likely be sufficient for this problem, but a one-sample t-test isn't advisable or necessary in my opinion. If you use a t-test multiple times you are making assumptions about the distribution of your data; more importantly, your probability of Type 1 error will be increased with each test. So, a valid thing to do would be to sample (computation for this problem won't be expensive so do alotta reps) and compare your mean to the null distribution of means. I.E.
nreps=10000
mean.dist=rep(NA,nreps)
for(replication in 1:nreps)
{
my.sample=sample(population, 2500, replace=T)
#replace could be false, depends on preference
mean.for.rep=mean(my.sample) #mean for this replication
mean.dist[replication]=mean.for.rep #store the mean
}
hist(mean.dist,main="Null Dist of Means", col="chartreuse")
#Show the means in a nifty color
You can then perform various tests given the null distribution, or infer from where your sample mean lies within the distribution or something to that effect. Also, if the distribution is normal, which is somewhat likely since it is a distribution of means: (shapiro.test or require(nortest) ad.test will let you know) you should be able to make inference from that using parametric methods (once) which will fit the truth a bit better than a t.test.
Hope that's helpful,
Ken Hutchison
On Sat, Oct 8, 2011 at 10:04 AM, francy <francy.casalino at gmail.com> wrote:
Hi,
I am having trouble understanding how to approach a simulation:
I have a sample of n=250 from a population of N=2,000 individuals, and I
would like to use either permutation test or bootstrap to test whether this
particular sample is significantly different from the values of any other
random samples of the same population. I thought I needed to take random
samples (but I am not sure how many simulations I need to do) of n=250 from
the N=2,000 population and maybe do a one-sample t-test to compare the mean
score of all the simulated samples, + the one sample I am trying to prove
that is different from any others, to the mean value of the population. But
I don't know:
(1) whether this one-sample t-test would be the right way to do it, and how
to go about doing this in R
(2) whether a permutation test or bootstrap methods are more appropriate
This is the data frame that I have, which is to be sampled:
df<-
i.e.
x y
1 2
3 4
5 6
7 8
. .
. .
. .
2,000
I have this sample from df, and would like to test whether it is has extreme
values of y.
sample1<-
i.e.
x y
3 4
7 8
. .
. .
. .
250
For now I only have this:
R=999 #Number of simulations, but I don't know how many...
t.values =numeric(R) #creates a numeric vector with 999 elements, which
will hold the results of each simulation.
for (i in 1:R) {
sample1 <- df[sample(nrow(df), 250, replace=TRUE),]
But I don't know how to continue the loop: do I calculate the mean for each
simulation and compare it to the population mean?
Any help you could give me would be very appreciated,
Thank you.
--
View this message in context: http://r.789695.n4.nabble.com/Permutation-or-Bootstrap-to-obtain-p-value-for-one-sample-tp3885118p3885118.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com