Survey - Cluster Sampling

Thomas Lumley · 2005-06-16T15:01:08Z

On Thu, 16 Jun 2005, Mark Hempelmann wrote:
> Dear WizaRds,
>
> I am struggling to compute correctly a cluster sampling design. I want
> to do one stage clustering with different parametric changes:
>
> Let M be the total number of clusters in the population, and m the
> number sampled. Let N be the total of elements in the population and n
> the number sampled. y are the values sampled. This is my example data:
>
> clus1 weight=rep


Yes.

The fpc term should be the total number of clusters, so 23 rather than 72.
   clus1$M <- rep(23, 9)
   dclus2a <- svydesign(id=~cluster, data=clus1, fpc=~M)
   svymean(~y, dclus2a)

Now, this still gives 44.778, because each observation still has the same 
weight.  It describes a one-stage cluster sampling design where each 
cluster has only three elements.  This is an equal-probability sampling 
design. Any equal-probability sampling design will give the same estimated 
mean.
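
The point about equal weights can be checked by hand, without the survey 
package. This is only a sketch in base R: the y values are invented (they 
are not the clus1 data from the question), and it assumes each element of 
a sampled cluster gets weight M/m, the inverse of the cluster sampling 
fraction.

```r
# Hypothetical sample: 3 clusters of 3 elements each.
# These y values are invented for illustration; they are not the poster's data.
y       <- c(12, 15, 11, 40, 38, 45, 70, 66, 72)
cluster <- rep(1:3, each = 3)
m       <- length(unique(cluster))   # clusters sampled

# Weighted mean under one-stage cluster sampling with M population clusters.
# Assumption: every element of a sampled cluster gets the same weight, M / m.
wmean <- function(M) {
  w <- rep(M / m, length(y))
  sum(w * y) / sum(w)
}

mean_M23 <- wmean(23)   # population size given as 23
mean_M72 <- wmean(72)   # population size given as 72
c(unweighted = mean(y), M23 = mean_M23, M72 = mean_M72)
```

The constant weight cancels between numerator and denominator, so all 
three values agree whatever population size is supplied.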

If your design was to take a simple random sample of three clusters and 
then take all the elements in each cluster, then dclus2a is giving the 
correct mean (well, the one I wanted it to give). Estimates of the 
population total will be different, but not the mean.

Your expected estimate of the mean is also a reasonable one. In survey 
statistics there is often more than one reasonable estimator even for 
something as simple as the mean.  My estimator is 
sum(weights*y)/sum(weights), which has some practical advantages: it is 
easy to generalise to more complex designs (including things like 
post-stratification), it can be computed without knowing the sampling 
design (which is important when using replicate weights to compute 
variances), it is the definition of the mean that agrees with linear 
regression models, and it is what Stata uses, making it easier to compare 
results.

Your estimator uses the expected value of the denominator rather than the 
observed value. This probably implies that your estimator is 
design-unbiased and mine isn't.  Since there aren't design-unbiased 
estimators for most statistics more complicated than the mean, I don't 
worry so much about it.
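
The difference between the two estimators can be seen in a small base R 
sketch. The y values are invented, the weights are assumed to be M/m, and 
N = 72 is assumed to be the number of elements in the population; none of 
this reproduces the original clus1 data.

```r
# Hypothetical sample (invented values, not the poster's data).
y <- c(12, 15, 11, 40, 38, 45, 70, 66, 72)
M <- 23                      # clusters in the population
m <- 3                       # clusters sampled
N <- 72                      # assumed number of elements in the population
w <- rep(M / m, length(y))   # assumed equal weights, M / m

total_hat <- sum(w * y)      # estimated population total

# Divides by the observed sum of weights (the estimated population size).
ratio_mean <- total_hat / sum(w)

# Divides by the known population size, the expected value of sum(w).
known_N_mean <- total_hat / N

c(sum_w = sum(w), ratio = ratio_mean, known_N = known_N_mean)
```

Here sum(w) is 69 rather than 72, so the two means differ even though 
they share the same numerator.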


You might also have had a sampling design where you took a simple random 
sample of three clusters and then up to three elements from each cluster.
   dclus2b<-svydesign(id=~cluster+id, fpc=~M+nl, data=clus1)
This gives the same mean as dclus2a, because in fact you sampled 100% of 
each sampled cluster.
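
A hand calculation of the weights shows why. This sketch assumes the 
weight is the product of the inverse sampling fractions at each stage, 
and that each sampled cluster has nl = 3 elements, all of them taken; the 
variable names are hypothetical.

```r
M  <- 23                  # clusters in the population
m  <- 3                   # clusters sampled (stage 1)
nl <- c(3, 3, 3)          # assumed size of each sampled cluster
n_sampled <- c(3, 3, 3)   # elements taken from each sampled cluster (stage 2)

p1 <- m / M               # stage-1 sampling fraction
p2 <- n_sampled / nl      # stage-2 sampling fraction within each cluster

w_two_stage <- 1 / (p1 * p2)           # per-element weight, by cluster
w_one_stage <- rep(M / m, length(nl))  # weight when whole clusters are taken

rbind(w_two_stage, w_one_stage)
```

With a second-stage fraction of 1 the two sets of weights coincide, so 
the weighted mean cannot change.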

Again, fpc should be M rather than N. The help page says that the relevant 
population size is in "sampling units" (i.e., clusters). It used to say PSUs 
before the package was extended to handle multistage fpcs; that was 
probably clearer, but it would no longer be true.

Apart from that you aren't doing anything wrong. The mean should still be 
the same as the unweighted mean because you are giving each observation 
the same weight. And it is.

The total won't be the same as dclus2a and dclus2b, because you are now 
telling R the population size in elements as well as in PSUs.
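
As a rough illustration in base R, compare weights derived from the 
cluster counts alone (M/m) with weights derived from the element counts 
(N/n). The y values are invented, and N/n weights are only an assumption 
about how the element-level population size enters; the design from the 
question is not reproduced here.

```r
# Hypothetical sample (invented values, not the poster's data).
y <- c(12, 15, 11, 40, 38, 45, 70, 66, 72)
M <- 23; m <- 3            # clusters: population and sample
N <- 72; n <- length(y)    # elements: assumed population size and sample size

w_clusters <- rep(M / m, n)   # weights implied by cluster counts only
w_elements <- rep(N / n, n)   # weights implied by element counts (assumed)

mean_clusters  <- sum(w_clusters * y) / sum(w_clusters)
mean_elements  <- sum(w_elements * y) / sum(w_elements)
total_clusters <- sum(w_clusters * y)
total_elements <- sum(w_elements * y)

c(mean_clusters = mean_clusters, mean_elements = mean_elements,
  total_clusters = total_clusters, total_elements = total_elements)
```

Both means equal the unweighted mean, but the totals scale with the 
weights and so differ.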


 	-thomas
