On Thu, May 03, 2012 at 03:08:00PM +0200, Kehl D?niel wrote:
Dear List-members,
I have a problem where I have to estimate a mean, or a sum of a
population but for some reason it contains a huge amount of zeros.
I cannot give real data but I constructed a toy example as follows
N1<- 100000
N2<- 3000
x1<- rep(0,N1)
x2<- rnorm(N2,300,100)
x<- c(x1,x2)
n<- 1000
x_sample<- sample(x,n,replace=FALSE)
I want to estimate the sum of x based on x_sample (not knowing N1 and N2
but their sum (N) only).
The sample mean has a huge standard deviation I am looking for a better
estimator.
Hi.
I do not know the exact answer, but let me formulate the following observation.
If the question is redefined to estimate the mean of nonzero numbers, then
an estimate is mean(x_sample[x_sample != 0]). Its standard deviation in your
situation may be estimated as
res<- rep(NA, times=1000)
for (i in seq.int(along=res)) {
x_sample<- sample(x,n,replace=FALSE)
res[i]<- mean(x_sample[x_sample != 0])
}
sd(res)
[1] 18.72677 # this varies with the seed a bit
The observation is that this cannot be improved much, since the estimate
is based on a very small sample. The average size of the sample of nonzero
values is N2/(N1+N2)*n = 29.1. So, the standard deviation should be
something close to 100/sqrt(29.1) = 18.5376.
Petr Savicky.