Physical or Statistical Explanation for the "Funnel" Plot? - R-help

Thu, Mar 26, 2009 7:44 PM

The R code below produces (after running for a few minutes on a decent computer) the plot shown at the following location:

http://n2.nabble.com/Is-there-a-physical-and-quantitative-explanation-for-this-plot--td2542321.html

I'm just taking the mean of each group of random normal values, where the group size is increased from 1 to 10,000.  The means appear to converge quickly and then show a pretty steady variance out to a group size of 10,000.

I'm just wondering if there is a statistical explanation out there for this convergence and whether it has been explored further.  Thanks again.

# First case: groups of size 1, so each "mean" is a single observation
N <- 100000
X <- rnorm(N)
step_size <- 1

# Group labels: N %/% step_size complete groups of step_size observations each
g <- rep(seq_len(N %/% step_size), each = step_size)

# Mean of each group (observations left over after the last complete group are dropped)
tmp_output <- tapply(X[seq_along(g)], g, mean)

# Plot every group mean at x = step_size
tmp_x_vals <- rep(step_size, length(tmp_output))
plot(tmp_x_vals, tmp_output, xlim = c(0, 10000))

# Remaining cases: repeat for group sizes 2 to 10000 and add the points
# to the existing plot (group size 1 is already drawn above)
for (ii in 2:10000)
{
	step_size <- ii

	# Groups
	g <- rep(seq_len(N %/% step_size), each = step_size)

	# The result
	tmp_output <- tapply(X[seq_along(g)], g, mean)

	tmp_x_vals <- rep(step_size, length(tmp_output))
	points(tmp_x_vals, tmp_output)
}

Mike Miller

Thu, Mar 26, 2009 9:34 PM

On Thu, 26 Mar 2009, Jason Rupert wrote:

I don't have time to study your code, but it sounds like you are taking 
random normal variables with mean 0 and variance 1, but then taking the 
mean for sets of those.  We know exactly the distribution for the mean of 
the "set" (a.k.a., "sample").  The mean has a normal distribution with 
mean 0 and variance 1/N where N is the size of the sample.  When you allow 
N to vary, you produce a mixture of random normal variables all having 
mean 0 but with different variances.  The plot you show looks correct -- 
the distributions in the mixture that have small variance pile up in the 
middle, while those with greater variance form the long tails.  You could 
get a lot of different shapes depending on the distribution of N.  But 
save yourself some time.  Instead of making N normal variables and taking 
the mean, just make one and divide it by sqrt(N) -- that will give you 
a variable with *exactly* the same distribution.
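
The shortcut described above can be checked with a small simulation. This is only a sketch under assumptions: the names `sizes` and `reps` and the particular sample sizes are made up for illustration and are not taken from the original code. It compares the spread of simulated means with the spread of a single normal draw divided by sqrt(n). The two agree in distribution, not value for value.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility

sizes <- c(1, 4, 25, 100, 400)   # hypothetical sample sizes to compare
reps  <- 5000                    # simulated means per sample size

# Long way: draw n standard normals and average them, reps times
mean_sd <- sapply(sizes, function(n) {
  sd(colMeans(matrix(rnorm(n * reps), nrow = n)))
})

# Shortcut: draw one standard normal and divide it by sqrt(n), reps times
shortcut_sd <- sapply(sizes, function(n) sd(rnorm(reps) / sqrt(n)))

# Both columns should be close to the theoretical value 1/sqrt(n)
comparison <- data.frame(n = sizes, theory = 1 / sqrt(sizes),
                         mean_sd = mean_sd, shortcut_sd = shortcut_sd)
print(comparison)
```

The shortcut needs one random draw per plotted point instead of n, which is where the time saving comes from.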

Your graph looks a little weird - first, why turn it sideways?  We 
normally plot density on the ordinate, not on the abscissa.  Second, there 
is a thick black bar on the left, but that seems to be an artifact because 
at least half of it is below zero -- how can that happen?

Mike

Thomas Lumley

Fri, Mar 27, 2009 12:55 AM

On Thu, 26 Mar 2009, Jason Rupert wrote:

Part of the convergence is just that the standard deviation of a mean of N observations is proportional to 1/sqrt(N). In your case the distributions are all exactly Normal; the same convergence would occur with other distributions, but you would also see the change in shape from left to right as the distribution converged to Normal.
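
Both points in this paragraph can be illustrated with a non-Normal example. The sketch below is an assumption-laden illustration, not part of the original thread: it uses centred exponential draws (mean 0, variance 1) as the "other distribution", and the helper `skewness` is defined here because base R does not provide one. The standard deviation of the mean should track 1/sqrt(n), while the skewness should shrink toward 0 as the shape approaches Normal.

```r
set.seed(2)                      # arbitrary seed, only for reproducibility

# Sample skewness; hypothetical helper defined for this example
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

sizes <- c(1, 4, 16, 64, 256)    # hypothetical sample sizes
reps  <- 10000                   # simulated means per sample size

# rexp(n) - 1 has mean 0 and variance 1 but is strongly right-skewed
stats <- t(sapply(sizes, function(n) {
  m <- colMeans(matrix(rexp(n * reps) - 1, nrow = n))
  c(n = n, sd = sd(m), sd_theory = 1 / sqrt(n), skew = skewness(m))
}))
print(round(stats, 3))
```

In a funnel plot of such means, the cloud would look lopsided at the left and become symmetric toward the right.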

There are also some plotting artifacts due to the size of the points.  The apparent stabilization at large N (and the wide vertical bar at zero that Marc Schwartz commented on) are due partly to the slow convergence of 1/sqrt(N) but largely to the fact that the plotted width can't be smaller than the width of a point.
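
To see how slowly 1/sqrt(N) shrinks at large N, it helps to tabulate it. This is a minimal sketch; the chosen values of `n` are arbitrary.

```r
# Theoretical standard deviation of a mean of n standard-normal observations
n <- c(1, 10, 100, 1000, 2500, 5000, 10000)
sd_of_mean <- 1 / sqrt(n)
print(data.frame(n = n, sd_of_mean = sd_of_mean))

# Going from n = 2500 to n = 10000 only halves the standard deviation
ratio <- sd_of_mean[n == 2500] / sd_of_mean[n == 10000]
print(ratio)
```

Over the right three quarters of the x axis the band narrows only from about 0.02 to 0.01, which is easily hidden by the default plotting symbol. Redrawing the points with `pch = "."` (the smallest built-in symbol) is one way to reduce that effect.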

When I draw funnel plots like this for whole-genome association data I use the 'hexbin' package, which doesn't have these artifacts, is much faster, and produces smaller graphics files.
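
A hexagonally binned version of the funnel might look roughly like the sketch below. It is not the code used for the genome-wide plots mentioned above. It assumes the hexbin package is installed (the plotting step is skipped otherwise), and the names `n_vals` and `means`, the sample sizes, and the bin count are all made up for illustration. The means are generated by dividing a standard normal by sqrt(n) rather than by averaging n values.

```r
set.seed(3)                               # arbitrary seed, only for reproducibility

# 20 simulated means for each sample size from 1 to 1000 (hypothetical choice)
n_vals <- rep(1:1000, each = 20)
means  <- rnorm(length(n_vals)) / sqrt(n_vals)

# Bin the points into hexagons and plot the counts, if hexbin is available
if (requireNamespace("hexbin", quietly = TRUE)) {
  library(hexbin)
  hb <- hexbin(n_vals, means, xbins = 50,
               xlab = "sample size", ylab = "sample mean")
  plot(hb)
}
```

Because each hexagon is shaded by the number of points it contains, overplotting no longer sets a minimum visible width, and the file stores bin counts rather than every point.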

     -thomas


Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle