Approximating discrete distribution by continuous distribution

3 messages · Michael Haenlein, Brian Ripley, Peter Dalgaard

#
On 22/01/2013 11:49, Michael Haenlein wrote:
This is not really an R question, but a statistics one.  It is almost 
guesswork: if for example these were drivers in the UK, the answer is 0. 
  So you need to supply some information about the shape of the 
distribution of <18 year olds.

You have estimates of the cumulative distribution function at c(0, 18, 
35, 65, Inf) (or some better upper limit).  You want to interpolate it. 
  You could use linear interpolation (approx[fun]) or a monotone spline 
interpolation (spline[fun]) or any other interpolation method which 
meets your needs.  But whatever you use, you will be supplying a lot of 
information not actually in your data.
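
A minimal sketch of the two interpolation options mentioned above, under stated assumptions: the band counts are made up for illustration (the thread gives none), and 100 is an assumed stand-in for the "better upper limit". Both approxfun() and splinefun(method = "hyman") are in base R's stats package.

```r
# Hypothetical counts per age band (0-17, 18-34, 35-64, 65+); not from the thread
counts <- c(55000, 60000, 100000, 35000)
# Band boundaries; the upper limit of 100 is an assumption replacing Inf
breaks <- c(0, 18, 35, 65, 100)

# Estimated cumulative distribution function at the boundaries
cdf <- c(0, cumsum(counts)) / sum(counts)

F_lin  <- approxfun(breaks, cdf)                    # piecewise linear
F_mono <- splinefun(breaks, cdf, method = "hyman")  # monotone cubic spline

# Estimated number of people aged 16, i.e. in [16, 17), under each method
n16_lin  <- sum(counts) * (F_lin(17)  - F_lin(16))
n16_mono <- sum(counts) * (F_mono(17) - F_mono(16))
c(linear = n16_lin, hyman = n16_mono)
```

The linear version simply spreads the first band evenly over its 18 years; the spline gives a different number from the same four data points, which illustrates how much of the answer comes from the chosen method rather than from the data.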

#
On Jan 22, 2013, at 13:45 , Prof Brian Ripley wrote:

Agreed. The linear interpolation method is sometimes described as the "sum polygon", and sort of assumes that there is a uniform distribution of ages in each range. I.e., the number of 16 year olds would be 1/18 of the 0-17 y.o. However, I'd feel somewhat uneasy about doing this with such wide age-bands.

There is also the option of fitting a standard distribution like the Weibull to the data and using that. The mle() function (in package stats4) should do this if you write out the log-likelihood (negated, since mle() expects the negative log-likelihood) using something like 

dmultinom(Age, prob=diff(pweibull(c(0,18,35,65,Inf), shape, scale)), log=TRUE)

With a quarter of a billion observations, the fit might be less than perfect, but on the other hand, extracting more than two parameters from four data points sounds a bit ominous.
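
A rough sketch of the Weibull fit described above, not a definitive implementation. The counts in Age are hypothetical (and far smaller than a quarter of a billion), the start values are guesses, and the log-scale parametrization and the Nelder-Mead method are choices of this sketch, made so that the optimizer cannot wander into negative shape or scale values.

```r
library(stats4)

# Hypothetical counts per age band (0-17, 18-34, 35-64, 65+); not from the thread
Age    <- c(55000, 60000, 100000, 35000)
breaks <- c(0, 18, 35, 65, Inf)

# Negative multinomial log-likelihood; parameters are on the log scale
# (an assumption of this sketch) so that shape and scale stay positive
nll <- function(lshape, lscale) {
  p <- diff(pweibull(breaks, shape = exp(lshape), scale = exp(lscale)))
  -dmultinom(Age, prob = p, log = TRUE)
}

fit <- mle(nll, start = list(lshape = log(1.5), lscale = log(45)),
           method = "Nelder-Mead")

# Back-transform the estimates to the usual shape and scale
cf    <- coef(fit)
shape <- exp(cf[["lshape"]])
scale <- exp(cf[["lscale"]])

# Compare observed band proportions with the fitted ones
p_fit <- diff(pweibull(breaks, shape, scale))
round(rbind(observed = Age / sum(Age), fitted = p_fit), 3)

# Fitted share of the population younger than 16
pweibull(16, shape, scale)
```

With four bands and two parameters, comparing the observed and fitted rows is about the only check of fit available.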