Approximating discrete distribution by continuous distribution

Dear all,

I have a discrete distribution showing how age is distributed across a
population using a certain set of bands:

Age <- matrix(c(74045062, 71978405, 122718362, 40489415), ncol=1,
dimnames=list(c("<18", "18-34", "35-64", "65+"),c()))
Age_dist <- Age/sum(Age)

For example, I know that 23.94% of all people are under 18 years old, 23.28%
are between 18 and 34 years old, and so forth.

I would like to find a continuous approximation of this discrete
distribution in order to estimate the probability that a person is for
example 16 years old.

Is there some automatic way in R through which this can be done? I tried a
kernel density estimate of the histogram, but this does not seem to
provide what I'm looking for.
This is not really an R question, but a statistics one.  It is almost
guesswork: if, for example, these were drivers in the UK, the answer would
be 0.  So you need to supply some information about the shape of the
distribution of <18 year olds.

You have estimates of the cumulative distribution function at c(0, 18,
35, 65, Inf) (or some better upper limit).  You want to interpolate it.
You could use linear interpolation (approx() or approxfun()), a monotone
spline interpolation (spline() or splinefun(), e.g. with method = "hyman"),
or any other interpolation method which meets your needs.  But whatever
you use, you will be supplying a lot of information not actually in your
data.
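
As a rough sketch of the interpolation idea, the band counts can be turned into CDF values at the band edges and interpolated. The upper limit of 100 is an assumption that is not in the data, and the names breaks, cdf, F_lin and F_mono are only illustrative. The two curves can give different answers for age 16, which shows how much the result depends on the interpolation chosen.

```r
# Band counts from the original post
Age <- c("<18" = 74045062, "18-34" = 71978405,
         "35-64" = 122718362, "65+" = 40489415)

# Band edges; the upper limit of 100 is an assumption, not part of the data
breaks <- c(0, 18, 35, 65, 100)

# Empirical CDF at the band edges (starts at 0, ends at 1)
cdf <- c(0, unname(cumsum(Age) / sum(Age)))

# Piecewise-linear CDF: equivalent to uniform ages within each band
F_lin <- approxfun(breaks, cdf)

# Monotone cubic spline CDF (Fritsch-Carlson)
F_mono <- splinefun(breaks, cdf, method = "monoH.FC")

# Estimated share of people aged 16, i.e. P(16 <= age < 17)
p16_lin  <- F_lin(17)  - F_lin(16)
p16_mono <- F_mono(17) - F_mono(16)

print(c(linear = p16_lin, monotone_spline = p16_mono))
```

With linear interpolation the result is simply the share of the <18 band divided by 18, about 1.33%.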
Thanks very much for your help,

Michael

Michael Haenlein
Associate Professor of Marketing
ESCP Europe
Paris, France



Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

On 22/01/2013 11:49, Michael Haenlein wrote:
[...]

Agreed. The linear interpolation method is sometimes described as the "sum polygon", and sort of assumes that there is a uniform distribution of ages in each range. I.e., the number of 16 year olds would be 1/18 of the 0-17 y.o. However, I'd feel somewhat uneasy about doing this with such wide age-bands.

There is also the option of fitting a standard distribution like the Weibull to the data and using that. The mle() function (in the stats4 package) should do this if you write out the log-likelihood using something like 

dmultinom(Age, prob=diff(pweibull(c(0,18,35,65,Inf), shape, scale)), log=TRUE)

With a quarter of a billion observations, the fit might be less than perfect, but on the other hand, extracting more than two parameters from four data points sounds a bit ominous.
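
A minimal sketch of the Weibull fit, under some assumptions of mine: the parameters are optimised on the log scale so that shape and scale stay positive, the starting values (shape 1, scale 40) are arbitrary guesses, and Nelder-Mead is used instead of the default optimiser so that parameter values giving a zero band probability do not stop the search. The names nll, lshape and lscale are illustrative only.

```r
library(stats4)

# Band counts from the original post and the band edges
Age    <- c(74045062, 71978405, 122718362, 40489415)
breaks <- c(0, 18, 35, 65, Inf)

# Negative multinomial log-likelihood of the four bands under a Weibull;
# lshape and lscale are log(shape) and log(scale)
nll <- function(lshape = 0, lscale = log(40)) {
  p <- diff(pweibull(breaks, shape = exp(lshape), scale = exp(lscale)))
  -dmultinom(Age, prob = p, log = TRUE)
}

# Starting values are guesses, not taken from the thread
fit <- mle(nll, start = list(lshape = 0, lscale = log(40)),
           method = "Nelder-Mead")

# Back-transform to the Weibull shape and scale
est <- exp(coef(fit))

# Fitted band probabilities, to compare against Age / sum(Age)
p_fit <- diff(pweibull(breaks, shape = est[["lshape"]], scale = est[["lscale"]]))

# Estimated share of people aged 16, i.e. P(16 <= age < 17)
p16 <- diff(pweibull(c(16, 17), shape = est[["lshape"]], scale = est[["lscale"]]))

print(rbind(observed = Age / sum(Age), fitted = p_fit))
print(p16)
```

Comparing the observed and fitted band shares gives a quick impression of how well a two-parameter Weibull can follow the four bands.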
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com