Approximating discrete distribution by continuous distribution
3 messages · Michael Haenlein, Brian Ripley, Peter Dalgaard
On 22/01/2013 11:49, Michael Haenlein wrote:
Dear all,
I have a discrete distribution showing how age is distributed across a
population using a certain set of bands:
Age <- matrix(c(74045062, 71978405, 122718362, 40489415), ncol=1,
dimnames=list(c("<18", "18-34", "35-64", "65+"),c()))
Age_dist <- Age/sum(Age)
For example I know that 23.94% of all people are between 0-18 years, 23.28%
between 18-34 years and so forth.
I would like to find a continuous approximation of this discrete
distribution in order to estimate the probability that a person is for
example 16 years old.
Is there some automatic way in R through which this can be done? I tried a
Kernel density estimation of the histogram but this does not seem to
provide what I'm looking for.
This is not really an R question, but a statistics one. It is almost guesswork: if for example these were drivers in the UK, the answer is 0. So you need to supply some information about the shape of the distribution of <18 year olds.

You have estimates of the cumulative distribution function at c(0, 18, 35, 65, Inf) (or some better upper limit). You want to interpolate it. You could use linear interpolation (approx[fun]) or a monotone spline interpolation (spline[fun]) or any other interpolation method which meets your needs. But whatever you use, you will be supplying a lot of information not actually in your data.
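The interpolation suggested above can be sketched roughly as follows. This is only an illustration, not a recommendation of either method: the upper age limit of 100 is an assumption that is not in the data, "the probability that a person is 16" is read here as P(16 <= age < 17), and the names breaks, cdf, F_lin and F_mono are made up for the example.

```r
# Band counts from the original post, as a plain vector
Age <- c(74045062, 71978405, 122718362, 40489415)

# Band edges; the upper limit of 100 is an ASSUMPTION, not part of the data
breaks <- c(0, 18, 35, 65, 100)

# Empirical CDF at the band edges
cdf <- c(0, cumsum(Age) / sum(Age))

# Linear interpolation of the CDF (ages uniform within each band)
F_lin <- approxfun(breaks, cdf)

# Monotone cubic interpolation; method must be set explicitly,
# since the default splinefun() method is not guaranteed monotone
F_mono <- splinefun(breaks, cdf, method = "monoH.FC")

# P(16 <= age < 17) under each interpolant
p16_lin  <- F_lin(17)  - F_lin(16)
p16_mono <- F_mono(17) - F_mono(16)

print(c(linear = p16_lin, monotone_spline = p16_mono))
```

Under linear interpolation the result is simply the "<18" share divided by 18. The spline gives a different number from the same four counts, which illustrates how much of the answer comes from the interpolation method rather than from the data.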
Thanks very much for your help,
Michael

Michael Haenlein
Associate Professor of Marketing
ESCP Europe
Paris, France
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
On Jan 22, 2013, at 13:45 , Prof Brian Ripley wrote:
This is not really an R question, but a statistics one. It is almost guesswork: if for example these were drivers in the UK, the answer is 0. So you need to supply some information about the shape of the distribution of <18 year olds.

You have estimates of the cumulative distribution function at c(0, 18, 35, 65, Inf) (or some better upper limit). You want to interpolate it. You could use linear interpolation (approx[fun]) or a monotone spline interpolation (spline[fun]) or any other interpolation method which meets your needs. But whatever you use, you will be supplying a lot of information not actually in your data.
Agreed. The linear interpolation method is sometimes described as the "sum polygon", and sort of assumes that there is a uniform distribution of ages in each range. I.e., the number of 16 year olds would be 1/18 of the 0-17 y.o. However, I'd feel somewhat uneasy about doing this with such wide age-bands.

There is also the option of fitting a standard distribution like the Weibull to the data and using that. The mle() function should do this if you write out the log-likelihood using something like

    dmultinom(Age, prob=diff(pweibull(c(0,18,35,65,Inf), shape, scale)), log=TRUE)

With a quarter of a billion observations, the fit might be less than perfect, but on the other hand, extracting more than two parameters from four data points sounds a bit ominous.
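A minimal sketch of the Weibull fit described above, under a few assumptions: it calls optim() directly instead of mle() (a substitution made to keep the example in base R), it optimises over log(shape) and log(scale) so that both stay positive, and the starting values (shape 1, scale 40) are arbitrary guesses. The names nll, start, fit and p_hat are made up for the example.

```r
# Band counts from the original post, as a plain vector
Age <- c(74045062, 71978405, 122718362, 40489415)
breaks <- c(0, 18, 35, 65, Inf)

# Negative multinomial log-likelihood of the band counts under a Weibull;
# par holds log(shape) and log(scale)
nll <- function(par) {
  p <- diff(pweibull(breaks, shape = exp(par[1]), scale = exp(par[2])))
  -dmultinom(Age, prob = p, log = TRUE)
}

# Arbitrary starting values (an ASSUMPTION): shape = 1, scale = 40
start <- c(log(1), log(40))
fit <- optim(start, nll, method = "Nelder-Mead", control = list(maxit = 2000))

shape_hat <- exp(fit$par[1])
scale_hat <- exp(fit$par[2])

# Fitted band probabilities next to the observed shares
p_hat <- diff(pweibull(breaks, shape_hat, scale_hat))
print(rbind(observed = Age / sum(Age), fitted = p_hat))

# P(16 <= age < 17) under the fitted Weibull
p16_weibull <- pweibull(17, shape_hat, scale_hat) - pweibull(16, shape_hat, scale_hat)
print(p16_weibull)
```

Comparing the observed and fitted rows gives a quick view of how well a two-parameter curve can reproduce the four band shares.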
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com