Skip to content

Distribution Fitting and fitdistr()

3 messages · Lorenzo Isella, Krishna Kumar

#
Dear All,
I think I understood how to use the fitdistr() function to fit random
draws to a certain distribution.
I first tested it on a unimodal Gaussian distribution:

rm(list=ls())
library(MASS)
set.seed(123)
mydata<-rnorm(10000,2.0,4.0)
gauden<-function(x,mu,sig)
{
1/(sqrt(2*pi)*sig)*exp(-1/(2*sig^2)*(x-mu)^2)
}
myfit<-fitdistr(mydata,gauden,start=list(mu=5.0,sig=5.0),upper=c(20,20))
myfit2<-fitdistr(mydata,"normal")

seeing the (expected) agreement between myfit and myfit2.

Then I tested it on a set of draws from a mixture made up of three
Gaussian distributions.
This time I used 9 fitting parameters:
- 3 weights N1,N2,N3 for the components of the Gaussian mixture
- 3 standard deviations sig1,sig2,sig3
- 3 means mu1,mu2,mu3

Here is the code I used:

rm(list=ls()) 
library(MASS)
set.seed(123)
NN<-10000
three<-3 #number of gaussian sequences I want to generate
mysample<-matrix(nrow=NN,ncol=three)
sdvec<-seq(1:three)
sdvec[1]<-4
sdvec[2]<-3
sdvec[3]<-3.5
meanvec<-seq(1:three)
meanvec[1]<-20
meanvec[2]<-50
meanvec[3]<-70

for (i in 1:three)
{
 mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])
}
dim(mysample)<-c(3*NN,1)
plot(density(mysample,kern="gaussian"),lwd=2,col=300)

mydistr<-function(x,mu1,sig1,N1,mu2,sig2,N2,mu3,sig3,N3)
{
N1/(sqrt(2*pi)*sig1)*exp(-1/(2*sig1^2)*(x-mu1)^2)+
N2/(sqrt(2*pi)*sig2)*exp(-1/(2*sig2^2)*(x-mu2)^2)+
N3/(sqrt(2*pi)*sig3)*exp(-1/(2*sig3^2)*(x-mu3)^2)
}
myfit<-fitdistr(mysample,mydistr,start=list(sig1=2,sig2=2,sig3=2,
N1=0.3,N2=0.3,N3=0.4,mu1=25,mu2=55,mu3=76))

But this time the fitdistr() is not able to carry out the optimization
and the warnings do not tell (me at least) that much.
Am I mistaking something?  Furthermore, how can I be sure that the
optimization will lead to weights such that N1+N2+N3=1?
I have not been able to try the mixdist package or any other tool, but
any suggestion about how to fit this threemodal distribution (which
should not be rocket science for someone more experienced than myself)
is welcome.
Best Regards

Lorenzo
#
Lorenzo Isella wrote:

            
This only works for distributions that are specified below
 >>>
Distributions '"beta"', '"cauchy"', '"chi-squared"',
          '"exponential"', '"f"', '"gamma"', '"geometric"',
          '"log-normal"', '"lognormal"', '"logistic"', '"negative
          binomial"', '"normal"', '"Poisson"', '"t"' and '"weibull"'
          are recognised, case being ignored.
 >>>>>>>>
The package nor1Mix has density , distribution and rng for mixtures but 
no fitting so here is a  fitting function..

mydistr<-function(x,mu,sig,wt)
{
sum(wt*dnorm(x,mu,sig))
}

mydistrmle<-function(x, y = x) {

#mu goes from x1:3 sig goes x 4:6 and wt goes x 7:9
        f = -sum(log(mydistr(y, c(x[1],x[2],x[3]),     
c(x[4],x[5],x[6]),     c(x[7],x[8],x[9])  )))
        cat("\n Function value:  ", -f)
        cat("\n  Estimated parameters:       ", 
c(x[1],x[2],x[3]),c(x[4],x[5],x[6]),c(x[7],x[8],x[9]), "\n")

        f
    }
   


#generate 1000 samples from norMix  mixture MW.nm2
x<-rnorMix(1000,MW.nm2)
hist(rnorMix(1000,MW.nm2))
#  Give it a starting values
mu<-c(-0.2,0.4,0.9)
sigma<-c(0.2, 0.2, 0.6)
wt<-c(0.3,0.3,0.2)
r<-optim(c(mu,sigma,wt),mydistrmle,control=list(maxit=50000))
print(matrix(r$par,3,3))
print(MW.nm2)

You will have to add additional constraints on the "wt" etc.  ( see 
?constrOptim)

Further I don't think there is a guaranteed unique decomposition of 
variance between the three gaussians. Just try different starting values
and also is there a reason to stop at 3 gaussians ?..

Krishna
#
On Mon, 2006-04-24 at 23:09 -0400, Krishna Kumar wrote:
Thanks. There is not a particular reason to stop at three Gaussians
apart from the fact that I know that the data I will soon be given show
a 3 modal distribution.
Apart from implementing some extra constrains, are these sort of
problems numerically stable? I get plenty of warnings when I run your
code, although it executes.
Are those warnings anything to worry about or do they simply mean that
in some iterations the optimizer rejects the solution (though they seem
to deal with dnorm())?
Cheers

Lorenzo