Dear All,
I think I understood how to use the fitdistr() function to fit random
draws to a certain distribution.
I first tested it on a unimodal Gaussian distribution:
rm(list=ls())
library(MASS)
set.seed(123)
mydata<-rnorm(10000,2.0,4.0)
gauden<-function(x,mu,sig)
{
1/(sqrt(2*pi)*sig)*exp(-1/(2*sig^2)*(x-mu)^2)
}
myfit<-fitdistr(mydata,gauden,start=list(mu=5.0,sig=5.0),upper=c(20,20))
myfit2<-fitdistr(mydata,"normal")
seeing the (expected) agreement between myfit and myfit2.
Then I tested it on a set of draws from a mixture made up of three
Gaussian distributions.
This time I used 9 fitting parameters:
- 3 weights N1,N2,N3 for the components of the Gaussian mixture
- 3 standard deviations sig1,sig2,sig3
- 3 means mu1,mu2,mu3
Here is the code I used:
rm(list=ls())
library(MASS)
set.seed(123)
NN<-10000
three<-3 #number of gaussian sequences I want to generate
mysample<-matrix(nrow=NN,ncol=three)
sdvec<-seq(1:three)
sdvec[1]<-4
sdvec[2]<-3
sdvec[3]<-3.5
meanvec<-seq(1:three)
meanvec[1]<-20
meanvec[2]<-50
meanvec[3]<-70
for (i in 1:three)
{
mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])
}
dim(mysample)<-c(3*NN,1)
plot(density(mysample,kern="gaussian"),lwd=2,col=300)
mydistr<-function(x,mu1,sig1,N1,mu2,sig2,N2,mu3,sig3,N3)
{
N1/(sqrt(2*pi)*sig1)*exp(-1/(2*sig1^2)*(x-mu1)^2)+
N2/(sqrt(2*pi)*sig2)*exp(-1/(2*sig2^2)*(x-mu2)^2)+
N3/(sqrt(2*pi)*sig3)*exp(-1/(2*sig3^2)*(x-mu3)^2)
}
myfit<-fitdistr(mysample,mydistr,start=list(sig1=2,sig2=2,sig3=2,
N1=0.3,N2=0.3,N3=0.4,mu1=25,mu2=55,mu3=76))
But this time the fitdistr() is not able to carry out the optimization
and the warnings do not tell (me at least) that much.
Am I mistaking something? Furthermore, how can I be sure that the
optimization will lead to weights such that N1+N2+N3=1?
I have not been able to try the mixdist package or any other tool, but
any suggestion about how to fit this threemodal distribution (which
should not be rocket science for someone more experienced than myself)
is welcome.
Best Regards
Lorenzo
Distribution Fitting and fitdistr()
3 messages · Lorenzo Isella, Krishna Kumar
Lorenzo Isella wrote:
Dear All, I think I understood how to use the fitdistr() function to fit random draws to a certain distribution. I first tested it on a unimodal Gaussian distribution:
This only works for distributions that are specified below
>>>
Distributions '"beta"', '"cauchy"', '"chi-squared"',
'"exponential"', '"f"', '"gamma"', '"geometric"',
'"log-normal"', '"lognormal"', '"logistic"', '"negative
binomial"', '"normal"', '"Poisson"', '"t"' and '"weibull"'
are recognised, case being ignored.
>>>>>>>>
Then I tested it on a set of draws from a mixture made up of three Gaussian distributions. This time I used 9 fitting parameters: - 3 weights N1,N2,N3 for the components of the Gaussian mixture - 3 standard deviations sig1,sig2,sig3 - 3 means mu1,mu2,mu3
The package nor1Mix has density , distribution and rng for mixtures but
no fitting so here is a fitting function..
mydistr<-function(x,mu,sig,wt)
{
sum(wt*dnorm(x,mu,sig))
}
mydistrmle<-function(x, y = x) {
#mu goes from x1:3 sig goes x 4:6 and wt goes x 7:9
f = -sum(log(mydistr(y, c(x[1],x[2],x[3]),
c(x[4],x[5],x[6]), c(x[7],x[8],x[9]) )))
cat("\n Function value: ", -f)
cat("\n Estimated parameters: ",
c(x[1],x[2],x[3]),c(x[4],x[5],x[6]),c(x[7],x[8],x[9]), "\n")
f
}
#generate 1000 samples from norMix mixture MW.nm2
x<-rnorMix(1000,MW.nm2)
hist(rnorMix(1000,MW.nm2))
# Give it a starting values
mu<-c(-0.2,0.4,0.9)
sigma<-c(0.2, 0.2, 0.6)
wt<-c(0.3,0.3,0.2)
r<-optim(c(mu,sigma,wt),mydistrmle,control=list(maxit=50000))
print(matrix(r$par,3,3))
print(MW.nm2)
You will have to add additional constraints on the "wt" etc. ( see
?constrOptim)
Further I don't think there is a guaranteed unique decomposition of
variance between the three gaussians. Just try different starting values
and also is there a reason to stop at 3 gaussians ?..
Krishna
On Mon, 2006-04-24 at 23:09 -0400, Krishna Kumar wrote:
Lorenzo Isella wrote:
Dear All, I think I understood how to use the fitdistr() function to fit random draws to a certain distribution. I first tested it on a unimodal Gaussian distribution:
This only works for distributions that are specified below
>>>
Distributions '"beta"', '"cauchy"', '"chi-squared"',
'"exponential"', '"f"', '"gamma"', '"geometric"',
'"log-normal"', '"lognormal"', '"logistic"', '"negative
binomial"', '"normal"', '"Poisson"', '"t"' and '"weibull"'
are recognised, case being ignored.
>>>>>>>>
Then I tested it on a set of draws from a mixture made up of three Gaussian distributions. This time I used 9 fitting parameters: - 3 weights N1,N2,N3 for the components of the Gaussian mixture - 3 standard deviations sig1,sig2,sig3 - 3 means mu1,mu2,mu3
The package nor1Mix has density , distribution and rng for mixtures but
no fitting so here is a fitting function..
mydistr<-function(x,mu,sig,wt)
{
sum(wt*dnorm(x,mu,sig))
}
mydistrmle<-function(x, y = x) {
#mu goes from x1:3 sig goes x 4:6 and wt goes x 7:9
f = -sum(log(mydistr(y, c(x[1],x[2],x[3]),
c(x[4],x[5],x[6]), c(x[7],x[8],x[9]) )))
cat("\n Function value: ", -f)
cat("\n Estimated parameters: ",
c(x[1],x[2],x[3]),c(x[4],x[5],x[6]),c(x[7],x[8],x[9]), "\n")
f
}
#generate 1000 samples from norMix mixture MW.nm2
x<-rnorMix(1000,MW.nm2)
hist(rnorMix(1000,MW.nm2))
# Give it a starting values
mu<-c(-0.2,0.4,0.9)
sigma<-c(0.2, 0.2, 0.6)
wt<-c(0.3,0.3,0.2)
r<-optim(c(mu,sigma,wt),mydistrmle,control=list(maxit=50000))
print(matrix(r$par,3,3))
print(MW.nm2)
You will have to add additional constraints on the "wt" etc. ( see
?constrOptim)
Further I don't think there is a guaranteed unique decomposition of
variance between the three gaussians. Just try different starting values
and also is there a reason to stop at 3 gaussians ?..
Krishna
Thanks. There is not a particular reason to stop at three Gaussians apart from the fact that I know that the data I will soon be given show a 3 modal distribution. Apart from implementing some extra constrains, are these sort of problems numerically stable? I get plenty of warnings when I run your code, although it executes. Are those warnings anything to worry about or do they simply mean that in some iterations the optimizer rejects the solution (though they seem to deal with dnorm())? Cheers Lorenzo