Dear friends, I am trying to separate bi- (and sometimes tri-) modal univariate mixtures of biological data, where the first component is left bounded (e.g. exponential or gamma) and the other(s) approximately Gaussian. After checking several packages, I'm not really clear what to do. Here is an example with "mixtools" that already works quite good, however, the left component is not Gaussian (and not symmetric). Any idea about a more adequate function or package for this problem? Thanks a lot! Thomas library(mixtools) set.seed(123) lambda <- c(0.25, 0.75) N <- 200 ## dist1 ~ gamma (or exponential as a special case) #dist1 <- rexp(lambda[1]*N, 1) dist1 <- rgamma(lambda[1]*N, 1, 1) ## dist2 ~ normal dist2 <- rnorm(lambda[2]*N, 12, 2) ## mixture x <- c(dist1, dist2) mix <- spEMsymloc(x, mu0=2, eps=1e-3, verbose=TRUE) plot(mix, xlim=c(0, 25)) summary(mix) -- Thomas Petzoldt TU Dresden, Institute of Hydrobiology http://www.tu-dresden.de/Members/thomas.petzoldt
separate mixture of gamma and normal (with mixtools or ??)
4 messages · Thomas Petzoldt, Bert Gunter, David Winsemius
Fitting multicomponent mixtures distributions -- and 3 is already a lot of components -- is inherently ill-conditioned. You may need to reassess your strategy. You might wish to post on stackexchange instead to discuss such statistical issues. Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, Jan 23, 2017 at 2:32 PM, Thomas Petzoldt <thpe at simecol.de> wrote:
Dear friends, I am trying to separate bi- (and sometimes tri-) modal univariate mixtures of biological data, where the first component is left bounded (e.g. exponential or gamma) and the other(s) approximately Gaussian. After checking several packages, I'm not really clear what to do. Here is an example with "mixtools" that already works quite good, however, the left component is not Gaussian (and not symmetric). Any idea about a more adequate function or package for this problem? Thanks a lot! Thomas library(mixtools) set.seed(123) lambda <- c(0.25, 0.75) N <- 200 ## dist1 ~ gamma (or exponential as a special case) #dist1 <- rexp(lambda[1]*N, 1) dist1 <- rgamma(lambda[1]*N, 1, 1) ## dist2 ~ normal dist2 <- rnorm(lambda[2]*N, 12, 2) ## mixture x <- c(dist1, dist2) mix <- spEMsymloc(x, mu0=2, eps=1e-3, verbose=TRUE) plot(mix, xlim=c(0, 25)) summary(mix) -- Thomas Petzoldt TU Dresden, Institute of Hydrobiology http://www.tu-dresden.de/Members/thomas.petzoldt
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Dear Bert,
thank you very much for your suggestion. You are right, ill-conditioning
was sometimes a problem for 3 components, but not in the two-component
case. The modes are well separated, and the sample size is high.
My main problem is (1) the shape of the distributions and (2) the
diversity of available packages and approaches to this topic.
In the mean time I made some progress in this matter by treating the
data as a mixture of gamma distributions (package mixdist, see below),
so what I want to know is the purely R technical question ;-)
Has someone else has ever stumbled across something like this and can
make a suggestion which package to use?
Thanks, Thomas
## Approximate an Exponential+Gaussian mixture with
## a mixture of Gammas
library("mixdist")
set.seed(123)
lambda <- c(0.25, 0.75)
N <- 2000
x <- c(rexp(lambda[1]*N, 1), rnorm(lambda[2]*N, 20, 4))
xx <- mixgroup(x, breaks=0:40)
pp <- mixparam(mu=c(1, 8), sigma=c(1, 3), pi=c(0.2, 0.5))
mix <- mix(xx, pp, dist="gamma", emsteps=10)
summary(mix)
p <- coef(mix)
beta <- with(p, sigma^2/mu)
alpha <- with(p, mu /beta)
lambda <- p$pi
plot(mix, xlim=c(0, 35))
x1 <- seq(0, 35, 0.1)
lines(x1, lambda[1]*dgamma(x1, alpha[1], 1/beta[1]),
col="orange", lwd=2)
lines(x1, lambda[2]*dgamma(x1, alpha[2], 1/beta[2]),
col="magenta", lwd=2)
Am 24.01.2017 um 00:34 schrieb Bert Gunter:
Fitting multicomponent mixtures distributions -- and 3 is already a lot of components -- is inherently ill-conditioned. You may need to reassess your strategy. You might wish to post on stackexchange instead to discuss such statistical issues. Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Mon, Jan 23, 2017 at 2:32 PM, Thomas Petzoldt <thpe at simecol.de> wrote:
Dear friends, I am trying to separate bi- (and sometimes tri-) modal univariate mixtures of biological data, where the first component is left bounded (e.g. exponential or gamma) and the other(s) approximately Gaussian. After checking several packages, I'm not really clear what to do. Here is an example with "mixtools" that already works quite good, however, the left component is not Gaussian (and not symmetric). Any idea about a more adequate function or package for this problem? Thanks a lot! Thomas library(mixtools) set.seed(123) lambda <- c(0.25, 0.75) N <- 200 ## dist1 ~ gamma (or exponential as a special case) #dist1 <- rexp(lambda[1]*N, 1) dist1 <- rgamma(lambda[1]*N, 1, 1) ## dist2 ~ normal dist2 <- rnorm(lambda[2]*N, 12, 2) ## mixture x <- c(dist1, dist2) mix <- spEMsymloc(x, mu0=2, eps=1e-3, verbose=TRUE) plot(mix, xlim=c(0, 25)) summary(mix) -- Thomas Petzoldt TU Dresden, Institute of Hydrobiology http://www.tu-dresden.de/Members/thomas.petzoldt
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
2 days later
On Jan 23, 2017, at 11:47 PM, Thomas Petzoldt <thpe at simecol.de> wrote: Dear Bert, thank you very much for your suggestion. You are right, ill-conditioning was sometimes a problem for 3 components, but not in the two-component case. The modes are well separated, and the sample size is high. My main problem is (1) the shape of the distributions and (2) the diversity of available packages and approaches to this topic. In the mean time I made some progress in this matter by treating the data as a mixture of gamma distributions (package mixdist, see below), so what I want to know is the purely R technical question ;-) Has someone else has ever stumbled across something like this and can make a suggestion which package to use?
In survival analysis of cancer cases, the question of cure comes up often. Physicians sometimes have a naive notion of survival to 5 years after definitive treatment with no evident recurrence being equivalent to 'cure', despite the fact that there is great heterogeneity in the recurrence and survival distribution of different cancer types. I have see papers in the medical statistical literature that used mixtures of Weibull variates to model this problem. The cancer-specific survival is often exponential (mu=1) or "sub-exponential" (shape < 1) whereas non-cancer survival times are "super-exponential" (shape >> 1). When I ran your second simulation with dist='weibull' I get:
library("mixdist")
set.seed(123)
lambda <- c(0.25, 0.75)
N <- 2000
x <- c(rexp(lambda[1]*N, 1), rnorm(lambda[2]*N, 20, 4))
xx <- mixgroup(x, breaks=0:40)
pp <- mixparam(mu=c(.5, 8), sigma=c(1, 10), pi=c(0.2, 0.5))
mix <- mix(xx, pp, dist="weibull", emsteps=10)
summary(mix)
Parameters:
pi mu sigma
1 0.2492 1.016 0.8702
2 0.7508 20.079 4.2328
Standard Errors:
pi.se mu.se sigma.se
1 0.009683 0.04249 0.05137
2 0.009683 0.10972 0.06802
Analysis of Variance Table
Df Chisq Pr(>Chisq)
Residuals 29 80.407 1.01e-06 ***
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
p <- coef(mix) p
pi mu sigma 1 0.2492477 1.016319 0.8702055 2 0.7507523 20.079093 4.2327769 So the exponential parameter at least is well-estimated. I believe that Weibull variates with shape >> 1 are approximately normal, but I know that your mathematical sophistication exceeds mine by quite a bit, so consider this only a dilettante's comment.
David.
>
> Thanks, Thomas
>
>
> ## Approximate an Exponential+Gaussian mixture with
> ## a mixture of Gammas
>
> library("mixdist")
> set.seed(123)
> lambda <- c(0.25, 0.75)
> N <- 2000
> x <- c(rexp(lambda[1]*N, 1), rnorm(lambda[2]*N, 20, 4))
>
> xx <- mixgroup(x, breaks=0:40)
> pp <- mixparam(mu=c(1, 8), sigma=c(1, 3), pi=c(0.2, 0.5))
> mix <- mix(xx, pp, dist="gamma", emsteps=10)
>
> summary(mix)
> p <- coef(mix)
> beta <- with(p, sigma^2/mu)
> alpha <- with(p, mu /beta)
> lambda <- p$pi
>
> plot(mix, xlim=c(0, 35))
> x1 <- seq(0, 35, 0.1)
> lines(x1, lambda[1]*dgamma(x1, alpha[1], 1/beta[1]),
> col="orange", lwd=2)
> lines(x1, lambda[2]*dgamma(x1, alpha[2], 1/beta[2]),
> col="magenta", lwd=2)
>
>
>
> Am 24.01.2017 um 00:34 schrieb Bert Gunter:
>> Fitting multicomponent mixtures distributions -- and 3 is already a
>> lot of components -- is inherently ill-conditioned. You may need to
>> reassess your strategy. You might wish to post on stackexchange
>> instead to discuss such statistical issues.
>>
>> Cheers,
>> Bert
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Mon, Jan 23, 2017 at 2:32 PM, Thomas Petzoldt <thpe at simecol.de> wrote:
>>> Dear friends,
>>>
>>> I am trying to separate bi- (and sometimes tri-) modal univariate mixtures
>>> of biological data, where the first component is left bounded (e.g.
>>> exponential or gamma) and the other(s) approximately Gaussian.
>>>
>>> After checking several packages, I'm not really clear what to do. Here is an
>>> example with "mixtools" that already works quite good, however, the left
>>> component is not Gaussian (and not symmetric).
>>>
>>> Any idea about a more adequate function or package for this problem?
>>>
>>> Thanks a lot!
>>>
>>> Thomas
>>>
>>>
>>>
>>> library(mixtools)
>>> set.seed(123)
>>>
>>> lambda <- c(0.25, 0.75)
>>> N <- 200
>>>
>>> ## dist1 ~ gamma (or exponential as a special case)
>>> #dist1 <- rexp(lambda[1]*N, 1)
>>> dist1 <- rgamma(lambda[1]*N, 1, 1)
>>>
>>> ## dist2 ~ normal
>>> dist2 <- rnorm(lambda[2]*N, 12, 2)
>>>
>>> ## mixture
>>> x <- c(dist1, dist2)
>>>
>>> mix <- spEMsymloc(x, mu0=2, eps=1e-3, verbose=TRUE)
>>> plot(mix, xlim=c(0, 25))
>>> summary(mix)
>>>
>>>
>>> --
>>> Thomas Petzoldt
>>> TU Dresden, Institute of Hydrobiology
>>> http://www.tu-dresden.de/Members/thomas.petzoldt
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA