Skip to content

separate mixture of gamma and normal (with mixtools or ??)

4 messages · Thomas Petzoldt, Bert Gunter, David Winsemius

#
Dear friends,

I am trying to separate bi- (and sometimes tri-) modal univariate 
mixtures of biological data, where the first component is left bounded 
(e.g. exponential or gamma) and the other(s) approximately Gaussian.

After checking several packages, I'm not really clear what to do. Here 
is an example with "mixtools" that already works quite good, however, 
the left component is not Gaussian (and not symmetric).

Any idea about a more adequate function or package for this problem?

Thanks a lot!

Thomas



library(mixtools)
set.seed(123)

lambda <- c(0.25, 0.75)
N      <- 200

## dist1 ~ gamma (or exponential as a special case)
#dist1 <- rexp(lambda[1]*N, 1)
dist1 <- rgamma(lambda[1]*N, 1, 1)

## dist2 ~ normal
dist2 <- rnorm(lambda[2]*N, 12, 2)

## mixture
x <- c(dist1, dist2)

mix <-  spEMsymloc(x, mu0=2, eps=1e-3, verbose=TRUE)
plot(mix, xlim=c(0, 25))
summary(mix)


--
Thomas Petzoldt
TU Dresden, Institute of Hydrobiology
http://www.tu-dresden.de/Members/thomas.petzoldt
#
Fitting multicomponent mixtures distributions -- and 3 is already a
lot of components -- is inherently ill-conditioned. You may need to
reassess your strategy. You might wish to post on stackexchange
instead to discuss such statistical issues.

Cheers,
Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, Jan 23, 2017 at 2:32 PM, Thomas Petzoldt <thpe at simecol.de> wrote:
#
Dear Bert,

thank you very much for your suggestion. You are right, ill-conditioning 
was sometimes a problem for 3 components, but not in the two-component 
case. The modes are well separated, and the sample size is high.

My main problem is (1) the shape of the distributions and (2) the 
diversity of available packages and approaches to this topic.

In the mean time I made some progress in this matter by treating the 
data as a mixture of gamma distributions (package mixdist, see below), 
so what I want to know is the purely R technical question ;-)

Has someone else has ever stumbled across something like this and can 
make a suggestion which package to use?

Thanks, Thomas


## Approximate an Exponential+Gaussian mixture with
##   a mixture of Gammas

library("mixdist")
set.seed(123)
lambda <- c(0.25, 0.75)
N <- 2000
x <- c(rexp(lambda[1]*N, 1), rnorm(lambda[2]*N, 20, 4))

xx <- mixgroup(x, breaks=0:40)
pp <- mixparam(mu=c(1, 8), sigma=c(1, 3), pi=c(0.2, 0.5))
mix <-  mix(xx, pp, dist="gamma", emsteps=10)

summary(mix)
p <- coef(mix)
beta <- with(p, sigma^2/mu)
alpha <- with(p, mu /beta)
lambda <- p$pi

plot(mix, xlim=c(0, 35))
x1 <- seq(0, 35, 0.1)
lines(x1, lambda[1]*dgamma(x1, alpha[1], 1/beta[1]),
   col="orange", lwd=2)
lines(x1, lambda[2]*dgamma(x1, alpha[2], 1/beta[2]),
   col="magenta", lwd=2)



Am 24.01.2017 um 00:34 schrieb Bert Gunter:
2 days later
#
In survival analysis of cancer cases, the question of cure comes up often. Physicians sometimes have a naive notion of survival to 5 years after definitive treatment with no evident recurrence being equivalent to 'cure', despite the fact that there is great heterogeneity in the recurrence and survival distribution of different cancer types. I have see papers in the medical statistical literature that used mixtures of Weibull variates to model this problem. The cancer-specific survival is often exponential (mu=1) or "sub-exponential" (shape < 1) whereas non-cancer survival times are "super-exponential" (shape >> 1). When I ran your second simulation with dist='weibull' I get:
Parameters:
      pi     mu  sigma
1 0.2492  1.016 0.8702
2 0.7508 20.079 4.2328

Standard Errors:
     pi.se   mu.se sigma.se
1 0.009683 0.04249  0.05137
2 0.009683 0.10972  0.06802


Analysis of Variance Table

          Df  Chisq Pr(>Chisq)    
Residuals 29 80.407   1.01e-06 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
pi        mu     sigma
1 0.2492477  1.016319 0.8702055
2 0.7507523 20.079093 4.2327769

So the exponential parameter at least is well-estimated. I believe that Weibull variates with  shape >> 1 are approximately normal, but I know that your mathematical sophistication exceeds mine by quite a bit, so consider this only a dilettante's comment.