
Mixed model (with interaction) for gene expression and iteration

9 messages · Douglas Bates, Juan Pedro Steibel, Rolf Turner +1 more

#
Dear Douglas and list,

I am thinking about fitting a mixed model for a microarray experiment 
using lme4, since other, more specific software does not seem suitable 
for my purposes. I'll briefly describe the model and kindly ask for 
suggestions on the model and the workflow I can use to get useful results.

My response variable Y is gene expression levels for a given gene (say 
g_i) from 120 samples.
The factors I want to include are:

Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.

I am interested in:
- whether the gene is differentially expressed between the two sexes 
(effect of sex), among the 15 lines (effect of line), and in the 
interaction of the two.

- the variance component of line = how much of the variance is due to 
the genotype

- the variance component of the interaction = the genetic variation for 
sexual dimorphism.

Reading a bit of this mailing list, I came up with these three models:

m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))

or

m2 <- lmer(Y1 ~ sex + (sex|line))

or

m3 <- lmer(Y1 ~ sex + (0 + sex|line))

These should all be the same model (and indeed they all have the same 
residuals) but with different parameterizations (see the self-contained 
example below).

Now, in the light of my needs (see above), which model makes it easiest 
to extract the components I need? Also, do they make different 
assumptions, such as different degrees of independence among the levels 
of the random factors?

I will need to be able to extract the variance component values in an 
iterative process (I have 18,000 genes): is VarCorr() the way to go?

VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]

Two last questions: what is the easiest way to assess normality of 
residuals in an iterative process, and what is a sensible way to assess 
significant differential expression of genes (since I guess I can't get 
p-values and then apply FDR correction)?

Thanks a lot for reading so far and I'll be grateful for any kind of help.
paolo

Self-contained example:

Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
library(lme4)  # needed for lmer() and VarCorr()
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]

#
On Tue, Jun 2, 2009 at 7:53 AM, Paolo Innocenti
<paolo.innocenti at ebc.uu.se> wrote:
Not really.  m2 and m3 are different parameterizations of the same
model but m1 is different.  In models m2 and m3 the random effects for
F and for M are allowed to be correlated within a line.  Thus there
are three parameters determining the variance-covariance of the random
effects.  In model m1 the unconditional distribution of the random
effects for the sex:line interaction is independent of the
distribution of the random effects for line.
I would use either m1 or m3 because I find the correlation structure
in m3 easier to understand than that of m2.
I think so.  I would definitely look into using refit to update the
fitted model rather than starting from scratch with each of the genes.

I would use the idiom (the quoted code has been dropped from this 
archive) to get the estimated variances; the result is a named vector:

   sex:line        line
0.002323722 0.016939242
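A minimal sketch of one such idiom (not necessarily the one Doug had in mind) for pulling the variance components out of the fit as a named vector; the response here is simulated noise, just to make the block self-contained:

```r
library(lme4)

# toy data with the same structure as the example in this thread
set.seed(1)
sex  <- factor(rep(c("F", "M"), 15, each = 4))
line <- factor(rep(1:15, each = 8))
Y1   <- rnorm(120, mean = 11.7, sd = 0.3)

m1 <- lmer(Y1 ~ sex + (1 | line) + (1 | sex:line))

# VarCorr() returns one covariance matrix per random term;
# element [1] of each is the estimated variance
vcs <- sapply(VarCorr(m1), function(vc) vc[1])
vcs["sex:line"]
vcs["line"]
```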
#
On 3/06/2009, at 12:53 AM, Paolo Innocenti wrote:
<snip>
<snip>

	I don't think that the first two models are indeed the same model.

	My understanding --- which is very limited --- is that

		lmer(Y1 ~ sex + (1|line) + (1|sex:line))

	gives a model in which the effect of line j on sex 1 (say X_1j) and
	the effect of line j on sex 2 (say X_2j) are uncorrelated.  I.e.
	X_1j and X_2j have covariance matrix of the form sigma^2 * I, where I
	is the 2 x 2 identity.  Thus one random-effect parameter is contributed.

	In contrast,

		lmer(Y1 ~ sex + (sex|line))

	gives a model which allows correlation between X_1j and X_2j, i.e. their
	covariance matrix is a ``general'' 2 x 2 positive definite matrix.
	Thus three random-effect parameters are contributed.

	See

		http://www.nabble.com/lme-nesting-interaction-advice-td17131600i20.html#a17213604

	for the posting from Doug Bates upon which I am basing my understanding.

	Compare:

		lmer(score ~ Machine + (1|Worker/Machine), Machines)
	and

		lmer(score ~ Machine + (Machine|Worker), Machines)

	with your proposed models.
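For comparison, a runnable sketch of the two Machines fits (assuming lme4 is available and the Machines data can be loaded from the nlme package):

```r
library(lme4)
data(Machines, package = "nlme")  # Machines ships with nlme

# (1|Worker/Machine): Worker and Worker:Machine effects independent,
# each with its own variance -> 2 variance parameters (plus sigma^2)
fm1 <- lmer(score ~ Machine + (1 | Worker/Machine), Machines)

# (Machine|Worker): a general 3x3 covariance matrix for the three
# Machine effects within Worker -> 6 variance-covariance parameters
fm2 <- lmer(score ~ Machine + (Machine | Worker), Machines)

VarCorr(fm1)
VarCorr(fm2)
```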

	I hope that I have not misinterpreted Prof. Bates' explanation.

		cheers,

			Rolf Turner

#
Hello Paolo,
We are also starting to use lmer for gene expression analysis (genetical 
genomics too) so here are my thoughts.

Having two equivalent parameterizations, I would go with the 
computationally fastest one (a couple of extra milliseconds per model 
easily adds up to many hours when analyzing high-dimensional datasets). 
You can time the analysis for, say, 100 transcripts and go from there.

~note: other users commented on your models not being equivalent~

For p-values:

You could use LRT to test for variance components.

This is standard practice in genetic epidemiology: fit a model with and 
without the random effect in question, then compare twice the log 
(residual) likelihood ratio to a chi-square statistic. FDR can be used 
on top of that to account for multiple tests. Of course, now you have to 
fit three models (one null model for each VC), so your CPU time has just 
multiplied by almost 3. Using refit and update will definitely help with 
compute time when fitting so many models.
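A minimal sketch of that workflow for one variance component (all object names here are illustrative; the model structures are fitted once, then refit() swaps in each gene's response):

```r
library(lme4)

set.seed(1)
sex  <- factor(rep(c("F", "M"), 15, each = 4))
line <- factor(rep(1:15, each = 8))

# fit the two model structures once, with ML so log-likelihoods compare
y0   <- rnorm(120)
full <- lmer(y0 ~ sex + (1 | line) + (1 | sex:line), REML = FALSE)
null <- lmer(y0 ~ sex + (1 | line),                  REML = FALSE)

# LRT statistic for the sex:line variance component of one gene
lrt_one_gene <- function(y) {
  as.numeric(2 * (logLik(refit(full, y)) - logLik(refit(null, y))))
}

# iterate over a (toy) expression matrix: one row per gene
expr <- matrix(rnorm(5 * 120), nrow = 5)
lrts <- apply(expr, 1, lrt_one_gene)
```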

I use sobj<-summary(result) as an intermediate step to get the info 
(although this may add to the computational burden and other suggestions 
you got may be faster and more efficient),

then ask for slots:
@REmat
@coefs
...for getting estimates of variance components and fixed effects.

@coefs gives you a t-statistic for fixed effects too... you could take a 
stab at an approximate-df method and compute an (approximate) p-value.
I know that doing so can attract a lot of criticism on this list, but 
when you are fitting several million models (10000s of transcripts and 
1000s of genomic positions, as in my case), the MCMC approximation is 
(unfortunately) not computationally feasible.
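For what it's worth, current lme4 versions no longer expose the @REmat/@coefs slots on the summary object; a sketch of roughly equivalent accessors (toy data, illustrative names):

```r
library(lme4)

set.seed(1)
d <- data.frame(y    = rnorm(120),
                sex  = factor(rep(c("F", "M"), 15, each = 4)),
                line = factor(rep(1:15, each = 8)))
m <- lmer(y ~ sex + (1 | line) + (1 | sex:line), data = d)

as.data.frame(VarCorr(m))  # variance components (roughly the old @REmat)
coef(summary(m))           # fixed effects with t values (roughly @coefs)
```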


Hope this helps!
JP
Paolo Innocenti wrote:
#
Dear Douglas, Rolf, Juan and list,

thank you very much for your replies.
I now got a good working model, and the use of refit and VarCorr will 
definitely help.

I had a go with mcmcsamp(), and I must confirm that this approach is not 
feasible, both computationally and because if you get a "false 
convergence" for, say, 1 gene out of 20, it becomes impossible to go 
back and fix all the errors.

So, the alternative approach seems more promising. If I understand 
correctly, you suggest calculating a p-value for the random effects from 
the LRT (likelihood ratio test), and using approximate DFs to calculate 
standard p-values for the fixed effects.

I never used this approach, so I would appreciate it if you could point 
me in the right direction.

Random effects:
I'd need to compare these three models:
m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m1b <- lmer(Y1 ~ sex + (1|line))
m1c <- lmer(Y1 ~ sex)

and get the effect of the interaction from m1a-m1b,
and the effect of "line" from m1b-m1c.
Is that correct?
Can anyone point me to any kind of documentation/examples to sort out 
the details?
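The comparisons described above can be sketched as follows (an assumption-laden sketch: ML fits keep the log-likelihoods comparable, and since a fixed-effects-only model cannot be fit with lmer, plain lm() stands in for m1c; the response is simulated):

```r
library(lme4)

set.seed(1)
sex  <- factor(rep(c("F", "M"), 15, each = 4))
line <- factor(rep(1:15, each = 8))
Y1   <- rnorm(120, mean = 11.7, sd = 0.3)

m1a <- lmer(Y1 ~ sex + (1 | line) + (1 | sex:line), REML = FALSE)
m1b <- lmer(Y1 ~ sex + (1 | line),                  REML = FALSE)
m1c <- lm(Y1 ~ sex)  # no random effects, so plain lm()

lrt_interaction <- as.numeric(2 * (logLik(m1a) - logLik(m1b)))  # sex:line VC
lrt_line        <- as.numeric(2 * (logLik(m1b) - logLik(m1c)))  # line VC
```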

Fixed effects:
I don't really know where to get the approximated degrees of freedom. 
Can you point me to an example?

Thanks again for all the help. Eventually, I'll be really happy to share 
my experience/code when everything is sorted, even if I doubt I can add 
anything helpful.

Best,
paolo

PS. I don't really understand what you mean by
sobj<-summary(result)
what object is your "result" here?
Juan Pedro Steibel wrote:
#
Sorry for replying to my own emails, but I found partial solution to my 
problems in Pinheiro and Bates at:

2.4.1 - Likelihood Ratio Tests (p.83)
to calculate logLik and p.values for random effects

and

2.4.2 - Hypothesis tests for Fixed-Effect Terms (p.87)
that is self-explanatory. Although I still don't understand how to 
calculate the "denominator degrees of freedom".

Best,
paolo

PS. Off-topic quick question: is there currently an "out-of-the-box" 
solution to get credible intervals for random effects from mcmcsamp()? 
HPDinterval (from the lme4 package) doesn't seem to work for that.
Paolo Innocenti wrote:
#
Hello paolo,
sobj<-summary(result), where result is m1a, m1b or m1c.

Note that m1c is actually a fixed-effects model and cannot be fit with 
lmer, so you'll need to reconstruct the -2LL from another source. But 
with a normal model that should not be a problem.

Fabian Scheipl answered in a post that the LRT chi-square(1) was wrong 
and said that at least a mixture of chi-squares should be used. I did 
not explain myself very well, but I meant to recommend the mixture of a 
chi-square and a point mass when testing a single variance component. 
Fabian's other method could be used, provided it does not take eons to 
get the results for a high-dimensional dataset.
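For a single variance component tested on its boundary, the 50:50 mixture of a point mass at zero and a chi-square(1) amounts to halving the naive chi-square(1) p-value; a sketch in base R:

```r
# p-value under the 50:50 mixture of chi-square(0) (a point mass at 0)
# and chi-square(1), for one variance component tested on the boundary
pval_mixture <- function(lrt) 0.5 * pchisq(lrt, df = 1, lower.tail = FALSE)

# half the naive chi-square(1) p-value, e.g.:
pval_mixture(3.841459)  # naive p would be ~0.05; the mixture p is ~0.025
```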

But keep reading and you will find a good reason to hammer on me for 
recommendations on df and computing p-values 8^D

Probably the whole point of having mcmcsamp was to avoid having to 
answer that question on ddfs...
Having said that, my approach in a particular project we are working on 
is to use ddf = n - p (n = length(y), p = ncol(X), X = the full-rank 
design matrix of the fixed effects). I'll run and take cover now... 8^D
Seriously, in our particular design such an approximation does not seem 
to be way off, as I checked it against mcmcsamp for a handful of 
transcripts.
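That crude approximation, as a sketch (the t value and the dimensions here are illustrative, not from a real fit):

```r
# approximate p-value for a fixed-effect t statistic using ddf = n - p
n     <- 120   # number of observations
p     <- 2     # columns of the full-rank fixed-effects design matrix
tstat <- 2.5   # illustrative t value from an lmer coefficient table

pval <- 2 * pt(abs(tstat), df = n - p, lower.tail = FALSE)
pval
```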

And finally, a more existential question:
Paolo, do you really need a p-value?
Sometimes in gene expression analysis we only need to rank genes for 
evidence of differential expression.
In those cases, you may well rank them using the t-statistic or the LRT 
(not their p-values), especially if you have the same basic design 
structure across all genes or transcripts (which you probably do).
For example if you want to do enrichment of GO terms after the mm 
analysis you could do a Kolmogorov-Smirnov type of GSE using the 
t-statistics (or LRTs) to rank genes.

Hope this makes sense?
Thanks!
JP
Paolo Innocenti wrote:
#
Dear Doug,

Two issues have arisen in respect of this thread that I would like to
comment upon.

(1) In my previous posting I included the link to Doug's explanation of 
the lmer() syntax:

http://www.nabble.com/lme-nesting-interaction-advice-td17131600i20.html#a17213604

I found this explanation incredibly useful, and apparently others have 
as well. (I was thanked off-list for posting this link.)

Could, perhaps, the content of this link, in appropriately edited form, 
be included in the documentation for lmer(), or perhaps in a vignette 
(to which the documentation might direct the reader)?  To save on 
frustrating searching for those who are coming in to this area fresh.
(2) In one of his posts Juan Pedro Steibel noted that m1c is actually a
fixed-effects model and cannot be fit with lmer, so the -2LL must be
reconstructed from another source.

He goes on to say that with a normal model that should not be a problem.
Hah!  Not a problem for some, maybe.  But for those of us whose 
initially few brain cells have been decimated by alcohol ( :-) ) it is 
a problem.

Might I make a ``feature request'' that it be made possible to fit 
models with fixed effects only in lmer()?  (Basically for convenience of 
testing whether random effects are ``useful''.)  I personally would also 
be comforted by being able to do things like

	fit1 <- lmer(y ~ x, REML=TRUE)
	fit2 <- lmer(y ~ x, REML=FALSE)

and compare the resulting estimates of sigma^2 (which presumably should 
differ by a factor of n/(n-2) if I'm understanding things correctly). 
Being able to do this would give me the comforting illusion ( :-) ) that 
I ***am*** understanding things correctly.  Or perhaps disabuse me of 
this illusion.
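The n/(n-2) factor can be checked without lmer: for a fixed-effects-only model, ML and REML estimates of sigma^2 reduce to RSS/n and RSS/(n-p). A sketch with p = 2 (intercept plus slope) on simulated data:

```r
# REML vs ML residual variance in the fixed-effects-only case, via lm()
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)

sigma2_reml <- rss / (n - 2)  # REML divides by n - p
sigma2_ml   <- rss / n        # ML divides by n
sigma2_reml / sigma2_ml       # equals n / (n - 2)
```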

Would it be a Herculean task to adapt lmer() to have this capacity? 
Naively I would have thought that if a model with fixed and random 
effects can be fitted, then surely a (simpler) model with only fixed 
effects can be fitted.  OTOH the fixed-effects-only scenario is in some 
sense ``on the boundary'', and boundary cases can be vexatious.

Not knowing how lmer() works, I have no idea how hard or easy making 
such a change would be.

I would appreciate hearing your thoughts on this.

	cheers,

		Rolf

#
Hi all,

thanks again to all of you for the replies.
I took Fabian's suggestion (thanks for the effort you made to make your 
point! amazing) but unfortunately for my purposes it requires too much 
computation time (in terms of models to be fitted and evaluation time). 
At the moment I am happy with

1-(0.5 + 0.5*pchisq(tempobj$Chisq[2], df=1))

for the p-values for the random effects.
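For context, one way a table like tempobj could arise (an assumption on my part; tempobj's origin isn't shown in the thread) is from anova() on the nested fits, whose Chisq column holds the LRT statistic:

```r
library(lme4)

set.seed(1)
sex  <- factor(rep(c("F", "M"), 15, each = 4))
line <- factor(rep(1:15, each = 8))
Y1   <- rnorm(120)

m1a <- lmer(Y1 ~ sex + (1 | line) + (1 | sex:line))
m1b <- lmer(Y1 ~ sex + (1 | line))

tempobj <- anova(m1b, m1a)  # anova() refits with ML and reports the LRT
# boundary-corrected p-value for the sex:line variance component:
1 - (0.5 + 0.5 * pchisq(tempobj$Chisq[2], df = 1))
```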

For the p-values for fixed effects: yes, I do really need them. They are 
not really important for any one specific gene, but I need to be able to 
say "around 30% of the transcriptome is differentially expressed".

In my case, the effect I am testing is pretty big, my sample size is 
pretty large (120 observations, 60 per group - M/F), completely balanced 
and I can afford to be a bit conservative. That said, I found this link:

http://wiki.r-project.org/rwiki/doku.php?id=guides:lmer-tests

where Doug discusses using a function hatTrace() to calculate the trace 
and then the degrees of freedom (denDF = n - "trace", I think) and feed 
it to

1 - pf(F.ratio, numDF, ***)

It seems to be what I am looking for, but I can't find the function 
hatTrace...
Does anyone know how I could estimate my denDF? Any suggestion is most 
welcome (JP, I have some problems understanding your lingo... =)
Pinheiro and Bates says:

denDF = m_i - (m_(i-1) + p_i),  i = 1,...,Q+1

but I don't really understand what it means.

Thanks again to everyone!
Best,
paolo
Juan Pedro Steibel wrote: