Skip to content

[Bioc-devel] medianpolish for Affymetrix genechip probesets

3 messages · Wolfgang Huber, Ben Bolstad, Henrik Bengtsson

#
Hi Rafa, Ben et al.,

what is nowadays the prefered implementation of the "medianpolish" 
probeset summarization / expression value computation method, and which 
interface to it do you recommend others to use?

I have just been looking through affy, BufferedMatrixMethods, oligo, 
affyPLM, gcrma and found the choice exciting, stimulating and ... large.

The reason for the question: vsn can be used to do bg correction and 
normalization of Affy genechips together with medianpolish to summarize. 
In the vsn vignette I have so far been recommending to use

expresso(x, bg.correct=FALSE, normalize.method="vsn",
          pmcorrect.method="pmonly", summary.method="medianpolish")

but this can be very slow and I would like to provide a more efficient 
wrapper ? la rma() without reinventing the medianpolish wheel.

So far I gather that

1. I should use, in affy/src/rma2.c:
SEXP rma_c_call(SEXP PMmat, SEXP MMmat, SEXP ProbeNamesVec,
                 SEXP N_probes, SEXP norm_flag)
with norm_flag==0, and

2. that I need to exponentiate (2^x) the vsn-transformed data before 
sending it that way because the first thing you do there is to take log2.

Is this correct? Any pointers or comments are welcome.

Best wishes
   Wolfgang

------------------------------------------------------------------
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
#
Wolfgang,

The method you suggest would work, but I am a little hesitant to
recommend you .Call the c code directly. I'd prefer that you just call
rma() with the background and normalize arguments set to FALSE, though
that will add a little additional overhead, it won't do anything more
that the median polish summarization step. That is essentially the
approach taken in GCRMA (except that it sets background=FALSE,
normalize=TRUE in its rma() call).

However, you do raise a larger issue about the multiple implementations
in various different packages. Mostly the functionality that is
duplicated consists of the exact same C code in affy, affyPLM and oligo
(except sometimes a little outdated), with larger differences in
BufferedMatrixMethods. I did make a promise in the past that I would fix
this potential maintenance problem by building a package that unify this
C code in a single location. This way potential developers who wanted to
use say, quantile normalization or median polish, but did not want to
"Depends:" on affy (or similar). But building infrastructure is
decidedly un-sexy, so it is still on my to do list (though perhaps a
little higher now).

Best,

Ben
On Fri, 2007-03-09 at 14:31 +0000, Wolfgang Huber wrote:
1 day later
#
On 3/9/07, Ben Bolstad <bmb at bmbolstad.com> wrote:
I fully support your idea of providing one code base for all "basic"
modeling and operators used in BioC.   I think affyPLM is a good step
towards this.

May I suggest taking this even step further and provide a low-level
package for single-step modeling / transformations operating on "as
basic data types as possible"?  This package should be used by
developers only to be incorporated inside other functions.  By keeping
the API to use basic data types only (e.g. vectors, matrices, data
frames, lists), it will also be possible to keep it stable over a very
long time.  Such an API does relies much less on the current designs
(e.g. AffyBatch, eSet etc) and other packages (minimal number of
packages in DESCRIPTION's Depends).  It will therefore provide a
common code base that is more stable over time.

Advantages with a low-level API package for single-step methods:
* More likely to be reused.
* Easier to define/document exactly what is done and what is changed
between versions.
* Easier to provide compare algorithms and verify consistency between versions.
* Bug fixes in one package, not many.

This would apply to more packages than affyPLM.  Several of the
probe-level model (PLM) functions in affyPLM could be written so they
accept matrices only, and at a probeset by probeset basis.  For
example, fitPLM.matrix(X, flavor=c("rlm", "median.polish"), ...) could
accept a matrix of probe signals for a single probeset.  This would be
the API closest to the internal code (i.e. the .Call():s).
Higher-level packages could then build up/add fitPLM.AffyBatch(),
fitPLM.eSet() etc.  These higher level packages would evolve with BioC
as new classes/structures are introduced, but fitPLM.matrix() would
remain the same. [I'm using S3 notation because it is shorter to
write; analogously for S4].

I am aware that the above is somewhat the idea of Biobase, but I would
like to take it one step further and make it "more low-level".  For
instance, methods like rowQ(), rowMedians(), and rowMax() in Biobase
could be lifted out to a even lower-level package.  What I am
suggesting should basically never be used explicitly by the end user,
but will be very handy for developers.

Cheers

Henrik

PS. Ben, we've chatted about this offline before, but I would like to
bring it up in public too. DS.