Skip to content

[Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

4 messages · Ryan, Valerie Obenchain

#
Thanks.

I have some concerns about the *ExtraArgs() functions. Passing flexible 
args to findOverlaps in the existing mode functions fundamentally 
changes the documented behavior. The modes were created to capture 
specific overlap situations pertaining to gene features which are 
graphically depicted in the vignette. Changing 'maxgap' or 'minoverlap' 
will produce a variety of results inconsistent with past behavior and 
difficult to document (e.g., under what circumstances will 
IntersectionNotEmpty now register a hit).

I agree that controlling the overlap args is appealing and I like the 
added ability to resize. I've created a 'chipseq' mode that combines 
these ideas and gives what ResizeReads() was doing but now in 'mode' 
form. If this implementation gives you the flexibility you were looking 
for I'll check it into devel.

A couple of questions:

- Do you want to handle paired-end reads? You coerce to a GRanges to 
resize but don't coerce back.

- Do you want to require strand info for all reads? Is this because of 
how resize() anchors "*" to 'start'?
To count the overlaps of 5' and 3' ends:

summarizeOverlaps(reads, features, chipseq, fix="start", width=1)
summarizeOverlaps(reads, features, chipseq, fix="end", width=1)


Valerie
On 04/30/2014 02:41 PM, Ryan C. Thompson wrote:

  
    
#
Hi Valerie,
On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:
Well, I wasn't so sure about those functions either. Obviously you can 
pass arguments that break things. They were mostly designed to be 
constructors for specific counting modes involving the minoverlap/maxgap 
arguments, but I decided I didn't need those modes after all. They're 
certainly not designed to be exposed to the user. I haven't carefully 
considered the interaction between the counting mode and 
maxgap/minoverlap, but I believe that it would be roughly equivalent to 
extending/shrinking the features/reads by the specified amount (with 
some differences for e.g. a feature/read smaller than 2*minoverlap). For 
example, with a read length of 100 and a minoverlap of 10 in Union 
counting mode, this would be the same as truncating the first and last 
10 (or mabe 9?) bases and operating in normal Union mode. As I said, 
though, there may be edge cases that I haven't thought of where 
unexpected things happen.
This sounds nice, but if I use the 'chipseq' mode, how do I specify 
whether I want Union, IntersectionNotEmpty, or IntersectionStrict? It 
looks like it just does Union? IntersectionStrict would be useful for 
specifying that the read has to occur entirely within the bounds of a 
called peak, for example. This is why I implemented it as a "wrapper" 
that takes another mode as an argument, so that the resizing logic and 
the counting logic were independent. Maybe summarizeOverlaps could 
accept an optional "read modification function", and if this is 
provided, it will pass the reads through this before passing them to the 
counting function. The read modification function would have to take any 
valid reads argument and return another valid reads argument. It could 
be used for modifying the reads as well as filtering them. This would 
allow resizing without the awkward nesting method that I've used.
For paired end reads, there is no need to estimate the fragment length, 
because the pair gives you both ends of the fragment. So if I had 
paired-end ChIP-Seq data, I would use it as is with no resizing. I can't 
personally think of a reason to resize a paired-end fragment, but I 
don't know if others might need that.

I corece to GRanges because I know how GRanges work, but I'm not as 
familiar with GAlignments so I don't know how the resize function works 
on GAlignments and other classes. I'm sure you know better than I do how 
these work. If the coercion is superfluous, feel free to eliminate it.
Yes, I require strand info for all reads because the reads must be 
directionally extended, which requires strand info. Ditto for counting 
the 5-prime and 3-prime ends.

-Ryan
#
On 05/01/2014 02:05 PM, Ryan wrote:
'chipseq' didn't implement the standard modes b/c I wanted to avoid the 
clash of passing unconventional findOverlaps() args to those modes. The 
assumption was that if the user specified a mode they would expect a 
certain behavior ...

Maybe summarizeOverlaps could
Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow massaging 
before counting. I'll post back when it's done.

Valerie
#
GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This should 
be a function where the first argument is 'reads' and the return value 
is compatible with 'reads' in the pre-defined count modes.

I've used your ResizeReads() as an example in the man page. I think the 
ability to pre-filter and used a pre-defined mode will be useful. Thanks 
for the suggestion.

Valerie
On 05/01/2014 02:29 PM, Valerie Obenchain wrote: