speed up subsetting with certain conditions

On 1/12/2011 2:52 PM, Duke wrote:
The Bioconductor project has many tools for dealing with 
sequence-related data. With the data

k <- read.table(textConnection(
"chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +"))

f <- read.table(textConnection(
"chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G"))

One might use the GenomicRanges package as

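library(GenomicRanges)
# k: ranges on V1 from V2 to V3, named by V4, with strand V6 and score V5
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
# f: single-base positions at V2, carrying the V3 and V4 columns along
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
# all (f, k) overlap pairs, in case the pairing itself is of interest
olaps <- findOverlaps(fgr, kgr)
# TRUE for each row of f that overlaps at least one range in k
idx <- countOverlaps(fgr, kgr) != 0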

resulting in

 > idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

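This will be fast, since the overlaps for all rows are found in a single 
vectorized call rather than in an R-level loop.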
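
To see what idx encodes, here is a rough base R cross-check on the same toy 
data. It is only a sketch of the logic, not how GenomicRanges does it 
internally: hit_one and idx_base are names made up for this example, strand 
is ignored (fgr has no strand, so it matches either), and the row-by-row scan 
would be slow on real data.

```r
# Same toy data as above, read with read.table(text = ...).
k <- read.table(text = "chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +")

f <- read.table(text = "chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G")

# hit_one() is a hypothetical helper: TRUE if position 'pos' on chromosome
# 'chr' falls inside any closed interval [V2, V3] of k on that chromosome.
hit_one <- function(chr, pos) {
  any(k$V1 == chr & k$V2 <= pos & pos <= k$V3)
}

# One logical value per row of f.
idx_base <- mapply(hit_one, as.character(f$V1), f$V2, USE.NAMES = FALSE)
print(idx_base)
```

On this data idx_base should agree with idx above. Each row of f is compared 
against every row of k, which is what makes this approach scale poorly.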

One could write foundY with as.data.frame(fgr[idx]) (maybe a little 
editing) but likely one would want to stay in R / Bioc and do something 
more interesting...
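
A minimal sketch of that last step, assuming foundY is meant to be a 
tab-delimited text file with the same four columns as f. The file name and 
location (out_file), the intermediate name found, and the choice of columns 
are assumptions for illustration, not something fixed by the packages.

```r
library(GenomicRanges)

k <- read.table(text = "chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +")

f <- read.table(text = "chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G")

kgr <- with(k, GRanges(V1, IRanges(V2, V3, names = V4), V6, score = V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width = 1), V3 = V3, V4 = V4))
idx <- countOverlaps(fgr, kgr) != 0

# as.data.frame() on a GRanges gives seqnames, start, end, width, strand,
# followed by the metadata columns (here V3 and V4).
found <- as.data.frame(fgr[idx])

# The "little editing": keep only the columns that were in f.
found <- found[, c("seqnames", "start", "V3", "V4")]

# Hypothetical output location; adjust to wherever foundY should live.
out_file <- file.path(tempdir(), "foundY.txt")
write.table(found, out_file, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = FALSE)
```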

For installation, see

http://bioconductor.org/install/index.html

Martin