speed up subsetting with certain conditions
On 1/12/11 6:12 PM, Martin Morgan wrote:
The Bioconductor project has many tools for dealing with sequence-related data. With the data

  k <- read.table(textConnection(
  "chr1 3237546 3237547 rs52310428 0 +
  chr1 3237549 3237550 rs52097582 0 +
  chr2 4513326 4513327 rs29769280 0 +
  chr2 4513337 4513338 rs33286009 0 +"))
  f <- read.table(textConnection(
  "chr1 3213435 G C
  chr1 3237547 T C
  chr1 3237549 G T
  chr2 4513326 A G
  chr2 4513337 C G"))

One might use the GenomicRanges package as

  library(GenomicRanges)
  kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
  fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
  olaps <- findOverlaps(fgr, kgr)
  idx <- countOverlaps(fgr, kgr) != 0

resulting in

  idx
  [1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast.
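The olaps object is computed above but not otherwise used. As a minimal sketch, assuming a current GenomicRanges where the queryHits() and subjectHits() accessors are available, it could be used to pair each matching row of f with the rs identifier it overlaps. The hits data frame and its rsid column are names introduced here for illustration only.

```r
library(GenomicRanges)

# Same toy data as in the message, read with the 'text' argument
k <- read.table(text = "chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +", stringsAsFactors = FALSE)
f <- read.table(text = "chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G", stringsAsFactors = FALSE)

kgr <- with(k, GRanges(V1, IRanges(V2, V3, names = V4), V6, score = V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width = 1), V3 = V3, V4 = V4))

# Each hit pairs a row of fgr (query) with a row of kgr (subject)
olaps <- findOverlaps(fgr, kgr)

# Illustrative: attach the overlapping rs identifier to each matching row of f
hits <- data.frame(f[queryHits(olaps), ],
                   rsid = names(kgr)[subjectHits(olaps)])
print(hits)
```

Rows of f with no overlap (here the first one) simply do not appear in hits.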
Thanks so much for your suggestion Martin. I had Bioconductor installed but I honestly do not know all its applications. Anyway, I am testing GenomicRanges with my data now. I will report back when I get the result.
One could write foundY with as.data.frame(fgr[idx]) (maybe a little editing) but likely one would want to stay in R / Bioc and do something more interesting...
I suppose foundN <- as.data.frame(fgr[!idx]) and foundY <- as.data.frame(fgr[idx]), as you suggested, but I don't really understand your last comment :). Thanks, D.
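For reference, a sketch of both options on the toy data from this thread. The column selection is one guess at what "a little editing" might mean, and the per-chromosome count is only one possible reading of "staying in R / Bioc"; neither is confirmed by the original message.

```r
library(GenomicRanges)

k <- read.table(text = "chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +", stringsAsFactors = FALSE)
f <- read.table(text = "chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G", stringsAsFactors = FALSE)

kgr <- with(k, GRanges(V1, IRanges(V2, V3, names = V4), V6, score = V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width = 1), V3 = V3, V4 = V4))
idx <- countOverlaps(fgr, kgr) != 0

# Option 1: leave Bioconductor and go back to plain data frames.
# as.data.frame() on a GRanges adds seqnames/start/end/width/strand columns,
# so keep only the columns that correspond to the original layout of f.
keep <- c("seqnames", "start", "V3", "V4")
foundY <- as.data.frame(fgr[idx])[, keep]
foundN <- as.data.frame(fgr[!idx])[, keep]

# Option 2 (assumed interpretation): keep working on the GRanges objects,
# e.g. count the matched positions per chromosome.
perChrom <- table(as.character(seqnames(fgr[idx])))
print(perChrom)
```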