[Bioc-devel] Rsamtools applyPileups function not merging positions from multiple files if not identical
On 04/21/2014 02:33 PM, Jonathon Hill wrote:
Hi,
I have been trying to use Rsamtools? applyPileups function to compare two files position-by-position. In order to test it out, I simply ran:
minBaseQuality <- 20
minMapQuality <- 30
minDepth <- 10
maxDepth <- 1000
testparam <- PileupParam(what="seq",
which=GRanges(?chr20", IRanges(1, 1000000)),
minBaseQuality=minBaseQuality,
minMapQuality=minMapQuality,
minDepth=minDepth,
maxDepth=maxDepth,
)
fl1 <- "10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
fl2 <- "10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
My understanding is that this should result in a three-dimensional array with ?ACTGN Counts? in the first dimension, files in the second and position in the third. Positions with overlapping reads in both files should thus be collapsed into a single line in the third dimension. However, selecting one of these positions shows that they are duplicated:
r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]
I think your understanding is basically correct.
The function is assuming that the BAM files are sorted by position (with, e.g.,
sortBam, but the files don't have to be sorted by Rsamtools).
Executing a similar command gives me
> str(r3[[1]])
List of 3
$ seqnames: Named int 211195
..- attr(*, "names")= chr "chr20"
$ pos : int [1:211195] 60026 60027 60028 60029 60030 60031 60032 60033
60034 60035 ...
$ seq : int [1:5, 1:2, 1:211195] 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "dimnames")=List of 3
.. ..$ : chr [1:5] "A" "C" "G" "T" ...
.. ..$ : chr [1:2] "normal_srx113635_sorted.bam" "tumor_srx036691_sorted.bam"
.. ..$ : NULL
Do you get something similar, especially the identical seqnames, pos dimension,
and third dimension of seq? 'pos' should apparently be unique; so
> any(duplicated(r3[[1]][["pos"]]))
[1] FALSE
If there are duplicates, I wonder how many there are and where they occur
pos = r3[[1]][["pos"]]
table(table(pos))
udpos = unique(pos[duplicated(pos)])
head(pos[match(pos, udpos)], 20)
head(match(pos, udpos), 20)
If nothing is suggested by the above, can you make a subset of the BAM files
available to me, e.g., the result of
param = ScanBamParam(which=GRanges("chr20", IRanges(1, 1000000)))
filterBam(fls[1], tempfile(), param=param)
Thanks,
Martin
yields: , , 1 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam A 0 0 C 0 0 G 10 0 T 0 0 N 0 0 , , 2 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam A 0 0 C 0 0 G 0 13 T 0 0 N 0 0 Even though the position is the same, it is showing up twice. Each time, one of the files shows zeroes. This is not consistent with what happens if the files are identical (as in the example from the help docs). For example, r3 = applyPileups(PileupFiles(c(fl1, fl1)), function(x) x, param=testparam) #file 1 entered twice r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003] yields: 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam A 0 0 C 0 0 G 10 10 T 0 0 N 0 0 Is this the expected behavior? It seems like each position should only show up once in the output. Is there something I am missing? Thanks, Jonathon Hill Postdoc Yost Lab, University of Utah [[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793