[Bioc-devel] Can not import gapped bam, Rsamtools bug or my parameters?
On 04/28/2012 07:23 PM, Martin Morgan wrote:
On 04/28/2012 07:01 PM, Jesper G?din wrote:
Hi Everyone! I have problems with importing a Bamfile that has its reads aligned, which has gaps. Thought that it only was to change the isUnmappedQuery to FALSE, but looking at my result, that is not the case. Read 1 and 9 (see results below) have gaps and gets wrong positions in the imported GRanges Object. Is it a bug in Rsamtools or have I set the parameters wrong?
Hi Jesper -- I can't tell what your alignment is; can you use samtools to view record ERR031838.7783934 and ERR031838.58134419, or use filterBam to create a very small bam file with just the reads that are a problem? Also, maybe GenomicRanges::readGappedAlignments does some of the work for you?
Sorry, looking at your code a little more closely...
ranges<-IRanges(
start=bam[[rangeName]][["pos"]],
width=bam[[rangeName]][["qwidth"]]
)
isn't correct -- qwidth is the width of the query (short read), not the
width of the alignment. The width of the alignment needs to be
calculated using the cigar; see GenomicRanges::cigarToWidth. But
readGappedAlignments (also readBamGappedAlignments / readBamGappedReads)
will do this work for you.
Martin
Martin
#Below the script that can reproduce the result I got
rm(list=ls())
library(Biostrings)
library(IRanges)
library(GenomicRanges)
library(Rsamtools)
BamImpGRList<-function(UserDir,searchArea,opt=TRUE){
#Set parameters
which<- searchArea #A GRanges, RangesList, RangedData, or missing
object, from which a IRangesList instance will be constructed.
what<- scanBamWhat() #A character vector naming the fields to return.
scanBamWhat() returns a vector of available fields. Fields are
described on
the scanBam help page.
flag<- scanBamFlag(isUnmappedQuery=FALSE)
#scanBamFlag(isPaired = NA, isProperPair = NA, isUnmappedQuery = FALSE,
hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA,
isFirstMateRead = NA, isSecondMateRead = NA, isNotPrimaryRead = NA,
isValidVendorRead = NA, isDuplicate = NA)
param<-ScanBamParam(flag=flag,which=which,what=what) #store
ScanBamParam,which and what in param.
#Point to correct directory and create a BamFileList object
bamDir<- normalizePath(UserDir) #Point to the directory containing
your Bam files and its respective bam.bai files.
allFiles<- list.files(bamDir,full.names = TRUE) #list files in a
folder.
bamFiles<- allFiles[grep(".bam$",allFiles)] #list only the files
ending in .bam .
bamFilesList<- BamFileList(bamFiles) #store all the .bam paths in a
BamFile.
#Loop through, open scanBam, store in GRList and then close each object
in the BamFileList object.
BamGRL<-GRangesList()
i<-1
for(bamName in names(bamFilesList)) {
#Description
bf<- bamFilesList[[bamName]]
open(bf)
print(paste("Reading bam file",i,"with
filename",basename(bamName))) #Print information to the user
bam<-scanBam(bf,param=param)
#Description
for(rangeName in names(bam)){
ranges<-IRanges(
start=bam[[rangeName]][["pos"]], #if NA values your
in trouble. That means the read didnt map. Use better filter.
width=bam[[rangeName]][["qwidth"]]
)
GRangeBam<-GRanges(
seqnames =
as.character(bam[[rangeName]][["rname"]]), #Before "mrnm", now"rname"...
ranges = ranges,
strand = bam[[rangeName]][["strand"]],
names=bam[[rangeName]][["qname"]],
flag=bam[[rangeName]][["flag"]],
cigar=bam[[rangeName]][["cigar"]],
mapq=bam[[rangeName]][["cigar"]],
mpos=bam[[rangeName]][["mpos"]],
isize=bam[[rangeName]][["isize"]],
seq=bam[[rangeName]][["seq"]],
qual=bam[[rangeName]][["qual"]]
)
#Store GRangeBam in BamGRL (which is the GRange List
object)
if(basename(bamName)%in%names(BamGRL)){ #This way of
merging the different chromosomes to the same GRangeObject is maybe
not the
best way. Improve later.
BamGRL[[basename(bamName)]]<-c(BamGRL[[basename(bamName)]],GRangeBam)
}
else {
BamGRL[[basename(bamName)]]<-GRangeBam
}
}
print(paste("stored",basename(bamName), "in BamGRL"))
i<-1+i
gc()
close(bf)
}
return(BamGRL)
}
#Construct SearchArea
IRanges<-IRanges(
c(109852192,57522276,216225163),
c(109940573,57607134,216300895),
)
searchArea2<-GRanges(
seqnames = c("chr1","chr12","chr2"),
ranges = IRanges
)
#Import and select region of interest
BamGRL<-BamImpGRList("../bams",searchArea2)
ranges<- IRanges(start=216269079, width=1)
GR<- GRanges(seqnames="chr2",ranges)
BamGRLplus<- BamGRL[strand(BamGRL)=="+"]
subsetReads<- subsetByOverlaps(BamGRLplus[[1]],GR)
#Display the part that has the wrong positions.
subsetReads
###########Below the result###########
subsetReads
GRanges with 159 ranges and 8 elementMetadata cols: seqnames ranges strand | names flag <Rle> <IRanges> <Rle> |<character> <integer> [1] chr2 [216268992, 216269081] + | ERR031838.7783934 99 [2] chr2 [216268992, 216269081] + | ERR031838.49313802 99 [3] chr2 [216268995, 216269084] + | ERR031838.17036310 99 [4] chr2 [216268996, 216269085] + | ERR031838.57612722 99 [5] chr2 [216268997, 216269086] + | ERR031838.33383381 163 [6] chr2 [216268997, 216269086] + | ERR031838.70228271 163 [7] chr2 [216268999, 216269088] + | ERR031838.12543963 99 [8] chr2 [216268999, 216269088] + | ERR031838.40297870 163 [9] chr2 [216268999, 216269088] + | ERR031838.58134419 99 seq <DNAStringSet> [1] CTGATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGT [2] CTGATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAG [3] ATTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGTTA [4] TTTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAA [5] TTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAG [6] TTAAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATACAGTTAAG [7] AAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGAA [8] AAATTATGTACTAATTTTTTAATGTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGAA [9] AAATTATGTACTAATTTTTTAATGTTTTTTTTTTTGTTGTTTTGTTTTGTTTTAAAGCATGAAGAATAGAGAATGTAAATATAGTTAAGA #SessionInfo
sessionInfo()
R version 2.16.0 Under development (unstable) (--) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsamtools_1.7.40 GenomicRanges_1.9.7 Biostrings_2.25.2 [4] IRanges_1.15.4 BiocGenerics_0.3.0 loaded via a namespace (and not attached): [1] bitops_1.0-4.1 stats4_2.16.0 zlibbioc_1.3.0 /Jesper [[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793