Hej Bioc Core!
There was some discussion last year about implementing a BamStreamer (? la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
## create a stream
stopifnot(is(bamFile,"BamFile"))
## set the yieldSize if it is not set already
if(is.na(yieldSize(bamFile))){
yieldSize(bamFile) <- yieldSize
}
## open it
open(bamFile)
## verb
if(verbose){
message(paste("Streaming",basename(path(bamFile))))
}
## create the output
out <- GappedAlignments()
## process it
while(length(chunk <- readBamGappedAlignments(bamFile))){
if(verbose){
message(paste("Processed",length(chunk),"reads"))
}
out <- c(out,chunk)
}
## close
close(bamFile)
## return
return(out)
}
In the method above, the first iteration of combining the GappedAlignments:
out <- c(out,chunk) takes:
system.time(append(out,chunk))
user system elapsed
123.704 0.060 124.011
whereas the second iteration (faked here) takes only (still long):
system.time(append(chunk,chunk))
user system elapsed
2.708 0.044 2.758
I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
out <- GappedAlignments(seqlengths=seqlengths(chunk))
Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
'names(seqlengths)' incompatible with 'levels(seqnames)'
out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
'strand' must be specified when 'seqnames' is not empty
out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
Error in validObject(.Object) :
invalid class ?GappedAlignments? object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
invalid class ?GappedAlignments? object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
My sessionInfo:
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12
[4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15
[7] BiocGenerics_0.7.2
loaded via a namespace (and not attached):
[1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
[Bioc-devel] appending 2 GappedAlignments using "c" takes long
7 messages · Martin Morgan, Hervé Pagès, Nicolas Delhomme
1 day later
----- Nicolas Delhomme <delhomme at embl.de> wrote:
Hej Bioc Core! There was some discussion last year about implementing a BamStreamer (? la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
Hi Nico --
I had viewed the 'BamStreamer' functionality to be implemented by
bf <- open(BamFile("foo.bam", yieldSize=1234567))
while (length(x <- readGAlignments(bf))) {}
close(bf)
paradigm, without wanting to wrap it further (FastqStreamer isn't any more convenient) because the tough work will be done in {} and one would have to create some sort of complicated rule about function signatures for implementing this as a call-back. I usually implement '{}' as a reduction where all the work is done (as with summarizeOverlaps, where the bam data is reduced to a count vector and then summed across iterations through the loop) -- it doesn't seem like there's any memory management gains to be had by concatenating GappedAlignments (these are renamed GAlignments in devel) together; usually I'd parallelize over files
bfl <- open(BamFileList(fls, yieldSize=1234567)))
result <- mclapply(bfl, function(bf, anno, ...) {
## initialize, e.g., cnt <- integer(length(anno)))
while(length(x <- readGAlignments(bf))) {}
## return, e.g., anno
})
close(bfl)
This doesn't address the efficiency of appending GAlignments.
Martin
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
## create a stream
stopifnot(is(bamFile,"BamFile"))
## set the yieldSize if it is not set already
if(is.na(yieldSize(bamFile))){
yieldSize(bamFile) <- yieldSize
}
## open it
open(bamFile)
## verb
if(verbose){
message(paste("Streaming",basename(path(bamFile))))
}
## create the output
out <- GappedAlignments()
## process it
while(length(chunk <- readBamGappedAlignments(bamFile))){
if(verbose){
message(paste("Processed",length(chunk),"reads"))
}
out <- c(out,chunk)
}
## close
close(bamFile)
## return
return(out)
}
In the method above, the first iteration of combining the GappedAlignments:
out <- c(out,chunk) takes:
system.time(append(out,chunk))
user system elapsed
123.704 0.060 124.011
whereas the second iteration (faked here) takes only (still long):
system.time(append(chunk,chunk))
user system elapsed
2.708 0.044 2.758
I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
out <- GappedAlignments(seqlengths=seqlengths(chunk))
Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
'names(seqlengths)' incompatible with 'levels(seqnames)'
out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
'strand' must be specified when 'seqnames' is not empty
out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
Error in validObject(.Object) :
invalid class ?GappedAlignments? object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
invalid class ?GappedAlignments? object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
My sessionInfo:
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12
[4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15
[7] BiocGenerics_0.7.2
loaded via a namespace (and not attached):
[1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hi Nico,
On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
Hej Bioc Core!
There was some discussion last year about implementing a BamStreamer (? la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
## create a stream
stopifnot(is(bamFile,"BamFile"))
## set the yieldSize if it is not set already
if(is.na(yieldSize(bamFile))){
yieldSize(bamFile) <- yieldSize
}
## open it
open(bamFile)
## verb
if(verbose){
message(paste("Streaming",basename(path(bamFile))))
}
## create the output
out <- GappedAlignments()
## process it
while(length(chunk <- readBamGappedAlignments(bamFile))){
if(verbose){
message(paste("Processed",length(chunk),"reads"))
}
out <- c(out,chunk)
}
Note that regardless the speed of c() on GappedAlignments objects, growing an object in a loop is fundamentally inefficient (see Circle 2 of The R Inferno). Also keeping the chunks in memory kind of defeats the purpose of reading the file one chunk at a time.
## close close(bamFile) ## return return(out) } In the method above, the first iteration of combining the GappedAlignments: out <- c(out,chunk) takes: system.time(append(out,chunk)) user system elapsed 123.704 0.060 124.011
2 minutes! Whaoo, that's really slow. I can't reproduce this on my
machine though:
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
yieldSize(bamfile) <- 100000L
open(bamfile)
out <- GappedAlignments()
Then:
> chunk <- readBamGappedAlignments(bamfile)
> system.time(out <- append(out, chunk))
user system elapsed
0.284 0.000 0.286
I wonder what's going on on your system. Are you sure it was not running
out of memory when you did this? Try to check the load with uptime or
top in another terminal (e.g. start top right before you call append()).
If the system starts swapping, then your R process will become hundreds
or thousands times slower!
whereas the second iteration (faked here) takes only (still long): system.time(append(chunk,chunk)) user system elapsed 2.708 0.044 2.758
2nd, 3rd and 4th iterations for me:
> chunk <- readBamGappedAlignments(bamfile)
> system.time(out <- append(out, chunk))
user system elapsed
0.516 0.004 0.521
> chunk <- readBamGappedAlignments(bamfile)
> system.time(out <- append(out, chunk))
user system elapsed
0.656 0.008 0.663
> chunk <- readBamGappedAlignments(bamfile)
> system.time(out <- append(out, chunk))
user system elapsed
0.796 0.004 0.801
As expected, the time is growing (this is why the process
of growing an object in a loop is considered to be quadratic
in time).
I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration.
The seqinfo of the 2 objects to combine need to be merged together and set back on each object before the 2 objects can actually be combined. This operation is cheap and I wouldn't expect this to slow down the first iteration significantly.
Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below: out <- GappedAlignments(seqlengths=seqlengths(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk)) : 'names(seqlengths)' incompatible with 'levels(seqnames)' out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) : 'strand' must be specified when 'seqnames' is not empty out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+") Error in validObject(.Object) : invalid class ?GappedAlignments? object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle" invalid class ?GappedAlignments? object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
The trick is to create an empty GappedAlignments objects
with non-empty seqlevels so you can put seqlengths on the
seqlevels.
Here are 2 ways to create an empty GappedAlignments objects with
non-empty seqlevels:
(1) Pass an empty factor with non-empty levels to the seqnames
arg:
out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
(2) The recommended way:
out <- GappedAlignments()
seqinfo(out) <- seqinfo(chunk)
Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
its seqlengths), not only its seqlevels.
(1) could be adapted to also set the seqlengths:
out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
seqlengths=seqlengths(chunk))
but (2) is really the preferred way.
I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem. My sessionInfo: R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12 [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15 [7] BiocGenerics_0.7.2 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
Looks like you are using Bioc-devel. Did you get all the
warnings about GappedAlignments, readBamGappedAlignments(),
and GappedAlignments() being deprecated?
I thought you were using the release so that's what I used:
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3
[3] Biostrings_2.28.0 GenomicRanges_1.12.4
[5] IRanges_1.18.1 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] bitops_1.0-5 stats4_3.0.0 zlibbioc_1.6.0
The timings I get with Bioc-devel are pretty much the same though.
Something doesn't seem to be quite right with your cluster. What happens
if you try to rbind() 2 data.frames of 100000 rows each in a fresh
session?
> df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd")
> system.time(df2 <- rbind(df, df))
user system elapsed
0.204 0.000 0.206
Thanks,
H.
Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
Hi Martin,
On Jul 10, 2013, at 8:40 PM, Martin Morgan wrote:
----- Nicolas Delhomme <delhomme at embl.de> wrote:
Hej Bioc Core! There was some discussion last year about implementing a BamStreamer (? la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
Hi Nico --
I had viewed the 'BamStreamer' functionality to be implemented by
bf <- open(BamFile("foo.bam", yieldSize=1234567))
while (length(x <- readGAlignments(bf))) {}
close(bf)
Good to see I recalled your Cambridge teaching correctly ;-)
paradigm, without wanting to wrap it further (FastqStreamer isn't any more convenient) because the tough work will be done in {} and one would have to create some sort of complicated rule about function signatures for implementing this as a call-back. I usually implement '{}' as a reduction where all the work is done (as with summarizeOverlaps, where the bam data is reduced to a count vector and then summed across iterations through the loop) -- it doesn't seem like there's any memory management gains to be had by concatenating GappedAlignments (these are renamed GAlignments in devel) together;
I agree and that's what I meant by having created a streamlined version of my function so has to show the paradigm, not the actual data reduction steps.
usually I'd parallelize over files
bfl <- open(BamFileList(fls, yieldSize=1234567)))
result <- mclapply(bfl, function(bf, anno, ...) {
## initialize, e.g., cnt <- integer(length(anno)))
while(length(x <- readGAlignments(bf))) {}
## return, e.g., anno
})
close(bfl)
So do I.
This doesn't address the efficiency of appending GAlignments.
This was really a side effect that appeared when I was preparing the code example. I'll go through Herv?'s answer about this, they might indeed have been an issue with R version (devel or not), as I didn't get the GAlignments warnings although the package version numbers seemed to match what's in devel. I'll have a closer look. Thanks, Nico
Martin
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
## create a stream
stopifnot(is(bamFile,"BamFile"))
## set the yieldSize if it is not set already
if(is.na(yieldSize(bamFile))){
yieldSize(bamFile) <- yieldSize
}
## open it
open(bamFile)
## verb
if(verbose){
message(paste("Streaming",basename(path(bamFile))))
}
## create the output
out <- GappedAlignments()
## process it
while(length(chunk <- readBamGappedAlignments(bamFile))){
if(verbose){
message(paste("Processed",length(chunk),"reads"))
}
out <- c(out,chunk)
}
## close
close(bamFile)
## return
return(out)
}
In the method above, the first iteration of combining the GappedAlignments:
out <- c(out,chunk) takes:
system.time(append(out,chunk))
user system elapsed
123.704 0.060 124.011
whereas the second iteration (faked here) takes only (still long):
system.time(append(chunk,chunk))
user system elapsed
2.708 0.044 2.758
I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
out <- GappedAlignments(seqlengths=seqlengths(chunk))
Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
'names(seqlengths)' incompatible with 'levels(seqnames)'
out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
'strand' must be specified when 'seqnames' is not empty
out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
Error in validObject(.Object) :
invalid class ?GappedAlignments? object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
invalid class ?GappedAlignments? object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
My sessionInfo:
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12
[4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15
[7] BiocGenerics_0.7.2
loaded via a namespace (and not attached):
[1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hej Herv?! --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany ---------------------------------------------------------------
On Jul 10, 2013, at 8:54 PM, Herv? Pag?s wrote:
Hi Nico, On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
Hej Bioc Core!
There was some discussion last year about implementing a BamStreamer (? la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
## create a stream
stopifnot(is(bamFile,"BamFile"))
## set the yieldSize if it is not set already
if(is.na(yieldSize(bamFile))){
yieldSize(bamFile) <- yieldSize
}
## open it
open(bamFile)
## verb
if(verbose){
message(paste("Streaming",basename(path(bamFile))))
}
## create the output
out <- GappedAlignments()
## process it
while(length(chunk <- readBamGappedAlignments(bamFile))){
if(verbose){
message(paste("Processed",length(chunk),"reads"))
}
out <- c(out,chunk)
}
Note that regardless the speed of c() on GappedAlignments objects, growing an object in a loop is fundamentally inefficient (see Circle 2 of The R Inferno). Also keeping the chunks in memory kind of defeats the purpose of reading the file one chunk at a time.
Sure. What this function normally really does is a data reduction - basically getting a named vector back. I just came across the appending issue when preparing the code example above.
## close close(bamFile) ## return return(out) } In the method above, the first iteration of combining the GappedAlignments: out <- c(out,chunk) takes: system.time(append(out,chunk)) user system elapsed 123.704 0.060 124.011
2 minutes! Whaoo, that's really slow. I can't reproduce this on my machine though:
OK, sounds more like a system issue then.
library(Rsamtools) library(RNAseqData.HNRNPC.bam.chr14) bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L]) yieldSize(bamfile) <- 100000L open(bamfile) out <- GappedAlignments() Then:
> chunk <- readBamGappedAlignments(bamfile) > system.time(out <- append(out, chunk))
user system elapsed 0.284 0.000 0.286 I wonder what's going on on your system. Are you sure it was not running out of memory when you did this?
Yes, that's a fat node with 0.2TB RAM and I was the only one on it at the time.
Try to check the load with uptime or top in another terminal (e.g. start top right before you call append()). If the system starts swapping, then your R process will become hundreds or thousands times slower!
and there was no memory intensive job running. Could still have been some NFS related issue. I will retry with a fresh session and monitor the I/O as well.
whereas the second iteration (faked here) takes only (still long): system.time(append(chunk,chunk)) user system elapsed 2.708 0.044 2.758
2nd, 3rd and 4th iterations for me:
> chunk <- readBamGappedAlignments(bamfile) > system.time(out <- append(out, chunk))
user system elapsed 0.516 0.004 0.521
> chunk <- readBamGappedAlignments(bamfile) > system.time(out <- append(out, chunk))
user system elapsed 0.656 0.008 0.663
> chunk <- readBamGappedAlignments(bamfile) > system.time(out <- append(out, chunk))
user system elapsed 0.796 0.004 0.801 As expected, the time is growing (this is why the process of growing an object in a loop is considered to be quadratic in time).
Quadratic! Wow, I knew it was slower but still... Good to know.
I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration.
The seqinfo of the 2 objects to combine need to be merged together and set back on each object before the 2 objects can actually be combined. This operation is cheap and I wouldn't expect this to slow down the first iteration significantly.
Yes, that was very surprising.
Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below: out <- GappedAlignments(seqlengths=seqlengths(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk)) : 'names(seqlengths)' incompatible with 'levels(seqnames)' out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) : 'strand' must be specified when 'seqnames' is not empty out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+") Error in validObject(.Object) : invalid class ?GappedAlignments? object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle" invalid class ?GappedAlignments? object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
The trick is to create an empty GappedAlignments objects
with non-empty seqlevels so you can put seqlengths on the
seqlevels.
Here are 2 ways to create an empty GappedAlignments objects with
non-empty seqlevels:
(1) Pass an empty factor with non-empty levels to the seqnames
arg:
out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
(2) The recommended way:
out <- GappedAlignments()
seqinfo(out) <- seqinfo(chunk)
Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
its seqlengths), not only its seqlevels.
(1) could be adapted to also set the seqlengths:
out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
seqlengths=seqlengths(chunk))
but (2) is really the preferred way.
Thanks for the pointers!
I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem. My sessionInfo: R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12 [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15 [7] BiocGenerics_0.7.2 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
Looks like you are using Bioc-devel. Did you get all the warnings about GappedAlignments, readBamGappedAlignments(), and GappedAlignments() being deprecated?
I though I did, but indeed I didn't get the warnings then. This is very strange.
I thought you were using the release so that's what I used:
sessionInfo()
R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3 [3] Biostrings_2.28.0 GenomicRanges_1.12.4 [5] IRanges_1.18.1 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.0 zlibbioc_1.6.0 The timings I get with Bioc-devel are pretty much the same though. Something doesn't seem to be quite right with your cluster.
I agree, I'll check that out.
What happens if you try to rbind() 2 data.frames of 100000 rows each in a fresh session?
> df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd") > system.time(df2 <- rbind(df, df))
user system elapsed 0.204 0.000 0.206
Good point. I'll try that out and let you know. Thanks for the very detailed answer! Cheers, Nico
Thanks, H.
Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
That particular machine has been up for 40 days although all the parameters are in the green right now. Doing an rbind is as quick as what you reported:
system.time(df2 <- rbind(df, df))
user system elapsed 0.104 0.000 0.104 And now I do get the GAlignments warning:
GappedAlignments()
GAlignments with 0 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
ngap
<integer>
---
seqlengths:
Warning message:
The GappedAlignments class, the GappedAlignments()
constructor, and the readGappedAlignments() function, have been
renamed: GAlignments, GAlignments(), and readGAlignments(),
respectively. The old names are deprecated. Please use the new
names instead.
And the appending works as for you:
library(Rsamtools) library(RNAseqData.HNRNPC.bam.chr14) bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L]) yieldSize(bamfile) <- 100000L open(bamfile) out <- GappedAlignments()
Warning message: The GappedAlignments class, the GappedAlignments() constructor, and the readGappedAlignments() function, have been renamed: GAlignments, GAlignments(), and readGAlignments(), respectively. The old names are deprecated. Please use the new names instead.
chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
system.time(out <- append(out, chunk))
user system elapsed 0.092 0.000 0.091
chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
system.time(out <- append(out, chunk))
user system elapsed 0.372 0.000 0.369
chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
system.time(out <- append(out, chunk))
user system elapsed 0.896 0.012 0.909 And the sessionInfo are as before:
sessionInfo()
R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicRanges_1.13.26 Biostrings_2.29.12 XVector_0.1.0 [4] IRanges_1.19.15 BiocGenerics_0.7.2 loaded via a namespace (and not attached): [1] stats4_3.0.1 So I'm not sure what happened; so far, I can only imagine an NFS / RAID related issue. Doing it with my own data gives the same results as above. Sorry for bothering you with that and many thanks for the help. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany ---------------------------------------------------------------
On Jul 11, 2013, at 10:15 AM, Nicolas Delhomme wrote:
Hej Herv?! --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Jul 10, 2013, at 8:54 PM, Herv? Pag?s wrote:
Hi Nico, On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
Hej Bioc Core!
There was some discussion last year about implementing a BamStreamer (? la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
## create a stream
stopifnot(is(bamFile,"BamFile"))
## set the yieldSize if it is not set already
if(is.na(yieldSize(bamFile))){
yieldSize(bamFile) <- yieldSize
}
## open it
open(bamFile)
## verb
if(verbose){
message(paste("Streaming",basename(path(bamFile))))
}
## create the output
out <- GappedAlignments()
## process it
while(length(chunk <- readBamGappedAlignments(bamFile))){
if(verbose){
message(paste("Processed",length(chunk),"reads"))
}
out <- c(out,chunk)
}
Note that regardless the speed of c() on GappedAlignments objects, growing an object in a loop is fundamentally inefficient (see Circle 2 of The R Inferno). Also keeping the chunks in memory kind of defeats the purpose of reading the file one chunk at a time.
Sure. What this function normally really does is a data reduction - basically getting a named vector back. I just came across the appending issue when preparing the code example above.
## close close(bamFile) ## return return(out) } In the method above, the first iteration of combining the GappedAlignments: out <- c(out,chunk) takes: system.time(append(out,chunk)) user system elapsed 123.704 0.060 124.011
2 minutes! Whaoo, that's really slow. I can't reproduce this on my machine though:
OK, sounds more like a system issue then.
library(Rsamtools) library(RNAseqData.HNRNPC.bam.chr14) bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L]) yieldSize(bamfile) <- 100000L open(bamfile) out <- GappedAlignments() Then:
chunk <- readBamGappedAlignments(bamfile) system.time(out <- append(out, chunk))
user system elapsed 0.284 0.000 0.286 I wonder what's going on on your system. Are you sure it was not running out of memory when you did this?
Yes, that's a fat node with 0.2TB RAM and I was the only one on it at the time.
Try to check the load with uptime or top in another terminal (e.g. start top right before you call append()). If the system starts swapping, then your R process will become hundreds or thousands times slower!
and there was no memory intensive job running. Could still have been some NFS related issue. I will retry with a fresh session and monitor the I/O as well.
whereas the second iteration (faked here) takes only (still long): system.time(append(chunk,chunk)) user system elapsed 2.708 0.044 2.758
2nd, 3rd and 4th iterations for me:
chunk <- readBamGappedAlignments(bamfile) system.time(out <- append(out, chunk))
user system elapsed 0.516 0.004 0.521
chunk <- readBamGappedAlignments(bamfile) system.time(out <- append(out, chunk))
user system elapsed 0.656 0.008 0.663
chunk <- readBamGappedAlignments(bamfile) system.time(out <- append(out, chunk))
user system elapsed 0.796 0.004 0.801 As expected, the time is growing (this is why the process of growing an object in a loop is considered to be quadratic in time).
Quadratic! Wow, I knew it was slower but still... Good to know.
I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration.
The seqinfo of the 2 objects to combine need to be merged together and set back on each object before the 2 objects can actually be combined. This operation is cheap and I wouldn't expect this to slow down the first iteration significantly.
Yes, that was very surprising.
Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below: out <- GappedAlignments(seqlengths=seqlengths(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk)) : 'names(seqlengths)' incompatible with 'levels(seqnames)' out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) : 'strand' must be specified when 'seqnames' is not empty out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+") Error in validObject(.Object) : invalid class ?GappedAlignments? object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle" invalid class ?GappedAlignments? object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
The trick is to create an empty GappedAlignments objects
with non-empty seqlevels so you can put seqlengths on the
seqlevels.
Here are 2 ways to create an empty GappedAlignments objects with
non-empty seqlevels:
(1) Pass an empty factor with non-empty levels to the seqnames
arg:
out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
(2) The recommended way:
out <- GappedAlignments()
seqinfo(out) <- seqinfo(chunk)
Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
its seqlengths), not only its seqlevels.
(1) could be adapted to also set the seqlengths:
out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
seqlengths=seqlengths(chunk))
but (2) is really the preferred way.
Thanks for the pointers!
I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem. My sessionInfo: R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12 [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15 [7] BiocGenerics_0.7.2 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
Looks like you are using Bioc-devel. Did you get all the warnings about GappedAlignments, readBamGappedAlignments(), and GappedAlignments() being deprecated?
I though I did, but indeed I didn't get the warnings then. This is very strange.
I thought you were using the release so that's what I used:
sessionInfo()
R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3 [3] Biostrings_2.28.0 GenomicRanges_1.12.4 [5] IRanges_1.18.1 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.0 zlibbioc_1.6.0 The timings I get with Bioc-devel are pretty much the same though. Something doesn't seem to be quite right with your cluster.
I agree, I'll check that out.
What happens if you try to rbind() 2 data.frames of 100000 rows each in a fresh session?
df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd") system.time(df2 <- rbind(df, df))
user system elapsed 0.204 0.000 0.206
Good point. I'll try that out and let you know. Thanks for the very detailed answer! Cheers, Nico
Thanks, H.
Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
1 day later
Hi Nico,
On 07/11/2013 02:11 AM, Nicolas Delhomme wrote:
That particular machine has been up for 40 days although all the parameters are in the green right now. Doing an rbind is as quick as what you reported:
system.time(df2 <- rbind(df, df))
user system elapsed 0.104 0.000 0.104 And now I do get the GAlignments warning:
GappedAlignments()
GAlignments with 0 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
ngap
<integer>
---
seqlengths:
Warning message:
The GappedAlignments class, the GappedAlignments()
constructor, and the readGappedAlignments() function, have been
renamed: GAlignments, GAlignments(), and readGAlignments(),
respectively. The old names are deprecated. Please use the new
names instead.
And the appending works as for you:
library(Rsamtools) library(RNAseqData.HNRNPC.bam.chr14) bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L]) yieldSize(bamfile) <- 100000L open(bamfile) out <- GappedAlignments()
Warning message: The GappedAlignments class, the GappedAlignments() constructor, and the readGappedAlignments() function, have been renamed: GAlignments, GAlignments(), and readGAlignments(), respectively. The old names are deprecated. Please use the new names instead.
chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
system.time(out <- append(out, chunk))
user system elapsed 0.092 0.000 0.091
chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
system.time(out <- append(out, chunk))
user system elapsed 0.372 0.000 0.369
chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
system.time(out <- append(out, chunk))
user system elapsed 0.896 0.012 0.909 And the sessionInfo are as before:
sessionInfo()
R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicRanges_1.13.26 Biostrings_2.29.12 XVector_0.1.0 [4] IRanges_1.19.15 BiocGenerics_0.7.2 loaded via a namespace (and not attached): [1] stats4_3.0.1 So I'm not sure what happened; so far, I can only imagine an NFS / RAID related issue. Doing it with my own data gives the same results as above. Sorry for bothering you with that and many thanks for the help.
No problem. Maybe method dispatch was having a particularly bad day that day. Glad everything looks ok now :-) Cheers, H.
Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Jul 11, 2013, at 10:15 AM, Nicolas Delhomme wrote:
Hej Herv?! --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Jul 10, 2013, at 8:54 PM, Herv? Pag?s wrote:
Hi Nico, On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
Hej Bioc Core!
There was some discussion last year about implementing a BamStreamer (? la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
## create a stream
stopifnot(is(bamFile,"BamFile"))
## set the yieldSize if it is not set already
if(is.na(yieldSize(bamFile))){
yieldSize(bamFile) <- yieldSize
}
## open it
open(bamFile)
## verb
if(verbose){
message(paste("Streaming",basename(path(bamFile))))
}
## create the output
out <- GappedAlignments()
## process it
while(length(chunk <- readBamGappedAlignments(bamFile))){
if(verbose){
message(paste("Processed",length(chunk),"reads"))
}
out <- c(out,chunk)
}
Note that regardless the speed of c() on GappedAlignments objects, growing an object in a loop is fundamentally inefficient (see Circle 2 of The R Inferno). Also keeping the chunks in memory kind of defeats the purpose of reading the file one chunk at a time.
Sure. What this function normally really does is a data reduction - basically getting a named vector back. I just came across the appending issue when preparing the code example above.
## close close(bamFile) ## return return(out) } In the method above, the first iteration of combining the GappedAlignments: out <- c(out,chunk) takes: system.time(append(out,chunk)) user system elapsed 123.704 0.060 124.011
2 minutes! Whaoo, that's really slow. I can't reproduce this on my machine though:
OK, sounds more like a system issue then.
library(Rsamtools) library(RNAseqData.HNRNPC.bam.chr14) bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L]) yieldSize(bamfile) <- 100000L open(bamfile) out <- GappedAlignments() Then:
chunk <- readBamGappedAlignments(bamfile) system.time(out <- append(out, chunk))
user system elapsed 0.284 0.000 0.286 I wonder what's going on on your system. Are you sure it was not running out of memory when you did this?
Yes, that's a fat node with 0.2TB RAM and I was the only one on it at the time.
Try to check the load with uptime or top in another terminal (e.g. start top right before you call append()). If the system starts swapping, then your R process will become hundreds or thousands times slower!
and there was no memory intensive job running. Could still have been some NFS related issue. I will retry with a fresh session and monitor the I/O as well.
whereas the second iteration (faked here) takes only (still long): system.time(append(chunk,chunk)) user system elapsed 2.708 0.044 2.758
2nd, 3rd and 4th iterations for me:
chunk <- readBamGappedAlignments(bamfile) system.time(out <- append(out, chunk))
user system elapsed 0.516 0.004 0.521
chunk <- readBamGappedAlignments(bamfile) system.time(out <- append(out, chunk))
user system elapsed 0.656 0.008 0.663
chunk <- readBamGappedAlignments(bamfile) system.time(out <- append(out, chunk))
user system elapsed 0.796 0.004 0.801 As expected, the time is growing (this is why the process of growing an object in a loop is considered to be quadratic in time).
Quadratic! Wow, I knew it was slower but still... Good to know.
I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration.
The seqinfo of the 2 objects to combine need to be merged together and set back on each object before the 2 objects can actually be combined. This operation is cheap and I wouldn't expect this to slow down the first iteration significantly.
Yes, that was very surprising.
Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below: out <- GappedAlignments(seqlengths=seqlengths(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk)) : 'names(seqlengths)' incompatible with 'levels(seqnames)' out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) : 'strand' must be specified when 'seqnames' is not empty out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+") Error in validObject(.Object) : invalid class ?GappedAlignments? object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle" invalid class ?GappedAlignments? object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
The trick is to create an empty GappedAlignments objects
with non-empty seqlevels so you can put seqlengths on the
seqlevels.
Here are 2 ways to create an empty GappedAlignments objects with
non-empty seqlevels:
(1) Pass an empty factor with non-empty levels to the seqnames
arg:
out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
(2) The recommended way:
out <- GappedAlignments()
seqinfo(out) <- seqinfo(chunk)
Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
its seqlengths), not only its seqlevels.
(1) could be adapted to also set the seqlengths:
out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
seqlengths=seqlengths(chunk))
but (2) is really the preferred way.
Thanks for the pointers!
I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem. My sessionInfo: R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12 [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15 [7] BiocGenerics_0.7.2 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
Looks like you are using Bioc-devel. Did you get all the warnings about GappedAlignments, readBamGappedAlignments(), and GappedAlignments() being deprecated?
I though I did, but indeed I didn't get the warnings then. This is very strange.
I thought you were using the release so that's what I used:
sessionInfo()
R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3 [3] Biostrings_2.28.0 GenomicRanges_1.12.4 [5] IRanges_1.18.1 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.0 zlibbioc_1.6.0 The timings I get with Bioc-devel are pretty much the same though. Something doesn't seem to be quite right with your cluster.
I agree, I'll check that out.
What happens if you try to rbind() 2 data.frames of 100000 rows each in a fresh session?
df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd") system.time(df2 <- rbind(df, df))
user system elapsed 0.204 0.000 0.206
Good point. I'll try that out and let you know. Thanks for the very detailed answer! Cheers, Nico
Thanks, H.
Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319