[Bioc-devel] C++ parallel computing
Thank you for the information. I guess I'll try to stick to R-level parallelization whenever possible. Best, Oleksii
On Wed, 26 May 2021 at 13:47, Martin Morgan <mtmorgan.bioc at gmail.com> wrote:
The best way to process large files is in chunks, using BamFile(?,
yieldSize = ?) and ScanBamParam() to select just the components of
the BAM files of interest. The number of cores is basically irrelevant for
input -- you'll be using just one -- so choose yieldSize to use a modest
amount of memory for primary data, e.g., 4 GB per file, and process each
file separately.
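For instance, a minimal sketch (untested; the selected fields and the
yieldSize value are illustrative):

library(Rsamtools)
## restrict input to the fields actually needed, e.g., read name,
## sequence, and CIGAR; 'fl' is the BAM file path
param <- ScanBamParam(what = c("qname", "seq", "cigar"))
bf <- BamFile(fl, yieldSize = 1000000)
res <- scanBam(bf, param = param)   # reads the first chunk of records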
Figure out the iteration-by-chunk solution for one file first; the simplest
example is in ?Rsamtools::BamFile:
## Use 'yieldSize' to iterate through a file in chunks.
bf <- open(BamFile(fl, yieldSize=1000))
while (nrec <- length(scanBam(bf)[[1]][[1]]))
    cat("records:", nrec, "\n")
close(bf)
but you'd likely want the convenience of
GenomicAlignments::readGAlignments() / readGAlignmentPairs().
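For example, the same chunk loop with readGAlignments() might look like
this (a sketch, untested; chunked iteration requires the BamFile to be
open):

library(Rsamtools)
library(GenomicAlignments)
bf <- open(BamFile(fl, yieldSize = 1000000))
repeat {
    gal <- readGAlignments(bf)   # next chunk of alignments
    if (length(gal) == 0L)
        break
    ## ... process the chunk ...
}
close(bf)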
Once this is working, write this as a proper function, specifying all
packages required for the function to complete, e.g.,
fun <- function(fl, yieldSize) {
    library(Rsamtools)
    nrec <- 0L
    bf <- open(BamFile(fl, yieldSize = yieldSize))
    ## count records chunk by chunk until the file is exhausted
    repeat {
        len <- length(scanBam(bf)[[1]][[1]])
        if (len == 0L)
            break
        nrec <- nrec + len
    }
    close(bf)
    nrec
}
Try to minimize the size of the inputs (here just the file name) and of
the outputs (nrec, a single integer), perhaps using the file system to
temporarily store large results. Use BiocParallel::bplapply() to apply the
function to all files:
bplapply(fls, fun, yieldSize = 1000000)
I would actually recommend BiocParallel::SnowParam() (separate processes),
because (a) it enforces the discipline that the function does not rely
implicitly on the state of the parent process, and (b) it ensures operation
across all OSes and eases the transition to, e.g., an HPC cluster. The fixed
cost of starting a separate process for each file is outweighed by the
time spent processing the file in that process.
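For example, with a hypothetical four workers:

library(BiocParallel)
param <- SnowParam(workers = 4)   # separate R processes, not forks
res <- bplapply(fls, fun, yieldSize = 1000000, BPPARAM = param)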
GenomicFiles::reduceByYield() or reduceByFile() might be relevant.
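For instance, a minimal reduceByYield() sketch counting records as above
(untested; it follows the pattern in the GenomicFiles documentation):

library(GenomicFiles)
library(GenomicAlignments)
bf <- BamFile(fl, yieldSize = 1000000)
nrec <- reduceByYield(bf,
    YIELD = function(x) readGAlignments(x),  # read the next chunk
    MAP = function(x) length(x),             # summarize the chunk
    REDUCE = `+`)                            # combine chunk summaries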
I am not totally current (others on this list probably know more), but I
don't think OpenMP is supported on macOS (
https://mac.r-project.org/openmp/), so it would be a poor choice at the C
level if cross-platform utility were important. If it were me, and again I
do not have enough recent experience, I might aim for Intel Threading
Building Blocks, using RcppParallel for inspiration.
Martin
From: Oleksii Nikolaienko <oleksii.nikolaienko at gmail.com>
Date: Tuesday, May 25, 2021 at 6:28 PM
To: Martin Morgan <mtmorgan.bioc at gmail.com>
Cc: "bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] C++ parallel computing
Hi Martin,
thanks for your answer. The goal is to speed up my package (epialleleR),
where most of the functions are already written in C++, but the code is
single-threaded. Tasks include: applying an analog of
GenomicAlignments::sequenceLayer to SEQ, QUAL and XM strings, calculating
per-read methylation beta values, and creating methylation cytosine reports
with prefiltering of sequence reads. I could probably parallelize all of
them at the R level, but even then I'd maybe like to use OpenMP SIMD
directives.
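(For context, the stock R-level call I'm reimplementing is roughly the
following; 'gal' is a hypothetical GAlignments object read with the seq
field requested:)

library(GenomicAlignments)
## lay query-space SEQ strings onto reference coordinates, injecting
## '-' at deletions and removing insertions
seq_on_ref <- sequenceLayer(mcols(gal)$seq, cigar(gal),
                            from = "query", to = "reference")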
And yes, the plan is to use Rhtslib. The current backend for reading BAM
files is Rsamtools, but I believe I could speed things up significantly by
avoiding unnecessary type conversions and cutting other corners. It doesn't
hurt much when the BAM file is smaller than 1 GB, but for a 20-40 GB file,
loading takes more than an hour (on a 24-core, 378 GB RAM workstation).
Best,
Oleksii
On Tue, 25 May 2021 at 19:39, Martin Morgan <mtmorgan.bioc at gmail.com> wrote:
If the BAM files are each processed independently, and each processing
task takes a while, then it is probably 'good enough' to use R-level
parallel evaluation with BiocParallel (currently the recommendation for
Bioconductor packages) or another evaluation framework. Also, presumably you
will use Rhtslib, which provides C-level access to the hts library. This
will require writing C / C++ code to interface between R and the hts
library, and will of course be a significant undertaking.
It might be worth outlining in a bit more detail what your task is and how
(not too much detail!) you've tried to implement this in Rsamtools.
Martin Morgan
On 5/24/21, 10:01 AM, "Bioc-devel on behalf of Oleksii Nikolaienko"
<bioc-devel-bounces at r-project.org on behalf of
oleksii.nikolaienko at gmail.com> wrote:
Dear Bioc team,
I'd like to ask for your advice on parallelization within a Bioc
package. Please point me to a better place if this mailing list is not
appropriate.
After a bit of thinking, I decided that I'd like to parallelize processing
at the level of C++ code. Would you strongly recommend against that in
favor of an R-level approach instead (e.g. "future")?
If parallel C++ is OK, what would be the best solution for all major OSes?
My initial choice was OpenMP, but it seems that Apple has something
against it (https://mac.r-project.org/openmp/). My own dev environment is
mostly Big Sur/ARM64, but I wouldn't want to drop support for it anyway.
(On the actual task: loading and specific processing of very large BAM
files, ideally significantly faster than by means of Rsamtools as a
backend.)
Best,
Oleksii Nikolaienko