[Bioc-devel] Memory usage for bplapply - Bioc-devel

Thu, Jan 3, 2019 8:51 PM #

Dear all,

I ran into a memory issue using bplapply with SnowParam(). I need to compute
something from a large matrix many times. From the discussion at
https://support.bioconductor.org/p/92587, I learned that bplapply copies
the current and parent environments to each worker. That means the
large matrix in my package will be copied many times. Do you have better
suggestions for the Windows platform?

Before I packaged my code, I used the doSNOW package with foreach
%dopar%. It seemed to consume less memory on each core (roughly the size of
the matrix the task needs). But bplapply seems to copy more than the objects in
the current environment and the environment one level above. I am very
confused and can only guess that it is copying everything.

Thanks for any help!
Best,
Lulu

Martin Morgan

Fri, Jan 4, 2019 11:38 AM #

Memory use can be complicated to understand.

    library(BiocParallel)
    
    v <- replicate(100, rnorm(10000), simplify=FALSE)
    bplapply(v, sum)

By default, bplapply splits the 100 jobs (each element of the list) equally between the number of cores available, and sends just the necessary data to the cores. Again by default, the jobs are sent 'en masse' to the cores, so if there were 10 cores (and hence 10 tasks), the first core would receive the first 10 jobs and 10 x 10000 elements, and so on. The memory used to store v on the workers would be approximately the size of v: # of workers * jobs per worker * job size = 10 * 10 * 10000 elements.

If memory were particularly tight, or if computation time for each job was highly variable, it might be advantageous to send jobs one at a time, by setting the number of tasks equal to the number of jobs, SnowParam(workers = 10, tasks = length(v)). Then the amount of memory used to store v would only be # of workers * 1 * 10000 elements; this is generally slower, because there is much more communication between the manager and the workers.
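The arithmetic above can be checked with a small sketch. It uses parallel::splitIndices() from base R as a stand-in for how jobs are grouped into tasks; BiocParallel's own chunking may differ in detail, and the variable names are illustrative only.

```r
library(parallel)  # ships with R; splitIndices() partitions 1..n into chunks

n_jobs <- 100
n_workers <- 10
job_size <- 10000  # elements per job, as in the example above

# One task per worker: 10 jobs travel together in each task.
chunks_default <- splitIndices(n_jobs, n_workers)
elements_per_task_default <- lengths(chunks_default) * job_size

# One job per task: each task carries a single job.
chunks_single <- splitIndices(n_jobs, n_jobs)
elements_per_task_single <- lengths(chunks_single) * job_size

# Elements of v held across all workers at any one time.
total_default <- sum(elements_per_task_default)             # 10 * 10 * 10000
total_single <- n_workers * max(elements_per_task_single)   # 10 * 1 * 10000
```

With BiocParallel itself, the one-job-per-task setting would be passed as bplapply(v, sum, BPPARAM = SnowParam(workers = 10, tasks = length(v))).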
    
    m <- matrix(rnorm(100 * 10000), 100, 10000)
    bplapply(seq_len(nrow(m)), function(i, m) sum(m[i, ]), m)

Here bplapply doesn't know how to send just some rows to the workers, so each worker gets a complete copy of m. This would be expensive.
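One way to avoid this, sketched below under the assumption that each job needs only its own row, is to split the matrix into a list of rows before the parallel call, so that only the rows belonging to a task travel with it. serialize() is used here as a rough proxy for the bytes sent to a socket worker, and lapply() stands in for bplapply() so the sketch runs without a cluster.

```r
set.seed(1)
m <- matrix(rnorm(100 * 10000), 100, 10000)

# Split the matrix into a list with one numeric vector per row.
rows <- lapply(seq_len(nrow(m)), function(i) m[i, ])

# Approximate bytes needed to ship the whole matrix versus a single row.
bytes_matrix <- length(serialize(m, NULL))
bytes_row <- length(serialize(rows[[1]], NULL))

# lapply() stands in for bplapply(rows, sum); each job sees only its own row.
row_sums <- unlist(lapply(rows, sum))
```

The trade-off is that the manager briefly holds both m and rows, so this helps the workers rather than the manager.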

    f <- function(x) sum(x)
        
    g <- function() {
        v <- replicate(100, rnorm(10000), simplify=FALSE)
        bplapply(v, f)
    }

This has the same memory consequences as the first example: the function `f()` is defined in the .GlobalEnv, so only the function definition (small) is sent to the workers.

    h <- function() {
        f <- function(x) sum(x)
        v <- replicate(100, rnorm(10000), simplify=FALSE)
        bplapply(v, f)
    }
        
This is expensive. The function `f()` is defined in the body of the function `h()`, so the workers receive both the function `f()` and the environment in which it was defined. That environment includes v, so each worker receives a slice of v (for `f()` to operate on) AND an entire copy of v (because it is in the environment where `f()` was defined). A similar cost would be paid in a package, if the package defined large data objects at load time.
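The cost of the enclosing environment can be made visible with serialize(), used here as a proxy for what a socket worker would receive. This is a sketch only: make_f_local() and the size_* names are hypothetical, and resetting the function's environment is one possible mitigation that is safe only when the function does not rely on anything in the environment where it was created.

```r
make_f_local <- function() {
    v <- replicate(100, rnorm(10000), simplify = FALSE)  # about 8 MB of doubles
    function(x) sum(x)  # closure whose environment also holds v
}
f_local <- make_f_local()

# Serializing the closure drags its enclosing environment (and v) along.
size_local <- length(serialize(f_local, NULL))

# Possible mitigation: point the function at the global environment,
# which is serialized as a reference rather than by value.
f_fixed <- f_local
environment(f_fixed) <- globalenv()
size_fixed <- length(serialize(f_fixed, NULL))

c(local = size_local, fixed = size_fixed)
```

Defining the worker function at top level, as with `f()` and `g()` above, achieves the same effect without touching environments by hand.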

For more guidance, it might be helpful to provide a simplified example of what you did with doSNOW, and what you do with BiocParallel.

Hope that helps,

Martin

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Lulu Chen

Sat, Jan 5, 2019 11:50 AM #

Hi Martin,

Thanks for your explanation, which helped me understand BiocParallel
much better.

I compared the memory usage of my code before packaging (using doSNOW) and after
packaging (using BiocParallel), and found that the increase is caused by
the attached packages, especially 'SummarizedExperiment'.
To support the common Bioconductor classes, as required, I used
importFrom(SummarizedExperiment,assay). After deleting this, each
thread uses nearly 200 MB less memory. I opened a new R session and found

    > pryr::mem_used()
    38.5 MB
    > library(SummarizedExperiment)
    > pryr::mem_used()
    314 MB

(I am still using R 3.5.2, and am not sure whether anything has changed in the
devel version.) I think this should be considered an issue. A lot of packages
import SummarizedExperiment just for compatibility, without knowing it can
cause such a problem.

My package also imports other packages, e.g. limma and fdrtool. Checked with
pryr::mem_used() as above, each adds only 1~2 MB. I also checked
my_package in a new session; it adds around 5 MB. However, each thread in a
parallel computation still grows by much more than 5 MB. I ran a
test: in my old code with doSNOW, I just inserted
"require('my_package')" into the foreach loop and kept the rest of the code
the same. I used 20 cores and 1000 jobs. Each thread still grew by 20~30 MB. I
don't know whether anything else adds extra cost to each
thread. Thanks!

Best,
Lulu

Martin Morgan

Sat, Jan 5, 2019 2:24 PM #

In one R session I did library(SummarizedExperiment) and then saved search(). In another R session I loaded the packages on the search path in reverse order, recording pryr::mem_used() (in bytes) after each. I ended up with

                      mem_used
methods               25870312
datasets              30062016
utils                 30062136
grDevices             30062256
graphics              30062376
stats                 30062496
stats4                32262992
parallel              32495080
BiocGenerics          38903928
S4Vectors             59586928
IRanges              100171896
GenomeInfoDb         113791328
GenomicRanges        154729400
Biobase              163335336
matrixStats          163518520
BiocParallel         167373512
DelayedArray         280812736
SummarizedExperiment 317386656

Each of the Bioconductor dependencies of SummarizedExperiment contributes to the overall size. Two dependencies (Biobase, DelayedArray) look a little unnecessary to me (they do not provide functionality that must be used by SummarizedExperiment) but removing them only reduces the total footprint to about 300MB. Somehow it makes sense that a package like SummarizedExperiment uses the data structures defined in other packages, and that it has a complex dependency graph. It is surprising how large the final footprint is.
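The measurement procedure can be sketched roughly as follows. Two substitutions are made so the sketch runs in a plain R installation: sum(gc()[, 2]) is a base-R stand-in for pryr::mem_used() (it reports megabytes rather than bytes), and pkgs holds three packages that ship with R instead of the reversed search() path from the other session. mem_used_mb() and footprint are illustrative names.

```r
# Memory currently used by R objects, in MB (stand-in for pryr::mem_used()).
mem_used_mb <- function() sum(gc()[, 2])

# Packages to attach, in order. In the real measurement this would be the
# reversed search() path saved from a session with SummarizedExperiment loaded.
pkgs <- c("stats4", "parallel", "splines")

# Attach each package in turn and record the cumulative memory use.
footprint <- vapply(pkgs, function(pkg) {
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
    mem_used_mb()
}, numeric(1))

footprint
```

The numbers are cumulative, so the cost of a single package is the difference between its row and the one before it. It should be run in a fresh session, since packages that are already loaded add nothing.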

One possible way to avoid at least some of the cost is to list SummarizedExperiment under Imports: in the DESCRIPTION file, but not mention SummarizedExperiment in the NAMESPACE. Use SummarizedExperiment::assay() in the code. I think this has complicated side effects, e.g., adding methods to the imported methods table in your package (look for ".__T__" and ".__C__" (generic and class definitions) in ls(parent.env(getNamespace(<your package>)), all.names = TRUE)), that indirectly increase the size of your package.
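A minimal sketch of that inspection, using stats4 (which ships with R) purely as a stand-in for <your package>; what it finds will depend on the package and its NAMESPACE. Note that all.names = TRUE is needed because these names start with a dot.

```r
# stats4 is only a stand-in here; substitute the name of your own package.
ns <- getNamespace("stats4")

# The parent of a namespace environment is its imports environment.
imports_env <- parent.env(ns)

# Names starting with "." are hidden unless all.names = TRUE.
imported <- ls(imports_env, all.names = TRUE)

# Keep only the generic (.__T__) and class (.__C__) table entries.
tables <- grep("^\\.__[TC]__", imported, value = TRUE)

length(tables)
head(tables)
```

Comparing length(tables) with and without the importFrom() line in the NAMESPACE would give a rough idea of how much is being pulled in.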

I'm not exactly sure what you mean in your second paragraph; maybe a specific example (if necessary, create a small package on github) would help. It sounds like you're saying that even with doSNOW there are additional costs to loading your package on the worker compared to in the master...

Martin

Shian Su

Sun, Jan 6, 2019 6:48 PM #

Can I get an indication here of what is expected to consume memory under the fork and socket models, as well as patterns to mitigate excessive memory consumption?

When using sockets, the model is that of multiple communicating machines running on their own memory, so it makes sense that memory usage is duplicated for loaded packages and the parent environment. But is the whole data object duplicated, or only the portion of the tasks assigned to a thread? That is, with 4 MB of packages, 4 MB of parent environment, and 4 MB of data to run bplapply over, is each thread going to consume 12 MB or 9 MB of memory? It is unclear to me whether the data object operated on should be thought of as a part of the parent environment.

When using forks, the model is that of multiple processes running on shared memory. This is specific to macOS and Unix variants, and I believe the model is meant to share memory until a write operation causes variables to be copied. I also believe R's internal memory management can potentially touch all the variables and cause copies, so the worst-case scenario is that everything is copied. What's unclear is whether this applies to loaded packages: are they under the supervision of the garbage collector? So, as per the previous scenario, from the second thread onwards, do we expect up to (0 + 4 + 1) MB, (4 + 4 + 1) MB or (4 + 4 + 4) MB of memory usage? Maybe even the ideal scenario of (0 + 0 + 1)?

With regard to patterns for using memory efficiently, is it sufficient to keep the parent environment as compact as possible? Are there clever ways to use local() for this?

Kind regards,
Shian

Martin Morgan

Sun, Jan 6, 2019 8:18 PM #