Calls to Boost interprocess / big.matrix
Hi Jay,

Following up on the previous email, I've found that I can mark the shared memory segment as read only when attaching the `big.matrix` objects. Unfortunately this has not solved the problem: the permutation procedure still runs very slowly when run on multiple cores on this cluster.

Regards,
Scott
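To illustrate what a read-only attachment means at the operating-system level, here is a minimal sketch in Python rather than R. It uses only the standard `mmap` module, not bigmemory or Boost interprocess, and the names `open_readonly_map` and `demo_readonly` are made up for the illustration.

```python
import mmap
import os
import tempfile


def open_readonly_map(path):
    """Map an existing, non-empty file read-only (hypothetical helper)."""
    with open(path, "rb") as fh:
        # ACCESS_READ requests a read-only mapping. mmap duplicates the file
        # descriptor, so the map stays valid after fh is closed.
        return mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)


def demo_readonly():
    # Build a small throwaway backing file (4096 bytes of sample data).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 16)
        path = tmp.name
    try:
        with open_readonly_map(path) as view:
            try:
                view[0] = 1          # writes through a read-only map fail
                writable = True
            except TypeError:
                writable = False
            return view[0], len(view), writable
    finally:
        os.remove(path)
```

A read-only mapping only restricts what the attaching process may do. Whether a given filesystem relaxes its coherency checks for such mappings depends on its mmap implementation, which would be consistent with the result reported above.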
On 20 May 2016 at 11:29, Scott Ritchie <sritchie73 at gmail.com> wrote:
Thanks so much Jay!

I suspect your speculation on mmap is likely the root cause of the issue! So far I've been exclusively running analyses with the package on our Ubuntu cluster, which does not have a job submission system, where it performs quite nicely and scales as you would expect as I add more cores (each machine has 80 cores).

The performance issues on the cluster with multiple nodes and a job submission system persist even when running the code on a few cores on the head node, i.e. when running the job interactively and without the job submission system / queue.

Were you able to find a workaround for the performance issues on the filesystem you described? I am not concerned with file synchronicity at all: the package never writes to the big.matrix objects. I'm wondering if there is some way to mark the segment of shared memory as read only for the duration of a function call so that the OS does not check for coherency while the permutation procedure is running.

I expect this might be an issue for most potential users of the package once it is released, since the job-based multi-node cluster setup is much more common than the free-for-all style cluster I've been working on.

Thanks,
Scott

On 19 May 2016 at 22:51, Jay Emerson <jayemerson at gmail.com> wrote:
Ritchie,

It sounds like you have already tested the code on an Ubuntu cluster and see the types of behavior/behaviour you expect: faster runtimes with increasing number of cores, etc... (as opposed to what you are seeing on the RedHat cluster)?

However: foreach with doMC can leverage shared memory, but is designed for single nodes of a cluster (as you probably know, doSNOW would be more elegant for distributing jobs on a cluster, but may not always be possible). A memory-mapped file provides a means of "sharing" a single object across nodes, and is kind of like "poor man's shared memory". It sounds like you are using a job submission system to distribute the work, and then foreach/doMC within nodes. This is fine and will work with bigmemory/foreach/doMC. But be careful in your testing to consider both performance using cores on a single node versus performance on a cluster with multiple nodes.

However, here's some speculation: it may have to do with the filesystem. In early testing, we tried the "newest and greatest" high-performance parallel filesystem on one of our clusters, and I don't even remember the specific details. Performance plummeted. The reason was that the mmap driver implemented for the filesystem was obsessed with maintaining coherency. Imagine: one node does some work and changes something, and that change needs to be reflected in the memory-mapped file as well as in RAM on other machines that have cached that element. It's pretty darn important (and a reason to consider a locking strategy via package synchronicity if you run concurrency risks in your algorithm). In any event, we think that the OS was checking coherency even upon _reads_ and not just _writes_. Huge traffic jams and extra work.

To help solve the puzzle, we used an old-school NFS partition on the same machine, and were back up to full speed in no time. You might give that a try if possible.

Jay
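One way to compare filesystems along the lines suggested above is to time how long it takes to fault in every page of the same backing file from each location. The following is a rough sketch in Python using only the standard `mmap` module. `time_page_touch` and `demo_timing` are hypothetical names, and it assumes the backing file is not empty.

```python
import mmap
import os
import tempfile
import time


def time_page_touch(path):
    """Return (pages_touched, seconds) for reading one byte from every page."""
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
        start = time.perf_counter()
        touched = 0
        for offset in range(0, len(view), mmap.PAGESIZE):
            view[offset]             # reading one byte faults the page in
            touched += 1
        return touched, time.perf_counter() - start


def demo_timing(directory=None):
    # Create a throwaway file of three full pages plus one byte in the
    # directory under test (the default temp directory if None).
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as tmp:
        tmp.write(b"\0" * (3 * mmap.PAGESIZE + 1))
        path = tmp.name
    try:
        return time_page_touch(path)
    finally:
        os.remove(path)
```

A freshly written file is usually still in the page cache, so `demo_timing` mainly shows the mechanics. A meaningful measurement would call `time_page_touch` on an existing backing file that has not been read recently, once per filesystem being compared.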
Message: 1
Date: Thu, 19 May 2016 18:05:44 +1000
From: Scott Ritchie <sritchie73 at gmail.com>
To: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
Subject: [R-sig-hpc] Calls to Boost interprocess / big.matrix extremely slow on RedHat cluster
Message-ID: <CAO1VBV3aFWRGMkT++9cg0kMzvraTqLR7+WLEKBYC0xJbAzM_aQ at mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Hi all,

Apologies in advance for the vagueness of the question, but I'm not sure where the source of my problem lies.

The crux of my problem is that an R package I have developed is running 100-1000x slower on a RedHat cluster in comparison to any other machine I have tested on (my Mac, an Ubuntu cluster).

The package uses the bigmemory package to store large matrices in shared memory, which are then accessed from parallel R sessions spawned from the foreach package using the doMC parallel backend. Calculations at each permutation are run in RcppArmadillo.

The main routine essentially does the following:

1. As input, take the file paths to multiple file-backed big.matrix objects.
2. Attach the big.matrix objects, and run some BLAS calculations on subsets within each matrix using RcppArmadillo code that I've written. These form the basis of several test statistics, comparing two big.matrix objects.
3. Run a permutation procedure, in which permutations are broken up in batches over multiple cores using the foreach package, and the doMC package as a parallel backend.
4. At each permutation, run BLAS calculations on the big.matrix objects which are stored in shared memory.

I've isolated the problem down to the calls to the `big.matrix` objects, which as I understand, utilise the Boost interprocess library (through the BH package):

1. On this particular server, there is huge variability in the time it takes to pull the data from the file-backed memory map into shared memory (e.g. just running [,] to return all elements as a regular matrix).
2. I can get the code to run very quickly in serial if I run some code prior to the BLAS calculations that, I think, loads the data from the file-map into shared memory. If I run some Rcpp code that runs through every element of the big.matrix and checks for NAs, then the subsequent calls to BLAS happen very quickly.
3. If I do not run the code that runs through every element of the `big.matrix`, the calls to the RcppArmadillo code take a very long time (in comparison to other machines).
4. I still have this problem when running the code in parallel: each permutation takes a very long time to compute. I have tried running the checkFinite code within each foreach loop with the aim of forcing the data into shared memory for each child process, but this does not solve my issue.
5. The runtime of the permutations seems to scale with the number of cores: the more cores I add, the longer the code takes to run. This does not happen on any other system.

To complicate matters, this server runs on a job submission system. However, I have the same issue when running the code in parallel on the head node.

I'm not sure if the problem is due to:

1. The way shared memory is set up on the server / OS
2. The way I'm interacting with the big.matrix objects in parallel

The versions of R, big.matrix, Rcpp, RcppArmadillo, BH, etc are all up to date on the server. The hardware on the cluster I am having issues with is better than the other machines I have tested on.

I would appreciate any thoughts on how to solve or isolate this problem.

Kind regards,

--
Scott Ritchie,
Ph.D. Student | Integrative Systems Biology | Pathology | http://www.inouyelab.org
The University of Melbourne
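The warm-up pass described in the message (walk every element and check for NAs before the BLAS calls) can be sketched as follows. This is a Python stand-in, not the package's Rcpp code: `count_non_finite` and `demo_scan` are hypothetical names, and the sketch assumes the backing file is a flat array of native 8-byte doubles whose length is a multiple of 8 bytes.

```python
import math
import mmap
import os
import tempfile
from array import array


def count_non_finite(path, chunk_values=65536):
    """Scan a file of native float64 values; count the NaN or infinite ones."""
    bad = 0
    step = chunk_values * 8          # bytes per chunk (8 bytes per double)
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
        for start in range(0, len(view), step):
            values = array("d")
            # Slicing copies the bytes out of the map, which touches every
            # page in the chunk as a side effect.
            values.frombytes(view[start:start + step])
            bad += sum(1 for v in values if not math.isfinite(v))
    return bad


def demo_scan():
    # Five sample values, two of which are not finite.
    data = array("d", [1.0, float("nan"), 2.5, float("inf"), -3.0])
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data.tobytes())
        path = tmp.name
    try:
        return count_non_finite(path)
    finally:
        os.remove(path)
```

Because the scan reads every page once, later passes over the same file in the same process tend to find the data already resident. R stores `NA_real_` as a NaN bit pattern, so a check of this kind counts missing values as well.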
_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc

End of R-sig-hpc Digest, Vol 88, Issue 9

--
John W. Emerson (Jay)
Associate Professor of Statistics, Adjunct, and Director of Graduate Studies
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay