Thanks so much Jay!
I suspect your speculation on mmap is likely the root cause of the issue!
So far I've been exclusively running analyses with the package on our
Ubuntu cluster, which does not have a job submission system, where it
performs quite nicely and scales as you would expect as I add more cores
(each machine has 80 cores).
The performance issues on the cluster with multiple nodes and a job
submission system persist even when running the code on a few cores on
the head node - i.e. when running the job interactively and without the
job submission system / queue.
Were you able to find a workaround for the performance issues on the
filesystem you described? I am not concerned with file synchronicity at
all: the package never writes to the big.matrix objects. I'm wondering if
there is some way to mark the segment of shared memory as read-only for
the duration of a function call so that the OS does not check for
coherency while the permutation procedure is running.
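
To make the read-only idea concrete, here is a minimal sketch in Python
(not R, and not bigmemory's API) of mapping a file read-only at the mmap
level. The helper name open_readonly_map and the throwaway demo file are
hypothetical stand-ins for a big.matrix backing file. Whether a read-only
mapping actually stops a given filesystem driver from checking coherency
is an open question; this only shows what the request looks like.

```python
import mmap
import os
import tempfile


def open_readonly_map(path):
    """Map the file at `path` read-only; returns (file object, mmap object)."""
    f = open(path, "rb")
    # ACCESS_READ asks the OS for a read-only mapping of the whole file.
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, m


# Throwaway file standing in for a big.matrix backing file (assumption).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 16)
    demo_path = tmp.name

f, m = open_readonly_map(demo_path)
first_byte = m[0]  # reads work as usual

try:
    m[0] = 1  # any write through a read-only mapping is refused
    write_rejected = False
except TypeError:
    write_rejected = True

m.close()
f.close()
os.remove(demo_path)
```

The mapping is requested once, up front, so every later access goes
through a region the OS already knows cannot be modified by this process.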
I expect this might be an issue for most potential users of the package
once it is released, since the job-based multi-node cluster setup is much
more common than the free-for-all style cluster I've been working on.
Thanks,
Scott
On 19 May 2016 at 22:51, Jay Emerson <jayemerson at gmail.com> wrote:
Ritchie,
It sounds like you have already tested the code on an Ubuntu cluster and
see the types of behavior/behaviour you expect: faster runtimes with
increasing number of cores, etc... (as opposed to what you are seeing on
the RedHat cluster)?
However: foreach with doMC can leverage shared memory but is designed for
single nodes of a cluster (as you probably know, doSNOW would be more
elegant for distributing jobs on a cluster, but may not always be
possible). A memory-mapped file provides a means of "sharing" a single
object across nodes, and is kind of like "poor man's shared memory". It
sounds like you are using a job submission system to distribute the work,
and then foreach/doMC within nodes. This is fine and will work with
bigmemory/foreach/doMC.
But be careful in your testing to consider both performance using cores on
a single node versus performance on a cluster with multiple nodes.
However, here's some speculation: it may have to do with the filesystem.
In early testing, we tried the "newest and greatest" high-performance
parallel filesystem on one of our clusters, and I don't even remember the
specific details. Performance plummeted. The reason was that the mmap
driver implemented for the filesystem was obsessed with maintaining
coherency. Imagine: one node does some work and changes something, and
that change needs to be reflected in the memory-mapped file as well as in
RAM on any other machines that have cached that element. It's pretty
darn important (and a reason to consider a locking strategy via package
synchronicity if you run concurrency risks in your algorithm). In any
event, we think that the OS was checking coherency even upon _reads_ and
not just _writes_. Huge traffic jams and extra work.
To help solve the puzzle, we used an old-school NFS partition on the same
machine, and were back up to full speed in no time. You might give that a
try if possible.
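
One rough way to run that comparison is to time a full read through a
memory map for a scratch file placed on each filesystem. The sketch below
is in Python rather than R, and the function name time_mmap_read is
hypothetical. It only measures the system temp directory as written; the
directories on the parallel filesystem and the NFS partition are
site-specific and would need to be added to `candidates`. Note that a
freshly written file may still sit in the page cache, so a fair test
should use a file that other nodes or earlier jobs wrote.

```python
import mmap
import os
import tempfile
import time


def time_mmap_read(directory, size_bytes=4 * 1024 * 1024):
    """Write a scratch file in `directory`, then time one full mmap read.

    Returns (elapsed seconds, number of bytes read).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(os.urandom(size_bytes))
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                start = time.perf_counter()
                data = m[:]  # copies every byte out of the mapping
                elapsed = time.perf_counter() - start
        return elapsed, len(data)
    finally:
        os.remove(path)


# Only the local temp directory is known here; add the other mount points.
candidates = {"local tmp": tempfile.gettempdir()}
for label, directory in candidates.items():
    elapsed, n_bytes = time_mmap_read(directory)
    print(f"{label}: read {n_bytes} bytes in {elapsed:.4f} s")
```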
Jay
Message: 1
Date: Thu, 19 May 2016 18:05:44 +1000
From: Scott Ritchie <sritchie73 at gmail.com>
To: "r-sig-hpc at r-project.org" <r-sig-hpc at r-project.org>
Subject: [R-sig-hpc] Calls to Boost interprocess / big.matrix
extremely slow on RedHat cluster
Message-ID: <CAO1VBV3aFWRGMkT++9cg0kMzvraTqLR7+WLEKBYC0xJbAzM_aQ at mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Hi all,
Apologies in advance for the vagueness of the question, but I'm not sure
where the source of my problem lies.
The crux of my problem is that an R package I have developed is running
100-1000x slower on a RedHat cluster in comparison to any other machine I
have tested on (my Mac, an Ubuntu cluster).
The package uses the bigmemory package to store large matrices in shared
memory, which are then accessed from parallel R sessions spawned by the
foreach package using the doMC parallel backend. Calculations at each
permutation are run in RcppArmadillo.
The main routine essentially does the following:
1. As input, take the file paths to multiple file-backed big.matrix
   objects.
2. Attach the big.matrix objects, and run some BLAS calculations on
   subsets within each matrix using RcppArmadillo code that I've written.
   These form the basis of several test statistics, comparing two
   big.matrix objects.
3. Run a permutation procedure, in which permutations are broken up into
   batches over multiple cores using the foreach package, and the doMC
   package as a parallel backend.
4. At each permutation, run BLAS calculations on the big.matrix objects,
   which are stored in shared memory.
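
As an illustration of step 3 only, here is a small sketch (in Python, not
the package's actual R code) of splitting permutations into one batch per
core. The function name split_into_batches and the even-split strategy
are assumptions; the package may divide the work differently.

```python
def split_into_batches(n_perm, n_cores):
    """Split permutation indices 0..n_perm-1 into n_cores contiguous batches.

    Batch sizes differ by at most one, so no core gets much more work.
    """
    base, extra = divmod(n_perm, n_cores)
    batches = []
    start = 0
    for core in range(n_cores):
        # The first `extra` cores take one additional permutation each.
        size = base + (1 if core < extra else 0)
        batches.append(range(start, start + size))
        start += size
    return batches


batches = split_into_batches(10, 4)
print([list(b) for b in batches])
# prints [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Each worker would then loop over its own batch, reading from the shared
matrices without writing to them.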
I've isolated the problem down to the calls to the `big.matrix` objects,
which, as I understand it, utilise the Boost interprocess library (through
the BH package):
1. On this particular server, there is huge variability in the time it
   takes to pull the data from the file-backed memory map into shared
   memory (e.g. just running [,] to return all elements as a regular
   matrix).
2. I can get the code to run very quickly in serial if I run some code
   prior to the BLAS calculations that, I think, loads the data from the
   file-map into shared memory. If I run some Rcpp code that runs through
   every element of the big.matrix and checks for NAs, then the
   calls to BLAS happen very quickly.
3. If I do not run the code that runs through every element of the
   `big.matrix`, the calls to the RcppArmadillo code take a very long time
   (in comparison to other machines).
4. I still have this problem when running the code in parallel: each
   permutation takes a very long time to compute. I have tried running the
   checkFinite code within each foreach loop with the aim of forcing the
   data into shared memory for each child process, but this does not solve
   my issue.
5. The runtime of the permutations seems to scale with the number of
   cores: the more cores I add, the longer the code takes to run. This
   does not happen on any other system.
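
The warm-up effect described in points 2 and 3 can be sketched outside R
as well. The Python snippet below maps a file read-only and reads one
byte from every page, which forces the OS to page the whole mapping in
before the real work starts. The function name prefault and the demo file
are hypothetical, and this is not the checkFinite code itself. As point 4
says, doing this in each child process did not fix the parallel case on
the affected server.

```python
import mmap
import os
import tempfile


def prefault(m):
    """Read one byte from every page of mapping `m`; return pages touched."""
    pages = 0
    for offset in range(0, len(m), mmap.PAGESIZE):
        _ = m[offset]  # the read is what triggers the page-in
        pages += 1
    return pages


# Demo file of exactly three pages, standing in for a backing file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * (3 * mmap.PAGESIZE))
    demo_path = tmp.name

with open(demo_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        pages_touched = prefault(m)

os.remove(demo_path)
```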
To complicate matters, this server runs on a job submission system.
However, I have the same issue when running the code in parallel on the
head node.
I'm not sure if the problem is due to:
1. The way shared memory is set up on the server / OS
2. The way I'm interacting with the big.matrix objects in parallel
The versions of R, big.matrix, Rcpp, RcppArmadillo, BH, etc. are all up to
date on the server. The hardware on the cluster I am having issues with is
better than the other machines I have tested on.
I would appreciate any thoughts on how to solve or isolate this problem.
Kind regards,
--
Scott Ritchie,
Ph.D. Student | Integrative Systems Biology | Pathology |
http://www.inouyelab.org
The University of Melbourne