Calls to Boost interprocess / big.matrix

3 messages · Jay Emerson, Scott Ritchie

#
Ritchie,

It sounds like you have already tested the code on an Ubuntu cluster and
see the behaviour you expect (faster runtimes with an increasing number of
cores, and so on), as opposed to what you are seeing on the RedHat
cluster?

However: foreach with doMC can leverage shared memory but is designed for
single nodes of a cluster (as you probably know, doSNOW would be more
elegant for distributing jobs across a cluster, but may not always be
possible).  A memory-mapped file provides a means of "sharing" a single
object across nodes, and is kind of like a "poor man's shared memory".  It
sounds like you are using a job submission system to distribute the work,
and then foreach/doMC within nodes.  This is fine and will work with
bigmemory/foreach/doMC.
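
For the archive, the single-node pattern described above might look
something like the following sketch (the file names and dimensions are
illustrative, not from Scott's package):

```r
library(bigmemory)
library(foreach)
library(doMC)

registerDoMC(cores = 4)

# Create a file-backed big.matrix; the backing and descriptor file
# names here are purely illustrative.
x <- filebacked.big.matrix(nrow = 1000, ncol = 100,
                           backingfile = "x.bin",
                           descriptorfile = "x.desc")
x[,] <- rnorm(1000 * 100)

# Each worker attaches the matrix via its descriptor file, so the
# data are shared through the memory-mapped file rather than copied
# into every process.
col.sums <- foreach(j = 1:ncol(x), .combine = c) %dopar% {
  xw <- attach.big.matrix("x.desc")
  sum(xw[, j])
}
```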

But be careful in your testing to consider both performance using cores on
a single node versus performance on a cluster with multiple nodes.

However, here's some speculation: it may have to do with the filesystem.
In early testing, we tried the "newest and greatest" high-performance
parallel filesystem on one of our clusters, and I don't even remember the
specific details.  Performance plummeted.  The reason was that the mmap
driver implemented for the filesystem was obsessed with maintaining
coherency.  Imagine: one node does some work and changes something, that
change needs to be reflected in the memory-mapped file as well as then up
in RAM on other machines that have cached that element in RAM.  It's pretty
darn important (and a reason to consider a locking strategy via package
synchronicity if you run concurrency risks in your algorithm).  In any
event, we think that the OS was checking coherency even upon _reads_ and
not just _writes_.  Huge traffic jams and extra work.
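
If concurrency is a risk in your algorithm, the locking strategy via
package synchronicity mentioned above is roughly this (a sketch; the
section being guarded is a placeholder):

```r
library(synchronicity)

# An inter-process mutex backed by Boost.Interprocess; its descriptor
# can be shared with workers the same way a big.matrix descriptor is.
m <- boost.mutex()

lock(m)      # blocks until the mutex is acquired
# ... modify the shared big.matrix here ...
unlock(m)    # release so other workers can proceed
```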

To help solve the puzzle, we used an old-school NFS partition on the same
machine, and were back up to full speed in no time.  You might give that a
try if possible.
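
Pointing the backing file at such a partition is just a matter of setting
backingpath when the matrix is created (a sketch; the mount point and
file names are illustrative):

```r
library(bigmemory)

# Place the backing and descriptor files on the NFS-mounted partition
# rather than the high-performance parallel filesystem.
x <- filebacked.big.matrix(nrow = 1000, ncol = 100,
                           backingfile = "x.bin",
                           descriptorfile = "x.desc",
                           backingpath = "/nfs/scratch")
```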

Jay

#
Thanks so much Jay!

I suspect your speculation about mmap coherency is the root cause of the issue!

So far I've been exclusively running analyses with the package on our
Ubuntu cluster, which does not have a job submission system, where it
performs quite nicely and scales as you would expect as I add more cores
(each machine has 80 cores).

The performance issues on the cluster with multiple nodes and a job
submission system persist even when running the code on a few cores on the
head node, i.e. when running the job interactively and without the job
submission system / queue.

Were you able to find a workaround for the performance issues on the
filesystem you described? I am not concerned about file synchronicity at
all: the package never writes to the big.matrix objects. I'm wondering if
there is some way to mark the segment of shared memory as read-only for
the duration of a function call so that the OS does not check for
coherency while the permutation procedure is running. I expect this might
be an issue for most potential users of the package once it is released,
since the job-based multi-node cluster setup is much more common than the
free-for-all style cluster I've been working on.

Thanks,

Scott
On 19 May 2016 at 22:51, Jay Emerson <jayemerson at gmail.com> wrote:

#
Hi Jay,

Following up on the previous email, I've found that I can mark the shared
memory segment as read-only when attaching the `big.matrix` objects.
Unfortunately this has not solved the problem: the permutation procedure
still runs very slowly when run on multiple cores on this cluster.
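
For anyone else reading the archive, the read-only attachment looks
something like this sketch (the descriptor file name is illustrative, and
whether a read-only mapping actually suppresses coherency checks depends
on the filesystem's mmap driver, not on bigmemory):

```r
library(bigmemory)

# Attach an existing file-backed matrix without write access.
x <- attach.big.matrix("x.desc", readonly = TRUE)

x[1, 1]          # reads work as usual
# x[1, 1] <- 0   # writes would now fail, since the mapping is read-only
```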

Regards,

Scott
On 20 May 2016 at 11:29, Scott Ritchie <sritchie73 at gmail.com> wrote: