
Unreproducible crashes of R instances on cluster running Torque

8 messages · Till Francke, Simon Urbanek, Rainer M Krug +1 more

#
Dear List,
I am a user of a Linux cluster running Torque.
I want to run very "embarrassingly parallel" R jobs (no worker interaction,
no MPI/multicore, just simple replicates of a script with different
arguments). Whenever I submit more than ~30 of these, I encounter
problems: some jobs run fine, others terminate with R messages about
memory allocation problems, or even finish without further output,
sometimes crashing a node of the cluster. Any of these scripts runs fine
when started alone.
My admin suggests this is a memory leak in R; however, I wonder whether,
even if that were the case, it should stall the cluster.
Could anyone give me some advice on how to address this, please?

Thanks,

Till


Scientific Linux SL release 5.5 (Boron)
Linux head 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:26:03 EST 2013 x86_64  
x86_64 x86_64 GNU/Linux
R version 2.12.1 (2010-12-16)
#
On Thu, May 2, 2013 at 5:14 AM, Till Francke <win at comets.de> wrote:
Hi, Till.

You describe several problems rather vaguely, but I would suspect that
your problems are related to memory use of user code and not to a
memory leak in R.  R messages about "memory allocation problems"
usually mean that your code is asking for more memory than is
available on the machine.  By "crashing a node of the cluster", I
suspect you mean that the machine becomes unreachable; this is often
due to the machine swapping large blocks of memory (again, a memory
issue in user code).  The scripts will run fine when enough memory is
available.  So, to deal with your problem, monitor memory usage on
running jobs and follow good programming policies regarding memory
usage.  Request larger memory resources if that is an option.  It is
possible that R has a memory leak, but it is rather unlikely this is
the problem.

If you still have issues, you may want to provide some error messages
and some sessionInfo() as well as some measure of memory usage.

Sean
#
Dear Sean,
thanks for your suggestions in spite of my obscure descriptions. I'll try  
to clarify some points:
I get things like
	Error: cannot allocate vector of size 304.6 Mb
However, the jobs are started with the Torque option
	#PBS -l mem=3gb
When I submit this job alone, everything works like a charm, so 3 GB seem
to suffice, right? With 20 or more jobs, I get the memory message. I
assumed Torque would only start a job if the resources are available; is
that a misconception?
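For context, a minimal sketch of such a Torque submission script, assuming the setup described above (job name, script path, and limits are placeholders; note that `mem` is the total memory for the whole job, not per process):

```shell
#!/bin/bash
# Torque/PBS job script (sketch; names and limits are placeholders).
#PBS -N r-replicate
#PBS -l mem=3gb
#PBS -l walltime=02:00:00

cd "$PBS_O_WORKDIR"
Rscript my_analysis.R arg1
```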
I cannot tell more precisely; the admin just told me he had to reboot this  
node. Before that, the entire queue-handling of Torque seemed to have come  
to a halt.
If that means being frugal - removing unused objects and preallocating
matrices - I've tried my best. Adding some calls to gc() seemed to improve
the situation only slightly.
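For the record, the practices mentioned look roughly like this in R (object names are placeholders):

```r
# Preallocate instead of growing objects inside loops:
n <- 1e5
x <- numeric(n)                       # allocate once
for (i in seq_len(n)) x[i] <- i^2     # rather than x <- c(x, i^2)

# Remove large intermediates explicitly, then collect:
big_tmp <- matrix(0, 1000, 1000)      # stand-in for a large intermediate
rm(big_tmp)
gc()
```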
For the memory issue, the message above is thrown. For other jobs, the
process just terminates without further output, right after having read
some large input files.
I agree that an R memory leak is unlikely; still, I am trying to find out
what I can do from my side, or whether I can point the admin at some
Torque configuration problem, which is what I suspect.
Has anyone observed similar behaviour and knows a fix?

Thanks in advance,
Till


R version 2.12.1 (2010-12-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] graphics  grDevices datasets  stats     utils     methods   base

other attached packages:
[1] Rmpi_0.5-9
#
On May 2, 2013, at 9:46 AM, Till Francke wrote:

No, the 300 MB are *in addition* to all other memory already allocated by R - probably very close to the 3 GB. Also note that mem is the total memory overall, not per process, so some processes may get very little (I don't use Torque, though, so this is just based on the docs).
R does gc automatically when it's running out of memory, so that makes no real difference. Sometimes it's useful to code in local scope so objects can be collected automatically, but that's all very application-specific.
It's very easy to run out of memory with parallel jobs. In particular, if you don't share data across the jobs, you'll end up using a lot of memory. People underestimate that aspect even though the math is simple: 128 GB of RAM sounds like a lot, but if you run 40 jobs, you end up with only ~3 GB per job, which is likely not enough (at least not for the jobs I'm running ;)). Note that things like parsing an input file can use quite a bit of memory - it's usually a good idea to run a pre-processing step that parses the raw files into binary objects or RData files, which can be loaded much more efficiently.
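A sketch of that pre-processing idea in R (file and object names are placeholders):

```r
# One-off pre-processing: parse the large text input once and store it
# as an RData file.
big <- read.table("huge_input.txt", header = TRUE)
save(big, file = "huge_input.RData")

# In each parallel job, load the binary version instead of re-parsing:
load("huge_input.RData")   # restores 'big'; much faster and leaner
```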

Anyway, first run just one job and watch its memory usage to see how it works. Linux typically cannot reclaim much memory back, so when it's done you should see roughly the physical memory footprint.
Geeez... I didn't know such ancient versions still existed in the wild =)

Cheers,
Simon
#
Hi, Till.

See below.
On Thu, May 2, 2013 at 9:46 AM, Till Francke <win at comets.de> wrote:
Torque will start a job if it THINKS there is memory available.  If
you have told Torque that your job needs 3 GB and it actually uses 6 GB,
Torque will not (typically) know that.  If a node has 16 GB of RAM,
Torque may try to put five 3 GB jobs on that node, and if each is really
using 6 GB, you can see how problems arise.  So what you are describing
seems consistent with a job not having enough memory; "cannot allocate
vector..." is an out-of-memory error in R.

If you can ssh into the nodes while jobs are running, you can run
"top" to see memory usage for each process.  If you cannot do so,
double the mem request anyway.
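As a sketch of that monitoring step (assuming Linux procps `ps` on the nodes), one can sample the resident set size of all R processes:

```shell
# Sample the resident set size (RSS, in kB) of every process named "R".
# Three quick samples here; in practice use a longer interval (e.g. 60 s)
# for the lifetime of the job.
for i in 1 2 3; do
    ps -o pid=,rss=,comm= -C R || true   # no R process running is not an error
    sleep 1
done
```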
I'm not sure how hanging a node would halt an entire Torque cluster
unless the scheduler is running on a worker node (generally not a good
idea, but sometimes necessary to reduce cost).  However, having R hang
a node is a relatively common occurrence on clusters with limited node
memory relative to typical workloads.  I suspect that the memory
issues are related.  Again, I'd monitor memory usage in running
processes to make sure that you guess correctly.  For a shortcut,
simply double your Torque memory request to see if the issue is
resolved.
Yes, you'll need to be careful to remove unused objects (using rm())
in addition to gc().  At the end of the day, though, you may just need
more resources as I noted above.
You (or your admin) should have logs from the cluster that might be useful.
I do not really suspect Torque configuration problems though I cannot
rule them out.  "Crashing" a node on the cluster by trying to allocate
large blocks of memory and then swapping is, in my experience, a
not-too-uncommon event.
This is unrelated, but you should get your admin to update to a newer
version of R.  This version is 2+ years old.

Sean
10 days later
#
Simon Urbanek <simon.urbanek at r-project.org> writes:
If I remember correctly, memory fragmentation plays an important role
for R (still in version 3.0.0?), so that one contiguous memory block
needs to be available to be used - otherwise one can get these error
messages even if enough memory is available, but fragmented into smaller
blocks (or does Torque take care of memory fragmentation?).
Thanks for this discussion - these are exactly the symptoms I
experienced and could not make sense of (i.e. crashing R sessions on the
cluster, hanging nodes which needed to be restarted to work again), as
I had assumed that Torque would protect the node from crashing due to
excessive memory usage.

One point is mentioned here again and again: monitor memory usage. But
is there an easy way to do this? Can I submit a script to Torque and get
back a memory report in a log file, which I can then analyse to get
memory usage over time?

Rainer

#
On Mon, May 13, 2013 at 4:08 AM, Rainer M. Krug <Rainer at krugs.de> wrote:
Torque is a batch system.  The underlying OS (typically Linux) is
responsible for memory management.
Some clusters do have something in place to try to do this, but it is
not a simple task to implement well, since Torque is not really
"responsible" for memory management once a job is running.
You will probably need to talk to your cluster admins, but on our
cluster I simply log in to a node and run "top".  Other clusters have
dedicated monitoring tools.  Finally, some clusters have configured a
job postscript that reports on job resource usage.  All of these
issues are best dealt with by talking to the cluster administrators,
since each cluster (even those running Torque) is unique in some
ways.
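One hand-rolled sketch of the memory report Rainer asked about (this is not a Torque feature, just a hypothetical wrapper script; the R script name is a placeholder):

```shell
#!/bin/bash
# Submit this wrapper instead of the R script itself: it runs the job
# and appends the job's RSS (kB) to mem.log once per minute.
Rscript my_analysis.R "$@" &
pid=$!
while kill -0 "$pid" 2>/dev/null; do
    ps -o rss= -p "$pid" >> mem.log
    sleep 60
done
wait "$pid"
```

The resulting mem.log can then be summarised or plotted to see memory usage over time.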

Sean
#
Sean Davis <sdavis2 at mail.nih.gov> writes:
True - makes sense.
Yes - there is always the system-level approach. I was thinking more
along the lines of the R approach - something like using R's memory
profiling (which I haven't used yet).

The advantage would be that one could (depending on the simulation) run
it once locally and get the memory requirements.
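A hedged sketch of that idea, using the "max used" counters that gc() reports (the script name is a placeholder):

```r
gc(reset = TRUE)            # reset the "max used" statistics
source("my_simulation.R")   # run the job once locally
gc()                        # the "max used" column now shows the peak
                            # Ncells/Vcells since the reset
```

For allocation-level detail there is also Rprofmem(), but it only works if R was compiled with --enable-memory-profiling.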

Rainer