
Unreproducible crashes of R-instances on cluster running Torque

Hi, Till.

See below.
On Thu, May 2, 2013 at 9:46 AM, Till Francke <win at comets.de> wrote:
Torque will start a job if it THINKS there is memory available.  If
you have told Torque that your job needs 3gb but it actually uses 6gb,
Torque will (typically) not know that.  If a node has 16gb of RAM,
Torque may put five 3gb jobs on the node, and if each is really using
6gb, you can see how problems arise.  What you are describing is
therefore consistent with a job not having enough memory; "cannot
allocate vector..." is R's out-of-memory error.

If you can ssh into the nodes while jobs are running, you can run
"top" to see memory usage for each process.  If you cannot do so,
double the mem request anyway.
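If interactive top is awkward, a one-shot snapshot works too; this is a sketch (the node name and column layout are assumptions, adjust for your ps):

```shell
# On a worker node, list processes sorted by resident memory (RSS,
# column 6 of `ps aux`, in KiB), highest consumers first.
ps aux | sort -k6 -nr | head -n 10
```

Run it via ssh (e.g. `ssh node07 'ps aux | sort -k6 -nr | head'`) while your jobs are active and compare the RSS of the R processes against what you requested from Torque.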
I'm not sure how hanging a node would halt an entire Torque cluster
unless the scheduler is running on a worker node (generally not a good
idea, but sometimes necessary to reduce cost).  However, having R hang
a node is a relatively common occurrence on clusters with limited node
memory relative to typical workloads.  I suspect that the memory
issues are related.  Again, I'd monitor memory usage in running
processes to make sure that you guess correctly.  For a shortcut,
simply double your Torque memory request to see if the issue is
resolved.
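As a sketch of that shortcut (the 3gb figure and script name are assumptions; substitute your actual request):

```shell
# Original submission (hypothetical):
#   qsub -l nodes=1:ppn=1,mem=3gb myjob.sh
# Doubled, to test whether the crashes are memory-related:
qsub -l nodes=1:ppn=1,mem=6gb,vmem=6gb myjob.sh
```

If the crashes stop with the doubled request, that's strong evidence the jobs were exceeding their declared memory.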
Yes, you'll need to be careful to remove unused objects (using rm())
in addition to calling gc().  At the end of the day, though, you may
just need more resources, as I noted above.
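A minimal sketch of that pattern in R (the object and its size are illustrative only):

```r
# Allocate a large temporary object (~190 MB of doubles).
x <- matrix(0, 5000, 5000)
# ... compute with x ...
rm(x)   # drop the only reference so the memory becomes reclaimable
gc()    # run the garbage collector; R may then return pages to the OS
```

Note that gc() without a preceding rm() cannot free an object that is still referenced, which is why both are needed.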
You (or your admin) should have logs from the cluster that might be useful.
I do not really suspect Torque configuration problems though I cannot
rule them out.  "Crashing" a node on the cluster by trying to allocate
large blocks of memory and then swapping is, in my experience, a
not-too-uncommon event.
This is unrelated, but you should ask your admin to update to a newer
version of R; the one you are running is more than two years old.

Sean