With 20 or more jobs, I get the memory message. I assumed Torque
would only start a job if the resources are available; is that a
misconception?
By "crashing a node of the cluster", I
suspect you mean that the machine becomes unreachable; this is often
due to the machine swapping large blocks of memory (again, a memory
issue in user code).
I cannot tell more precisely; the admin just told me he had to
reboot this node. Before that, the entire queue handling of Torque
seemed to have come to a halt.
The scripts will run fine when enough memory is
available. So, to deal with your problem, monitor memory usage on
running jobs and follow good programming policies regarding memory
usage.
If that means being frugal with memory, removing unused objects, and
preallocating matrices, I've tried my best. Adding some calls to
gc() seemed to improve the situation only slightly.
R runs garbage collection automatically when it is running out of memory,
so explicit calls to gc() make no real difference. Sometimes it's useful
to code in a local scope so objects can be collected automatically, but
that's all very application-specific.
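As a minimal sketch of the local-scope idea (the object names here are hypothetical): wrapping a memory-heavy step in local() means its temporaries go out of scope as soon as the block returns, so the collector can reclaim them without an explicit gc() call.

```r
# Only the small summary vector escapes the local() block; the large
# temporary 'big' becomes unreachable once the block returns.
result <- local({
  big <- matrix(rnorm(1e6), nrow = 1000)  # large temporary (~8 MB)
  colSums(big)                            # small result that escapes
})
# 'big' does not exist here and its memory is collectable.
```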
Request larger memory resources if that is an option. It is
possible that R has a memory leak, but it is rather unlikely that this is
the problem. If you still have issues, you may want to provide some error
messages and the output of sessionInfo(), as well as some measure of
memory usage.
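A hedged sketch of how a job could record those diagnostics itself, so they survive even if the process is later killed (the file name "job_diagnostics.txt" is hypothetical):

```r
# Write sessionInfo() and a rough R heap measure to a log file
# at the start of the job, before any heavy computation runs.
con <- file("job_diagnostics.txt", "w")
writeLines(capture.output(sessionInfo()), con)  # R version, packages
writeLines(capture.output(gc()), con)           # Ncells/Vcells usage
close(con)
```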
For the memory issue, the message above is thrown. For other jobs, the
process just terminates without any further output, right after having
read some large input files.
I agree that this is unlikely to be an R memory leak; however, I am trying
to find out what I can still do from my side, or whether I can point the
admin at some Torque configuration problems, which is what I
suspect.
Has anyone observed similar behaviour and knows a fix?
It's very easy to run out of memory with parallel jobs. In particular,
if you don't share data across the jobs, you'll end up using a lot of
memory. People underestimate that aspect even though the math is
simple: if you have, say, 128GB of RAM, which sounds like a lot,
but run 40 jobs, you'll end up with only ~3GB per job, which is likely
not enough (at least not for the jobs I'm running ;)). Note that things
like parsing an input file can use quite a bit of memory; it's
usually a good idea to run a pre-processing step that parses raw
files into binary objects or RData files, which can be loaded much more
efficiently.
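A sketch of that pre-processing idea, using saveRDS()/readRDS() rather than RData files (the function and file names are hypothetical): each text input is parsed once and cached as a compact binary object that the parallel jobs can then load cheaply.

```r
# Parse a text input file once, cache it as a binary .rds file,
# and serve the cached copy on all subsequent calls.
parse_and_cache <- function(txt_file) {
  rds_file <- sub("\\.txt$", ".rds", txt_file)
  if (!file.exists(rds_file)) {
    dat <- read.table(txt_file, header = TRUE)  # expensive parse, done once
    saveRDS(dat, rds_file)                      # compact binary cache
  }
  readRDS(rds_file)                             # fast, low-overhead load
}
```

In a cluster setting, the pre-processing step would run as a single job before the parallel jobs are submitted, so only one process pays the parsing cost.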
Thanks for this discussion, because these are exactly the symptoms I
experienced and could not make sense of (i.e., crashing R sessions on the
cluster, and hanging nodes that needed to be restarted to work again), as
I had assumed that Torque would protect the node from crashing due to
excessive memory usage.
Some clusters do have something in place to try to do this, but it is
not a simple task to implement well since Torque is not really
"responsible" for memory management once a job is running.
One point is mentioned here again and again: monitor memory usage. But
is there an easy way to do this? Can I submit a script to Torque and get
back a memory report in a log file, which I can analyse to get memory
usage over time?
You will probably need to talk to your cluster admins, but on our
cluster, I simply log in to a node and run "top". Other clusters have
dedicated monitoring tools. Finally, some clusters have configured a
job postscript that reports on job resource usage. All of these
issues are best dealt with by talking with the cluster administrators,
since each cluster (even those running Torque) is unique in some
ways.
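If no cluster-level tooling is available, a crude do-it-yourself alternative is to have the R job log its own heap usage at checkpoints. This is only a hedged sketch (the function and log-file names are hypothetical), and it measures R's heap as reported by gc(), not the full process footprint:

```r
# Append a timestamped snapshot of R's heap usage (in Mb) to a log file.
# Call this at checkpoints in the job script to get usage over time.
log_memory <- function(logfile = "memory_log.txt", tag = "") {
  used_mb <- sum(gc()[, 2])  # column 2 of gc()'s matrix: used Mb
                             # (Ncells + Vcells rows summed)
  cat(format(Sys.time()), tag, used_mb, "Mb\n",
      file = logfile, append = TRUE)
}
```

Analysing the resulting log then shows roughly where in the job memory use grows, which helps decide between fixing the code and requesting larger resources.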