
Force quitting a FORK cluster node on macOS and Solaris wreaks havoc

3 messages · Henrik Bengtsson, Simon Urbanek

The following smells like a bug in R to me, because it puts the main R
session into an unstable state.  Consider the following R script:

a <- 42
message("a=", a)
cl <- parallel::makeCluster(1L, type="FORK")
try(parallel::clusterEvalQ(cl, quit(save="no")))
message("parallel:::isChild()=", parallel:::isChild())
message("a=", a)
rm(a)

The purpose of this was to emulate what happens when a parallel
worker crashes.

Now, if you source() the above on macOS, you might(*) end up with:
a=42
Error: Error in unserialize(node$con) : error reading from connection
parallel:::isChild()=FALSE
a=42
Error: Error in unserialize(node$con) : error reading from connection
parallel:::isChild()=FALSE
Error: Error in message("a=", a) : object 'a' not found
Execution halted

Note how 'rm(a)' is supposed to be the last line of code to be
evaluated.  However, the force quitting of the FORK cluster node
appears to result in the main code being evaluated twice (in
parallel?).

(*) This does not happen on all macOS variants. For example, it works
fine on CRAN's 'r-release-macos-x86_64' but it does give the above
behavior on 'r-release-macos-arm64'.  I can reproduce it on GitHub
Actions (https://github.com/HenrikBengtsson/teeny/runs/3309235106?check_suite_focus=true#step:10:219)
but not on R-hub's 'macos-highsierra-release' and
'macos-highsierra-release-cran'.  I can also reproduce it on R-hub's
'solaris-x86-patched' and 'solaris-x86-patched-ods' machines.  However,
I still haven't found a Linux machine where this happens.

If one replaces quit(save="no") with tools::pskill(Sys.getpid()) or
parallel:::mcexit(0L), this behavior does not take place (at least not
on GitHub Actions and R-hub).
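
For reference, a minimal sketch (assuming a Unix-alike where FORK
clusters are available) of emulating a crashed worker with
tools::pskill() instead of quit():

```r
## Emulate a crashed FORK worker by hard-killing it with SIGKILL.
## Unlike quit(), pskill() terminates the child immediately without
## running R's clean-up code, so the master's state is untouched.
cl <- parallel::makeCluster(1L, type = "FORK")
res <- try(parallel::clusterEvalQ(cl, tools::pskill(Sys.getpid())),
           silent = TRUE)
inherits(res, "try-error")         # TRUE: the master sees a connection error
parallel:::isChild()               # FALSE: we are still the original master
try(parallel::stopCluster(cl), silent = TRUE)  # dead node; ignore errors
```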

I don't have access to a macOS or a Solaris machine, so I cannot
investigate further myself. For example, is it an issue with quit()
specifically, or can the behavior be triggered by other means? And,
more importantly, should this be fixed? I'd also be curious what
happens if you run the above in an interactive R session.

/Henrik
Henrik,

I'm not quite sure I understand the report to be honest.

Just a quick comment here - using quit() in a forked child is not allowed, because R's clean-up is only intended for the master: it blows away the master's state, connections, and working directory, runs the master's exit handlers etc. That's why the children have to use either abort or mcexit() to terminate - which is what mcparallel() does. If you use q(), a lot of things go wrong no matter the platform - e.g. try using ? in the master session after sourcing your code.
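
To illustrate the contrast (a sketch, on a Unix-alike; note that
parallel:::mcexit() is internal API, used here only for discussion):

```r
## mcparallel() arranges for the forked child to exit via the internal
## parallel:::mcexit(), so none of the master's clean-up (connections,
## exit handlers, working directory) runs in the child.
f <- parallel::mcparallel(42)
parallel::mccollect(f)   # a list holding the child's value, 42

## A child that dies without delivering a result (here via SIGKILL)
## only yields a warning in the master; the master itself is unharmed.
g <- parallel::mcparallel(tools::pskill(Sys.getpid()))
parallel::mccollect(g)   # NULL plus "1 parallel job did not deliver a result"
```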

Cheers,
Simon
3 days later
Thank you Simon, this is helpful.  I take it this is specific to
quit(), so it's a poor choice for emulating crashed parallel workers,
and tools::pskill() is much better for that.

I was focusing on that odd extra execution/output, but as you say,
there are lots of other things that are done by quit() here, e.g.
regardless of platform, quit() damages the main R process too:
Warning message:
In parallel::mccollect(f) : 1 parallel job did not deliver a result
[1] FALSE


Would it be sufficient to make quit() fork-safe by, conceptually,
doing something like:

quit <- function(save = "default", status = 0, runLast = TRUE) {
  ## Refuse to run the master's clean-up inside a forked child
  if (parallel:::isChild())
    stop("quit() must not be called in a forked process")
  .Internal(quit(save, status, runLast))
}

This would protect against calling quit() in forked code by mistake,
e.g. when someone parallelizes over code/scripts they don't fully
control, and the authors of those scripts might not be aware that
they may be used in forks.
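
To sketch the effect of such a guard (assuming a Unix-alike;
mcparallel() returns errors raised in the child as 'try-error'
objects when collected):

```r
## The conceptual guard from above, defined in the global environment
## so forked children (which inherit the master's environment) see it.
quit <- function(save = "default", status = 0, runLast = TRUE) {
  if (parallel:::isChild())
    stop("quit() must not be called in a forked process")
  .Internal(quit(save, status, runLast))
}

## A stray quit() in a forked child now comes back as an ordinary
## error, instead of running the master's clean-up in the child.
f <- parallel::mcparallel(quit(save = "no"))
res <- parallel::mccollect(f)
inherits(res[[1L]], "try-error")  # TRUE: the master survives intact
```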

Thanks,

Henrik