Skip to content

error in parallel:::sendMaster

6 messages · Tomas Kalibera, Andreas Kersting

#
Hi,

I am facing a very weird problem with parallel::mclapply. I have a script which does some data wrangling on an input dataset in parallel and then writes the results to disk. I have been using this script daily for more than one year always on an EC2 instance launched from the same AMI (no updates installed after launch) and processed thousands of different input data sets successfully. I now have an input dataset for which I face the following bug:

The basic outline of the problematic section of the script:

# parts is a data.table with 88 rows
mc_ret <- parallel::mclapply(sample.int(nrow(parts)), function(i) {
  # do some data wrangling and write the result to a file
  # ...

  print(paste0("part ", i, " written successfully."))
  return(TRUE)
}, mc.preschedule = FALSE, mc.cores = 2L)

str(mc_ret)


Expected output: "part i written successfully." is printed 88 times, once for each value of i. mc_ret is a list of length 88, each element being TRUE. Its structure is printed once. All outputs are created successfully.

Actual output (see end of the message): "part i written successfully." is printed 88 times, once for each value of i. mc_ret is a list of length 88, each element being TRUE. Its structure is printed. All outputs are created successfully. So far so good.

But then "part i written successfully." it is printed another X times, for values of i for which it was already printed. This output is intermingled with X-1 times the following error message:

Error in sendMaster(try(eval(expr, env), silent = TRUE)) :
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster

and Y times the message "Execution halted". mc_ret is printed again, now being a list of length 85, with the first element being TRUE and all other elements being NULL. X and Y vary from run to run.


Now to the main problem: I tried very hard to create a reproducible example, but I failed. What I observed:
- The output is (and has always been) written to path which is on an NFS share. If I instead write to a path on a local disk it will work.
- The script is invoked using Rscript. If I instead source it from an interactive R session it works. There are at least two more people who have observed this: https://stackoverflow.com/questions/51986674/mclapply-sendmaster-error-only-with-rscript
- Before the call to mclapply the code acquires an exclusive file lock on a dedicated lock file, not written to but also on the NFS share. If I remove the code acquiring the lock, the whole script will also work if called using Rscript.
- The problem also occurs for mc.preschedule = TRUE.
- There is no error if I set mc.cores to 1.
- And stressing again: the code works without any changes from Rscript for thousands of other data sets.


Rscript -e "sessionInfo()":
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.2


I know this is a fairly old R version. I have not been able to reproduce the bug with a more recent version, but since it is so difficult to trigger, this does not mean much, I guess. I have looked through the changes made to the code of mclapply since that version and could not find something directly related. I am not even sure if it is a problem in the parallel package or some other (memory) bug. What strikes me is that others have observed a very similar error when using Rscript but not when using an interactive R session, just like I do.

I am not expecting a fix based on the information I provide, but maybe someone has some thoughts on this!?

Regards,
Andreas




Actual output:

[1] "part 51 written successfully."
[1] "part 30 written successfully."
[1] "part 32 written successfully."
[1] "part 48 written successfully."
[1] "part 63 written successfully."
[1] "part 5 written successfully."
[1][1] "part 14 written successfully." "part 18 written successfully."

[1] "part 38 written successfully."
[1] "part 11 written successfully."
[1] "part 68 written successfully."
[1] "part 45 written successfully."
[1] "part 88 written successfully."
[1] "part 36 written successfully."
[1] "part 44 written successfully."
[1] "part 55 written successfully."
[1] "part 26 written successfully."
[1] "part 37 written successfully."
[1] "part 22 written successfully."
[1] "part 13 written successfully."
[1] "part 67 written successfully."
[1] "part 10 written successfully."
[1] "part 24 written successfully."
[1] "part 20 written successfully."
[1] "part 74 written successfully."
[1] "part 50 written successfully."
[1] "part 2 written successfully."
[1] "part 17 written successfully."
[1] "part 66 written successfully."
[1] "part 61 written successfully."
[1][1] "part 7 written successfully." "part 31 written successfully."

[1] "part 70 written successfully."
[1] "part 71 written successfully."
[1][1] "part 76 written successfully."
 "part 42 written successfully."
[1][1] "part 72 written successfully." "part 12 written successfully."

[1] "part 39 written successfully."
[1] "part 58 written successfully."
[1] "part 87 written successfully."
[1] "part 81 written successfully."
[1] "part 3 written successfully."
[1] "part 29 written successfully."
[1] "part 82 written successfully."
[1] "part 54 written successfully."
[1] "part 23 written successfully."
[1] "part 56 written successfully."
[1] "part 73 written successfully."
[1] "part 27 written successfully."
[1] "part 64 written successfully."
[1] "part 46 written successfully."
[1] "part 80 written successfully."
[1] "part 53 written successfully."
[1] "part 43 written successfully."
[1] "part 47 written successfully."
[1] "part 57 written successfully."
[1] "part 75 written successfully."
[1] "part 84 written successfully."
[1] "part 85 written successfully."
[1][1] "part 86 written successfully."
 "part 19 written successfully."
[1][1] "part 33 written successfully." "part 34 written successfully."

[1] "part 21 written successfully."
[1] "part 40 written successfully."
[1][1] "part 4 written successfully."
 "part 28 written successfully."
[1] "part 62 written successfully."
[1] "part 16 written successfully."
[1] "part 79 written successfully."
[1] "part 6 written successfully."
[1] "part 77 written successfully."
[1] "part 65 written successfully."
[1] "part 60 written successfully."
[1] "part 78 written successfully."
[1] "part 1 written successfully."
[1] "part 83 written successfully."
[1] "part 9 written successfully."
[1] "part 15 written successfully."
[1][1] "part 41 written successfully."
 "part 35 written successfully."
[1][1] "part 8 written successfully." "part 69 written successfully."

[1] "part 59 written successfully."
[1] "part 25 written successfully."
[1] "part 49 written successfully."
[1] "part 52 written successfully."
List of 88
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
 $ : logi TRUE
[1] "part 31 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> lapply -> FUN -> mcparallel -> sendMaster
[1] "part 71 written successfully."
[1] "part 35 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 42 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
[1] "part 4 written successfully."
Execution halted
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1][1] "part 30 written successfully." "part 40 written successfully."

Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
[1] "part 41 written successfully."
[1] "part 80 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 45 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
[1] "part 82 written successfully."
[1] "part 54 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 59 written successfully."
Execution halted
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 27 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
[1] "part 36 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
[1] "part 44 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
Execution halted
[1] "part 83 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 61 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 70 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 66 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 10 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
Execution halted
[1] "part 53 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 55 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 15 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 50 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 5 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
Execution halted
[1] "part 48 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 32 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 69 written successfully."
Execution halted
[1] "part 52 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 77 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
Execution halted
[1] "part 24 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 56 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 6 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 62 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 33 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
[1] "part 16 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 34 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 74 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
[1] "part 2 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 19 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
[1] "part 9 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 11 written successfully."
[1] "part 28 written successfully."
[1] "part 29 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 17 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 65 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
Execution halted
Execution halted
[1] "part 68 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 43 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 38 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 1 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 57 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 8 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1] "part 25 written successfully."
Error in sendMaster(try(eval(expr, env), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> mcparallel -> sendMaster
[1]List of 85
 $ : logi TRUE
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
 $ : NULL
#
Hi again,

One important correction of my first message: I misinterpreted the output. Actually in that R session 2 input files were processed one after the other in a loop. The first (with 88 parts went fine). The second (with 85 parts) produced the sendMaster errors and failed. If (in a new session via Rscript) I only process the second input file it will work. The other observations on R vs Rscript, NFS share etc. still hold.

Sorry for this! Regards,
Andreas

2019-11-27 12:10 GMT+01:00 Andreas Kersting<r-devel at akersting.de>:
#
Hi Andreas,

the error is reported when some child process cannot send results to the 
master process, which originates from an error returned by write() - 
when write() returns -1 or 0. The logic around the writing has not 
changed since R 3.5.2. It should not be related to the printing in the 
child, only to returning the value. The problem may be originating from 
the execution environment, virtualization, and/or possibly from a lack 
of robustness in R. To resolve this we need to find out which error was 
returned and why. Either you can try to create a reproducible example 
(something I could use to trigger an error on my system and then debug) 
or to debug on your system (build R from source, ensure the bug is still 
triggered, then instrument to print the exact error from the OS and 
where it was detected, etc). In principle you could also try without 
code instrumentation just using strace. Just from looking at the code in 
R around the writing I am not seeing any bug there. If you choose to 
debug on your system I can help with the instrumentation.

Best
Tomas
On 11/27/19 12:40 PM, Andreas Kersting wrote:
#
Hi Tomas,

Thanks for your prompt reply and your offer to help. I might need to get back to this since I am not too experienced in debugging these kinds of issues. Anyway, I gave it a try and I think I have found the immediate cause:

I installed the debug symbols (r-base-core-dbg), placed https://github.com/wch/r-source/blob/tags/R-3-5-2/src/library/parallel/src/fork.c in cwd and changed the wrapper code to:

mc_ret <- parallel::mclapply(seq_len(nrow(parts)), function(i) {
    # we fail for the input resulting in parts having 85 rows
    if (nrow(parts) == 85L && !file.exists(as.character(Sys.getpid()))) {
      file.create(as.character(Sys.getpid()))
      print(Sys.getpid())
      Sys.sleep(30)
    }

    # ...

    return(TRUE)
  }, mc.preschedule = TRUE, mc.cores = 2L)

This way I ended up with only two child processes to which I each attached a debugger. In total I ran about 10 debugging sessions and it was always the second child process failing. The errno after write returned -1 was 9 (EBADF).
File descriptors directly after attaching gdb to both child processes during Sys.sleep(30):

### master
root at ip-10-0-48-30:~/latest_test# ls -l /proc/22119/fd
total 0
lrwx------ 1 root root 64 Nov 28 04:49 0 -> /dev/pts/0
lrwx------ 1 root root 64 Nov 28 04:49 1 -> /dev/pts/0
lrwx------ 1 root root 64 Nov 28 04:49 2 -> /dev/pts/0
lr-x------ 1 root root 64 Nov 28 04:49 3 -> /path/to/script.R
lrwx------ 1 root root 64 Nov 28 04:49 4 -> /path/on/nfs/write.lock
lr-x------ 1 root root 64 Nov 28 04:49 5 -> 'pipe:[266120]'
l-wx------ 1 root root 64 Nov 28 04:49 8 -> 'pipe:[266121]'


### first child (writes to fd 6)
(gdb) shell ls -l /proc/22134/fd
total 0
lrwx------ 1 root root 64 Nov 28 04:42 0 -> /dev/pts/0
lrwx------ 1 root root 64 Nov 28 04:42 1 -> /dev/pts/0
lrwx------ 1 root root 64 Nov 28 04:42 2 -> /dev/pts/0
lr-x------ 1 root root 64 Nov 28 04:42 3 -> /path/to/script.R
lrwx------ 1 root root 64 Nov 28 04:42 4 -> /path/on/nfs/write.lock
l-wx------ 1 root root 64 Nov 28 04:42 6 -> 'pipe:[266120]'
l-wx------ 1 root root 64 Nov 28 04:42 8 -> 'pipe:[266121]'

### second child (tries writing to fd 7)
(gdb) shell ls -l /proc/22135/fd
total 0
lr-x------ 1 root root 64 Nov 28 04:42 0 -> 'pipe:[266123]'
lrwx------ 1 root root 64 Nov 28 04:42 1 -> /dev/pts/0
l-wx------ 1 root root 64 Nov 28 04:42 10 -> 'pipe:[266123]'
lrwx------ 1 root root 64 Nov 28 04:42 2 -> /dev/pts/0
lr-x------ 1 root root 64 Nov 28 04:42 3 -> /path/to/script.R
lrwx------ 1 root root 64 Nov 28 04:42 4 -> /path/on/nfs/write.lock


# debugging mc_send_master in both child processes:

### first child (all OK):
(gdb) break mc_send_master
Breakpoint 1 at 0x7f5b2687b0a0: file fork.c, line 681.
(gdb) c
Continuing.

Breakpoint 1, mc_send_master (what=0x555c6dd2cd00) at fork.c:681
warning: Source file is more recent than executable.
681	{
(gdb) info args
what = 0x555c6dd2cd00
(gdb) n
682	    if (is_master)
(gdb) n
684	    if (master_fd == -1) 
(gdb) n
686	    if (TYPEOF(what) != RAWSXP) 
(gdb) n
688	    R_xlen_t len = XLENGTH(what);
(gdb) n
689	    unsigned char *b = RAW(what);
(gdb) n
693	    if (writerep(master_fd, &len, sizeof(len)) != sizeof(len)) {
(gdb) info locals
len = 538
b = <optimized out>
n = <optimized out>
(gdb) s
writerep (fildes=6, buf=buf at entry=0x7ffdd6da7db0, nbyte=nbyte at entry=8)
    at fork.c:653
653	{
(gdb) info args
fildes = 6
buf = 0x7ffdd6da7db0
nbyte = 8
(gdb) n
654	    size_t wbyte = 0;
(gdb) n
653	{
(gdb) n
657		ssize_t w = write(fildes, ptr + wbyte, nbyte - wbyte);
(gdb) n
658		if (w == -1) {
(gdb) n
657		ssize_t w = write(fildes, ptr + wbyte, nbyte - wbyte);
(gdb) n
658		if (w == -1) {
(gdb) n
664		if (w == 0) {
(gdb) n
670		wbyte += w;
(gdb) n
671		if (wbyte == nbyte)
(gdb) n
672		    return wbyte;
(gdb) n
674	}
(gdb) n
mc_send_master (what=<optimized out>) at fork.c:699
699	    for (R_xlen_t i = 0; i < len; i += n) {
(gdb) n
700		n = writerep(master_fd, b + i, len - i);
(gdb) n
701		if (n < 1) {
(gdb) n
699	    for (R_xlen_t i = 0; i < len; i += n) {
(gdb) n
707	    return ScalarLogical(1);
(gdb) n
708	}



### second child (write fails with EBADF):
(gdb) break mc_send_master
Breakpoint 1 at 0x7f5016f3e0a0: file fork.c, line 681.
(gdb) c
Continuing.

Breakpoint 1, mc_send_master (what=0x563382a71910) at fork.c:681
warning: Source file is more recent than executable.
681	{
(gdb) info args
what = 0x563382a71910
(gdb) n
682	    if (is_master)
(gdb) n
684	    if (master_fd == -1) 
(gdb) n
686	    if (TYPEOF(what) != RAWSXP) 
(gdb) n
688	    R_xlen_t len = XLENGTH(what);
(gdb) n
689	    unsigned char *b = RAW(what);
(gdb) n
693	    if (writerep(master_fd, &len, sizeof(len)) != sizeof(len)) {
(gdb) info locals
len = 526
b = <optimized out>
n = <optimized out>
(gdb) s
writerep (fildes=7, buf=buf at entry=0x7fff4027ad60, nbyte=nbyte at entry=8)
    at fork.c:653
653	{
(gdb) info args
fildes = 7
buf = 0x7fff4027ad60
nbyte = 8
(gdb) n
654	    size_t wbyte = 0;
(gdb) n
653	{
(gdb) n
657		ssize_t w = write(fildes, ptr + wbyte, nbyte - wbyte);
(gdb) n
658		if (w == -1) {
(gdb) info locals
w = <optimized out>
wbyte = 0
ptr = 0x7fff4027ad60 "\016\002"
(gdb) n
657		ssize_t w = write(fildes, ptr + wbyte, nbyte - wbyte);
(gdb) n
658		if (w == -1) {
(gdb) n
659		    if (errno == EINTR)
(gdb) n
674	}
(gdb) p __errno_location()
$1 = (int *) 0x7f50322cb540
(gdb) x/x $1
0x7f50322cb540:	0x00000009
(gdb) python import errno
(gdb) python print(errno.errorcode[9])
EBADF
(gdb) n
mc_send_master (what=<optimized out>) at fork.c:702
702		    close(master_fd);
(gdb) n
704		    error(_("write error, closing pipe to the master"));
(gdb) n
703		    master_fd = -1;
(gdb) n
704		    error(_("write error, closing pipe to the master"));
(gdb) n
685		error(_("there is no pipe to the master process"));


Does this help in any way? 

Is there something else I can/should look at?

Regards,
Andreas


2019-11-27 15:04 GMT+01:00 Tomas Kalibera<tomas.kalibera at gmail.com>:
#
Hi Andreas,

thank you very much, good job finding it was EBADF. Now the question is 
why the pipe has been closed prematurely; it could be accidentally by R 
(a race condition in the cleanup code in fork.c) or possibly by some 
other code running in the same process (maybe the R program itself or 
some other code it runs). Maybe we can take this off the list and come 
back when we know the cause or have it fixed.

It would help a lot if you could try with R built from source, with 
optimizations disabled to get more accurate debug symbols (e.g. env 
CFLAGS="-Wall -O0 -gdwarf-2 -g3" CXXFLAGS="-Wall -O0 -gdwarf-2 -g3" 
./configure), and with MC_DEBUG defined in fork.c - line 26. Ideally in 
R-devel, so that we are sure the problem still exists. The debug 
messages should give a hint whether it was R (fork.c) that closed the 
pipe and why. Maybe you could also add a debug message to 
close_fds_child_ci() to see if it was closed there. Maybe you could find 
this out even in your current debugging setup via breakpoints and 
backtraces, but I think it may be easier to build from source with these 
debugging messages.

Also if you could send me a complete example I could run that causes 
this on your system, that would be nice (even if it didn't cause the 
problem on my system).

Thanks
Tomas
On 11/28/19 6:35 AM, Andreas Kersting wrote:
#
Hi Tomas,

I rebuild R (v3.5.2 for now, R-devel to follow) from the Debian package with MC_DEBUG defined and hopefully also with "-Wall -O0 -gdwarf-2 -g3", though I still have to verify this.

Below is the output. I think it is a total of two mclapply invocations in this R session, the failing one starting around the lines "[1] 15381" and "[1] 15382". The "Error in partDF::write_partDF ..." is because the script/package checks the return value of mclapply and detects that it is not a list of length 85 with only the elements "TRUE".

Regarding sending you the complete example: I first have to figure out if this is possible at all, because it would involve data of a client.

Regards,
Andreas

parent[15366] created pipes: comm (6->5), sir (8->7)
parent registers new child 15379
child process 15379 started
parent[15366] created pipes: comm (7->6), sir (10->9)
parent registers new child 15380
child process 15380 started
select_children: added child 15380 (6)
select_children: added child 15379 (5)
select_children: maxfd=6, wlen=2, wcount=2, timeout=-1.000000
child 15380: send_master (550 bytes)
  sr = 1
 - read select 1 children: 15380 
child 15380: 'mcexit' called
child 15380 is waiting for permission to exit
read_child_ci(15380) - read length returned 8
read_child_ci(15380) - read 550 at 0 returned 550
select_children: added child 15380 (6)
select_children: added child 15379 (5)
select_children: maxfd=6, wlen=2, wcount=2, timeout=-1.000000
  sr = 1
 - read select 1 children: 15380 
read_child_ci(15380) - read length returned 8
detached child 15380 (signal 10)
child process 15380 got SIGUSR1; child_exit_status=-1
child 15380: exiting
select_children: added child 15379 (5)
select_children: maxfd=5, wlen=1, wcount=1, timeout=-1.000000
child 15380 terminated with exit status 0
child 15379: send_master (550 bytes)
  sr = 1
 - read select 1 children: 15379 
read_child_ci(15379) - read length returned 8
read_child_ci(15379) - read 550 at 0 returned 550
select_children: added child 15379 (5)
select_children: maxfd=5, wlen=1, wcount=1, timeout=-1.000000
child 15379: 'mcexit' called
child 15379 is waiting for permission to exit
  sr = 1
 - read select 1 children: 15379 
read_child_ci(15379) - read length returned 8
detached child 15379 (signal 10)
child process 15379 got SIGUSR1; child_exit_status=-1
child 15379: exiting
removing waited-for child 15380 from the list
killed detached child 15379 (signal 15)
removing waited-for child 2147483647 from the list
child 15379 terminated with exit status 0
removing waited-for child 15379 from the list
parent[15366] created pipes: comm (6->5), sir (8->7)
parent registers new child 15381
child process 15381 started
parent[15366] created pipes: comm (7->6), sir (10->9)
[1] 15381
parent registers new child 15382
child process 15382 started
select_children: added child 15382 (6)
select_children: added child 15381 (5)
select_children: maxfd=6, wlen=2, wcount=2, timeout=-1.000000
  sr = 1
 - read select 1 children: 15382 
read_child_ci(15382) - read length returned 0
detached child 15382 (signal 10)
child process 15382 got SIGUSR1; child_exit_status=-1
select_children: added child 15381 (5)
select_children: maxfd=5, wlen=1, wcount=1, timeout=-1.000000
[1] 15382
child 15382: send_master (526 bytes)
Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) : 
  write error, closing pipe to the master
Calls: lapply ... <Anonymous> -> <Anonymous> -> lapply -> FUN -> sendMaster
child 15382: 'mcexit' called
child 15382: exiting
child 15382 terminated with exit status 1
child 15381: send_master (538 bytes)
  sr = 1
 - read select 1 children: 15381 
read_child_ci(15381) - read length returned 8
read_child_ci(15381) - read 538 at 0 returned 538
child 15381: 'mcexit' called
child 15381 is waiting for permission to exit
select_children: added child 15381 (5)
select_children: maxfd=5, wlen=1, wcount=1, timeout=-1.000000
  sr = 1
 - read select 1 children: 15381 
read_child_ci(15381) - read length returned 8
detached child 15381 (signal 10)
child process 15381 got SIGUSR1; child_exit_status=-1
child 15381: exiting
removing waited-for child 15382 from the list
killed detached child 15381 (signal 15)
removing waited-for child 2147483647 from the list
Error in partDF::write_partDF(X, out_path, c("Date", "Panel", "Type",  : 
  at least one core did not return from parllel write; maybe it was killed (be the Linux Out of Memory Killer ?) or there was a fatal error in the forked process(es)
Calls: lapply -> FUN -> write_partDF_repo -> <Anonymous>
Execution halted
killed detached child 15381 (signal 9)
child 15381 terminated with exit status 0
removing waited-for child 15381 from the list
process 15366 parallel shutdown ok


2019-11-28 10:40 GMT+01:00 Tomas Kalibera<tomas.kalibera at gmail.com>: