
segfault issue with parallel::mclapply and download.file() on Mac OS X

4 messages · Martin Maechler, Gábor Csárdi, Seth Russell

#
I have an lapply function call that I want to parallelize. Below is a very
simplified version of the code:

url_base <- "https://cloud.r-project.org/src/contrib/"
files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")
res <- parallel::mclapply(files, function(s)
    download.file(paste0(url_base, s), s))

Instead of downloading a couple of files in parallel, I get a segfault per
process with a 'memory not mapped' message. I've been working with Henrik
Bengtsson on resolving this issue, and he recommended I send a message to
the R-devel mailing list.

Here's the output:

trying URL 'https://cloud.r-project.org/src/contrib/A3_1.0.0.tar.gz'
trying URL 'https://cloud.r-project.org/src/contrib/ABC.RAP_0.9.0.tar.gz'

 *** caught segfault ***
address 0x11575ba3a, cause 'memory not mapped'

 *** caught segfault ***
address 0x11575ba3a, cause 'memory not mapped'

Traceback:
 1: download.file(paste0(url_base, s), s)
 2: FUN(X[[i]], ...)
 3: lapply(X = S, FUN = FUN, ...)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: tryCatchList(expr, classes, parentenv, handlers)
 7: tryCatch(expr, error = function(e) {
        call <- conditionCall(e)
        if (!is.null(call)) {
            if (identical(call[[1L]], quote(doTryCatch)))
                call <- sys.call(-4L)
            dcall <- deparse(call)[1L]
            prefix <- paste("Error in", dcall, ": ")
            LONG <- 75L
            sm <- strsplit(conditionMessage(e), "\n")[[1L]]
            w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")
            if (is.na(w))
                w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b")
            if (w > LONG)
                prefix <- paste0(prefix, "\n  ")
        }
        else prefix <- "Error : "
        msg <- paste0(prefix, conditionMessage(e), "\n")
        .Internal(seterrmessage(msg[1L]))
        if (!silent && isTRUE(getOption("show.error.messages"))) {
            cat(msg, file = outFile)
            .Internal(printDeferredWarnings())
        }
        invisible(structure(msg, class = "try-error", condition = e))
    })
 8: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
 9: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))
10: FUN(X[[i]], ...)
11: lapply(seq_len(cores), inner.do)
12: parallel::mclapply(files, function(s) download.file(paste0(url_base,
      s), s))

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Here's my sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin16.7.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.3/lib/libopenblasp-r0.3.3.dylib

locale:
[1] en_US/en_US/en_US/C/en_US/en_US

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

loaded via a namespace (and not attached):
[1] compiler_3.5.1

The version of R I'm running was installed via Homebrew with "brew install r
--with-java --with-openblas".

The provided example code works as expected on Linux. Additionally, if I
provide a non-default download method to the download.file() call, such as:

res <- parallel::mclapply(files, function(s)
    download.file(paste0(url_base, s), s, method = "wget"))
res <- parallel::mclapply(files, function(s)
    download.file(paste0(url_base, s), s, method = "curl"))

It works correctly, with no segfault. If I use method="libcurl", it does
segfault.

I'm not sure what steps to take to further narrow down the source of the
error.

Is this a known bug? If not, is this a new bug or an unexpected feature?

Thanks,
Seth
#
> I have an lapply function call that I want to parallelize. Below is a very
    > simplified version of the code:

    > url_base <- "https://cloud.r-project.org/src/contrib/"
    > files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")
    > res <- parallel::mclapply(files, function(s) download.file(paste0(url_base,
    > s), s))

    > Instead of downloading a couple of files in parallel, I get a segfault per
    > process with a 'memory not mapped' message. I've been working with Henrik
    > Bengtsson on resolving this issue and he recommended I send a message to
    > the R-Devel mailing list.

Thank you for the simple reproducible (*) example.

If I run the above in either R-devel or R 3.5.1, it works
flawlessly [on Linux Fedora 28]. .... ah, now I see you say so
much later that methods other than "libcurl" also work.

Note that "libcurl" is also the default method on Linux,
where things work.

I've also tried it on a Windows server to which I have easy access,
and the following code -- also explicitly using "libcurl" --

##--------------------------------------------------------------
url_base <- "https://cloud.r-project.org/src/contrib/"
files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")
res <- parallel::mclapply(files, function(s)
            download.file(paste0(url_base, s), s, method="libcurl"))
##--------------------------------------------------------------

works fine there too.

- So maybe this should have gone to the R-SIG-Mac mailing list
  instead of this one?

- Can other macOS R users try and see?

--
*) at least until one of the two packages gets updated! ;-)

#
This code actually happens to work for me on macOS, but I think in
general you cannot rely on performing HTTP requests in fork clusters,
i.e. with mclapply().

Fork clusters create worker processes by forking the R process and
then _not_ executing another R binary. (Which is often convenient,
because the new processes will inherit the memory image of the parent
process.)

Fork without exec is not supported on macOS; basically any call into the
system libraries might crash. (I.e. not just HTTP-related calls.) For
HTTP calls I have seen errors, crashes, and sometimes it works,
depending on the combination of libcurl version, macOS version and
probably luck.

It usually (always?) works on Linux, but I would not rely on that, either.
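Not mentioned in the thread, but one way to sidestep fork-without-exec
entirely is a PSOCK cluster, which starts fresh R worker processes
(fork plus exec) instead of bare forks. A minimal sketch, reusing the
thread's example URLs (assuming they are still live on CRAN):

```r
# Sketch (my assumption, not code from the thread): a PSOCK cluster
# launches brand-new R sessions, avoiding the macOS fork-without-exec
# problem that plagues mclapply().
library(parallel)

url_base <- "https://cloud.r-project.org/src/contrib/"
files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")

cl <- makeCluster(2)            # two fresh R worker processes
clusterExport(cl, "url_base")   # PSOCK workers don't share memory
res <- parLapply(cl, files, function(s)
    download.file(paste0(url_base, s), s, method = "libcurl"))
stopCluster(cl)
```

The trade-off is startup cost and the need to export variables
explicitly, since the workers do not inherit the parent's memory image.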

So, yes, this is a known issue.

Creating new processes to perform HTTP requests in parallel is very often
bad practice, actually. Whenever you can, use I/O multiplexing instead:
the main R process is not doing anything anyway, just waiting for the
data to come in, so you don't need more processes, you need parallel
I/O. Take a look at curl::multi_add() and related functions.
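As a rough sketch of the multiplexing approach described above (the
callback wiring here is my assumption, not code from the thread, and it
assumes the example URLs are still live):

```r
# Sketch of parallel I/O with the curl package: one R process, one
# connection pool, several concurrent transfers.
library(curl)

url_base <- "https://cloud.r-project.org/src/contrib/"
files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")

pool <- new_pool()
for (f in files) {
    local({
        dest <- f  # freeze the current file name for the callbacks
        multi_add(
            new_handle(url = paste0(url_base, dest)),
            done = function(res) writeBin(res$content, dest),
            fail = function(msg) warning(dest, ": ", msg),
            pool = pool
        )
    })
}
multi_run(pool = pool)  # blocks until every transfer has finished
```

The `local()` wrapper is there so each `done` callback captures its own
copy of the file name rather than the loop variable.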

Btw. download.file() can actually download files in parallel if the
libcurl method is used; just give it a character vector of URLs. This
API is very restricted, though, so I suggest looking at the curl
package.
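For reference, the vectorized form just mentioned looks like this (same
example URLs as before, assuming they are still live):

```r
# With method = "libcurl", download.file() accepts parallel vectors of
# URLs and destination files and fetches them simultaneously.
url_base <- "https://cloud.r-project.org/src/contrib/"
files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")
download.file(paste0(url_base, files), destfile = files,
              method = "libcurl")
```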

Gábor

On Thu, Sep 20, 2018 at 8:44 AM Seth Russell
<seth.russell at gmail.com> wrote:
#
Thanks for the warning about fork without exec(). A co-worker of mine, also
on Mac, ran the sample code and got an error about that exact problem.

Thanks also for the pointer to try curl::multi_add() or download.file()
with a vector of files.

My actual use case includes downloading the files and then calling
untar() for analysis of the files contained in each tar.gz archive. I'm
currently parallelizing both the download and the untar operation, and
found that using a parallel form of lapply resulted in a 4x-8x
improvement depending on hardware, network latency, etc. I'll see how
much of that improvement can be attributed to I/O multiplexing for the
downloading portion using your recommendations.
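The extract step of the workflow described above can be sketched like so
(using a locally created archive so the snippet is self-contained; in
the real pipeline the downloaded .tar.gz files would be used):

```r
# Self-contained sketch of the untar step: build a tiny .tar.gz locally,
# then extract it, mimicking what is done with each downloaded package.
dir.create("pkg_demo", showWarnings = FALSE)
writeLines("Package: pkg.demo", file.path("pkg_demo", "DESCRIPTION"))
tar("pkg_demo.tar.gz", "pkg_demo", compression = "gzip")
untar("pkg_demo.tar.gz", exdir = "extracted")
list.files("extracted", recursive = TRUE)  # "pkg_demo/DESCRIPTION"
```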

Seth

Trimmed reply from Gábor Csárdi <csardi.gabor at gmail.com>: