ISSUE: Using *forks* for parallel processing in R is not always safe. The `parallel::mclapply()` function uses forked processes to parallelize. One example where it has been confirmed that forked processing causes problems is when running R via RStudio. It is recommended to use PSOCK clusters (`parallel::makeCluster()`) rather than *forked* processes when running R from RStudio (https://github.com/rstudio/rstudio/issues/2597#issuecomment-482187011).

AFAIK, it is not straightforward to disable forked processing in R. One could set environment variable `MC_CORES=1`, which will set R option `mc.cores=1` when the parallel package is loaded. Since `mc.cores = getOption("mc.cores", 2L)` is the default for `parallel::mclapply()`, this will cause `mclapply()` to fall back to `lapply()`, avoiding _forked_ processing. However, this does not work when the code specifies argument `mc.cores`, e.g. `mclapply(..., mc.cores = detectCores())`.

SUGGESTION: Introduce environment variable `R_ENABLE_FORKS` and corresponding R option `enable.forks` that both take logical scalars. By setting `R_ENABLE_FORKS=false` or equivalently `enable.forks=FALSE`, `parallel::mclapply()` will fall back to `lapply()`. For `parallel::mcparallel()`, we could produce an error if forks are disabled.

Comments?

/Henrik
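For illustration only, the proposed behaviour can be approximated today in user code. The sketch below assumes the names from the suggestion above: option `enable.forks` and environment variable `R_ENABLE_FORKS` are not recognised by R itself, and `forks_enabled()` and `safe_mclapply()` are hypothetical helpers, not part of the parallel package.

```r
library(parallel)

# Hypothetical helper: reads the *proposed* setting. R itself does not
# know about option `enable.forks` or env var `R_ENABLE_FORKS`.
forks_enabled <- function() {
  opt <- getOption("enable.forks")
  if (is.null(opt)) opt <- Sys.getenv("R_ENABLE_FORKS", unset = "true")
  flag <- as.logical(opt)[1L]
  # Unparseable values are treated as "forks allowed"
  if (is.na(flag)) TRUE else flag
}

# Hypothetical wrapper: falls back to lapply() when forks are disabled
# (or unavailable, as on Windows), even if the caller passes mc.cores.
safe_mclapply <- function(X, FUN, ..., mc.cores = getOption("mc.cores", 2L)) {
  if (!forks_enabled() || .Platform$OS.type == "windows") {
    return(lapply(X, FUN, ...))
  }
  mclapply(X, FUN, ..., mc.cores = mc.cores)
}

# With forks disabled, the work runs sequentially in the current process
options(enable.forks = FALSE)
squares <- safe_mclapply(1:4, function(i) i^2, mc.cores = 2L)
```

Unlike `MC_CORES=1`, a wrapper of this kind takes effect even when `mc.cores` is given explicitly, but only for code that calls the wrapper; it cannot intercept `mclapply()` calls made inside other packages, which is why the suggestion targets R itself.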
SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()
14 messages · Henrik Bengtsson, Iñaki Ucar, Travers Ching +3 more
On Thu, 11 Apr 2019 at 22:07, Henrik Bengtsson
<henrik.bengtsson at gmail.com> wrote:
ISSUE: Using *forks* for parallel processing in R is not always safe. [...] Comments?
Using fork() is never safe. The reference provided by Kevin [1] is pretty compelling (I kindly encourage anyone who ever forked a process to read it). Therefore, I'd go beyond Henrik's suggestion, and I'd advocate for deprecating fork clusters and eventually removing them from parallel. [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
Iñaki Úcar
Just throwing my two cents in: I think removing/deprecating fork would be a bad idea for two reasons:

1) There are no performant alternatives
2) Removing fork would break existing workflows

Even if replaced with something using the same interface (e.g., a function that automatically detects variables to export as in the amazing `future` package), the lack of copy-on-write functionality would cause scripts everywhere to break.

A simple example illustrating these two points: `x <- 5e8; mclapply(1:24, sum, x, 8)`

Using fork, `mclapply` takes 5 seconds. Using "psock", `clusterApply` does not complete.

Travers
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On Fri, 12 Apr 2019 at 21:32, Travers Ching <traversc at gmail.com> wrote:
Just throwing my two cents in: I think removing/deprecating fork would be a bad idea for two reasons: 1) There are no performant alternatives
"Performant"... in terms of what. If the cost of copying the data predominates over the computation time, maybe you didn't need parallelization in the first place.
2) Removing fork would break existing workflows
I don't see why mclapply could not be rewritten using PSOCK clusters. And as a side effect, this would enable those workflows on Windows, which doesn't support fork.
Even if replaced with something using the same interface (e.g., a function that automatically detects variables to export as in the amazing `future` package), the lack of copy-on-write functionality would cause scripts everywhere to break.
To implement copy-on-write, Linux overcommits virtual memory, and this is what causes scripts to break unexpectedly: everything works fine, until you change a small unimportant bit and... boom, out of memory. And in general, running forks in any GUI would cause things everywhere to break.
A simple example illustrating these two points: `x <- 5e8; mclapply(1:24, sum, x, 8)` Using fork, `mclapply` takes 5 seconds. Using "psock", `clusterApply` does not complete.
I'm not sure how you set that up, but it does complete. Or do you mean that you ran out of memory? Then try replacing "x" with, e.g., "x+1" in your mclapply example and see what happens (hint: save your work first). -- Iñaki Úcar
Hi Inaki,
"Performant"... in terms of what. If the cost of copying the data predominates over the computation time, maybe you didn't need parallelization in the first place.
Performant in terms of speed. There's no copying in that example using `mclapply` and so it is significantly faster than other alternatives. It is a very simple and contrived example, but there are lots of applications that depend on processing of large data and benefit from multithreading. For example, if I read in large sequencing data with `Rsamtools` and want to check sequences for a set of motifs.
I don't see why mclapply could not be rewritten using PSOCK clusters.
Because it would be much slower.
To implement copy-on-write, Linux overcommits virtual memory, and this is what causes scripts to break unexpectedly: everything works fine, until you change a small unimportant bit and... boom, out of memory. And in general, running forks in any GUI would cause things everywhere to break.
I'm not sure how you set that up, but it does complete. Or do you mean that you ran out of memory? Then try replacing "x" with, e.g., "x+1" in your mclapply example and see what happens (hint: save your work first).
Yes, I meant that it ran out of memory on my desktop. I understand the limits, and it is not perfect because of the GUI issue you mention, but I don't see a better alternative in terms of speed. Regards, Travers
I think it's worth saying that mclapply() works as documented: it relies on forking, and so doesn't work well in environments where it's unsafe to fork. This is spelled out explicitly in the documentation of ?mclapply:

    It is strongly discouraged to use these functions in GUI or embedded environments, because it leads to several processes sharing the same GUI which will likely cause chaos (and possibly crashes). Child processes should never use on-screen graphics devices.

I believe the expectation is that users who need more control over the kind of cluster that's used for parallel computations would instead create the cluster themselves with e.g. `makeCluster()` and then use `clusterApply()` / `parLapply()` or other APIs as appropriate.

In environments where forking works, `mclapply()` is nice because you don't need to think -- the process is forked, and anything available in your main session is automatically available in the child processes. This is a nice convenience for when you know it's safe to fork R (and know what you're doing is safe to do within a forked process). When it's not safe, it's better to prefer the other APIs available for computation on a cluster.

Forking can be unsafe and dangerous, but it's also convenient and sometimes that convenience can outweigh the other concerns.

Finally, I want to add: the onus should be on the front-end to work well with R, and not the other way around. I don't think it's fair to impose extra work / an extra maintenance burden on the R Core team for something that's already clearly documented ...

Best, Kevin
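As a minimal sketch of the explicit-cluster route described above (the worker count and the variable `x` are arbitrary choices for illustration): with a PSOCK cluster nothing from the main session is available on the workers unless it is exported, which is the price paid for not forking.

```r
library(parallel)

# makeCluster() creates a PSOCK cluster by default: fresh R processes
# that share nothing with the main session.
cl <- makeCluster(2L)

x <- c(1, 2, 3)
# Unlike with mclapply(), data must be shipped to the workers explicitly.
clusterExport(cl, "x", envir = environment())

res <- parLapply(cl, 1:4, function(i) i + sum(x))

# Always shut the workers down when done.
stopCluster(cl)
```

The same code runs on Windows and inside GUIs, at the cost of copying `x` to every worker.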
I fully agree with Kevin. Front-ends can always use pthread_atfork() to close descriptors and suspend threads in children.

Anyone who thinks you can use PSOCK clusters has obviously not used mclapply() in real applications - trying to save the workspace and restore it in 20 new processes is not only incredibly wasteful (no shared memory whatsoever) but slow. If you want to use PSOCK just do it (I never do - you might as well just use a full cluster instead); multicore is for the cases where you want to parallelize something quickly and it works really well for that purpose.

I'd like to separate the issues here - the fact that RStudio has issues is really not R's fault - there is no technical reason why it shouldn't be able to handle it correctly. That is not to say that there are no cases where fork() is dangerous, but in most cases it's not and the benefits outweigh the risk.

That said, I do acknowledge the idea of having an ability to prevent forking if desired - I think that's a good idea, in particular if there is a standard that packages can also adhere to (yes, there are also packages that use fork() explicitly). I just think that the motivation is wrong (i.e., I don't think it would be wise for RStudio to prevent parallelization by default).

Also I'd like to point out that the main problem came about when packages started using parallel implicitly - the good citizens out there expose it as a parameter to the user, but not all packages do it, which means you can hit forked code without knowing it. If you use mclapply() in user code, you typically know what you're doing, but if a package author does it for you, it's a different story.

Cheers, Simon
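One way a package could expose parallelism as a parameter, sketched under assumptions: `col_means()` and its `cores` argument are hypothetical names chosen for this example and do not come from any particular package.

```r
library(parallel)

# Hypothetical package function: forking only happens when the caller
# explicitly asks for more than one core.
col_means <- function(df, cores = 1L) {
  stopifnot(is.data.frame(df), length(cores) == 1L, cores >= 1L)
  if (cores == 1L || .Platform$OS.type == "windows") {
    # Sequential default: safe in GUIs and on platforms without fork()
    res <- lapply(df, mean)
  } else {
    res <- mclapply(df, mean, mc.cores = cores)
  }
  unlist(res)
}

means <- col_means(data.frame(a = 1:3, b = 4:6))
```

Defaulting to `cores = 1L` means a user never reaches forked code without having opted in.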
On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <kevinushey at gmail.com> wrote:
I think it's worth saying that mclapply() works as documented
Mostly, yes. But it says nothing about fork's copy-on-write and memory overcommitment, and that this means that it may work nicely or fail spectacularly depending on whether, e.g., you operate on a long vector.
Iñaki Úcar
Sure, but that's a completely bogus argument because in that case it would fail even more spectacularly with any other method like PSOCK because you would *have to* allocate n times as much memory so unlike mclapply it is guaranteed to fail. With mclapply it is simply much more efficient as it will share memory as long as possible. It is rather obvious that any new objects you create can no longer be shared as they now exist separately in each process. Cheers, Simon
On Sat, 13 Apr 2019 at 18:41, Simon Urbanek <simon.urbanek at r-project.org> wrote:
Sure, but that's a completely bogus argument because in that case it would fail even more spectacularly with any other method like PSOCK because you would *have to* allocate n times as much memory so unlike mclapply it is guaranteed to fail. With mclapply it is simply much more efficient as it will share memory as long as possible. It is rather obvious that any new objects you create can no longer be shared as they now exist separately in each process.
The point was that PSOCK fails and succeeds *consistently*, independently of what you do with the input in the function provided. I think that's a good property.
Iñaki Úcar
On Apr 13, 2019, at 16:56, Iñaki Ucar <iucar at fedoraproject.org> wrote: On Sat, 13 Apr 2019 at 18:41, Simon Urbanek <simon.urbanek at r-project.org> wrote:
Sure, but that's a completely bogus argument because in that case it would fail even more spectacularly with any other method like PSOCK because you would *have to* allocate n times as much memory so unlike mclapply it is guaranteed to fail. With mclapply it is simply much more efficient as it will share memory as long as possible. It is rather obvious that any new objects you create can no longer be shared as they now exist separately in each process.
The point was that PSOCK fails and succeeds *consistently*, independently of what you do with the input in the function provided. I think that's a good property.
So does parallel. It is consistent. If you do things that use too much memory you will consistently fail. That's a pretty universal rule, there is nothing probabilistic about it. It makes no difference if it's PSOCK, multicore, or anything else.
1 day later
On 4/13/19 12:05 PM, Iñaki Ucar wrote:
On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <kevinushey at gmail.com> wrote:
I think it's worth saying that mclapply() works as documented
Mostly, yes. But it says nothing about fork's copy-on-write and memory overcommitment, and that this means that it may work nicely or fail spectacularly depending on whether, e.g., you operate on a long vector.
R cannot possibly replicate documentation of the underlying operating systems. It clearly says that fork() is used and readers who may not know what fork() is need to learn it from external sources. Copy-on-write is an elementary property of fork(). Reimplementing mclapply to use PSOCK does not make sense -- if someone wants to write code that can be used both with PSOCK and FORK, there is the cluster API in parallel for that. Tomas
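A short sketch of the cluster API mentioned above, where the same code runs on either backend and only the `type` argument changes. The choice of two workers and the platform check are assumptions made for this example.

```r
library(parallel)

# FORK clusters are not available on Windows, so pick the type by platform.
cluster_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"
cl <- makeCluster(2L, type = cluster_type)

# The computation itself is written once, independent of the backend.
squares <- parSapply(cl, 1:5, function(i) i^2)

stopCluster(cl)
```

With `type = "FORK"` the workers start as forked copies of the main session, so the caveats about GUIs discussed in this thread still apply to that branch.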
On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
On 4/13/19 12:05 PM, Iñaki Ucar wrote:
On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <kevinushey at gmail.com> wrote:
I think it's worth saying that mclapply() works as documented
Mostly, yes. But it says nothing about fork's copy-on-write and memory overcommitment, and that this means that it may work nicely or fail spectacularly depending on whether, e.g., you operate on a long vector.
R cannot possibly replicate documentation of the underlying operating systems. It clearly says that fork() is used and readers who may not know what fork() is need to learn it from external sources. Copy-on-write is an elementary property of fork().
Just to be precise, copy-on-write is an optimization widely deployed in most modern *nixes, particularly for the architectures in which R usually runs. But it is not an elementary property; it is not even possible without an MMU.
Iñaki Úcar
On 4/15/19 11:02 AM, Iñaki Ucar wrote:
On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
On 4/13/19 12:05 PM, Iñaki Ucar wrote:
On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <kevinushey at gmail.com> wrote:
I think it's worth saying that mclapply() works as documented
Mostly, yes. But it says nothing about fork's copy-on-write and memory overcommitment, and that this means that it may work nicely or fail spectacularly depending on whether, e.g., you operate on a long vector.
R cannot possibly replicate documentation of the underlying operating systems. It clearly says that fork() is used and readers who may not know what fork() is need to learn it from external sources. Copy-on-write is an elementary property of fork().
Just to be precise, copy-on-write is an optimization widely deployed in most modern *nixes, particularly for the architectures in which R usually runs. But it is not an elementary property; it is not even possible without an MMU.
Yes, old Unix systems without virtual memory had fork eagerly copying. Not relevant today, and certainly not for systems that run R, but indeed people interested in OS internals can look elsewhere for more precise information. Tomas