Speeding up build-from-source
On Apr 27, 2013, at 11:34 AM, Adam Seering wrote:
On 04/27/2013 09:10 AM, Martin Morgan wrote:
On 04/26/2013 07:50 AM, Adam Seering wrote:
Hi,

I've been playing around with the R source code a little, mostly just trying to familiarize myself with it. I have access to some computers on a reservation system, so I've been reserving a computer, downloading and compiling R, and going from there.

I'm finding that R takes a long time to build, though. (Well, ok, maybe 5 minutes -- I'm impatient :-) ) Most of that time it's sitting there byte-compiling some internal package or another, which uses just one CPU core and so leaves the system mostly idle.

I'm just curious whether anyone has thought about parallelizing that process?
Hi Adam -- parallel builds are supported by adding the '-j' flag when you invoke make:

    make -j

The packages are already being built in parallel, in as much as their dependency structure allows. Also, you can configure without byte compilation (see ~/src/R-devel/configure --help) to make this part of the build go more quickly.

After an initial build, subsets of R -- e.g., just the 'main' sources or a single package like 'stats' -- can be rebuilt with (assuming R's source, e.g., from svn, is in ~/src/R-devel and you're building R in ~/bin/R-devel):

    cd ~/bin/R-devel/src/main
    make -j
    cd ~/bin/R-devel/src/library/stats
    make -j

The definitive source for answers to questions like these is
> RShowDoc("R-admin")
Martin
Hi Martin,

Thanks for the reply -- but I'm afraid the question you've answered isn't the question I intended to ask. Based on your response, I think the answer to my question is likely "no." But let me try rephrasing anyway, just in case.

I'm certainly quite aware of "-j" as a make argument; if I weren't, the bottleneck would not be the byte-compilation, and the build would take rather more than 5 minutes :-) That was the very first thing I tried. I don't believe that parallel make is as parallel as it theoretically could be. (In fact, I see almost no parallelism between libraries on my system; individual .c files are parallelized nicely, but only one library at a time. This mostly matters at the compiling-bytecode step, since that's the biggest serial operation per library.) My question is: has anyone thought about what it would take to parallelize the build further?
I think you may have failed to notice that installation of packages *is* parallelized. The *output* is shown only en-bloc, to avoid mixing the outputs of the parallel installations. But there are dependencies among packages, so those that require most of the others have to be built last -- nonetheless, in the current R you can install 9 recommended packages in parallel.
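The dependency constraint described above can be illustrated with a toy shell sketch. The `build` function and the package dependency graph here are illustrative stand-ins, not the real install commands: packages with no ordering constraint between them run in parallel, while a package that requires others must wait for them to finish.

```shell
#!/bin/sh
# Toy stand-in for installing one package; just records that it ran.
build() { touch "built_$1"; }

build base          # everything else depends on base

build methods &     # methods and utils are (pretend) independent of
build utils &       # each other, so they can run in parallel...
wait                # ...but we must wait for both before continuing

build stats         # pretend stats requires methods and utils
```

This is exactly the shape of the constraint in the real build: the width of the parallel phase is bounded by how many packages are mutually independent at that point in the graph.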
I'm not sure that this can be done with just the makefiles. But the following comment makes me at least a little suspicious:

"""
src/library/Makefile
## FIXME: do some of this in parallel?
"""

Surely some of the 'for' loops there could be unwound into proper make targets with dependency information? Though I'm not sure whether the dependency information would effectively force a serial compilation anyway.

Another approach, if the above is hard for some reason: what I'm seeing is that the byte compilation is largely serial; but, as you note, byte compilation is optional. Could the makefiles just defer it -- skip it up front, and then do the byte compilations for all of the packages concurrently?
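The "defer and parallelize" idea can be sketched in plain shell: one background job per package, each writing to its own log so the outputs can be replayed en-bloc afterwards (matching how the real build avoids mixing parallel output). The `bytecompile` function is a hypothetical stand-in for the real per-package step, which would spawn an R process.

```shell
#!/bin/sh
# Hypothetical stand-in for byte-compiling one package; the real step
# would run an R process against the package's installed code.
bytecompile() { echo "byte-compiling $1"; echo "$1 done"; }

logdir=$(mktemp -d)
pkgs="tools utils grid stats"

# One background job per package; each job's output goes to its own
# log file so the interleaved output of parallel jobs never mixes.
for pkg in $pkgs; do
    bytecompile "$pkg" > "$logdir/$pkg.log" 2>&1 &
done
wait    # block until every background job has finished

# Replay each package's output en-bloc, one package at a time.
for pkg in $pkgs; do
    cat "$logdir/$pkg.log"
done
```

Note that this sketch ignores the dependency problem raised elsewhere in the thread: it is only valid if no package's build consults another package's byte-compiled form in the meantime.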
The problem is, again, dependencies -- you cannot defer the compilation, since it would change a package *after* it has already been used by another package, which can cause inconsistencies. (Note that lazy loading is a red herring -- it's used regardless of compilation.) That said, you won't save a significant amount of time anyway (did you actually profile the time, or are you relying on your eyes to deceive you? ;)), so it's not worth the bother (try enabling LTO ;)).

Personally, I simply disable package byte-compilation for all development builds; you won't notice the difference for testing anyway. Moreover, you'll rarely be doing a full build repeatedly, so the 4 minutes it takes are certainly nothing compared to other projects of this size... It becomes more fun when you start building all CRAN packages ;).

Cheers,
Simon
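The point about profiling first is easy to act on. A rough shell sketch like the following (the two `phase_*` functions are placeholders -- swap in the real make invocations for the compile and byte-compile phases) would show whether the byte-compilation pass is actually where the minutes go, before anyone spends effort parallelizing it.

```shell
#!/bin/sh
# Placeholders for real build phases; replace the bodies with the
# actual commands, e.g. the C compilation and the byte-compile pass.
phase_compile()     { :; }
phase_bytecompile() { :; }

# Run a command and print its elapsed wall-clock time in whole seconds.
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

t_compile=$(elapsed phase_compile)
t_byte=$(elapsed phase_bytecompile)
echo "compile: ${t_compile}s, byte-compile: ${t_byte}s"
```

Whole-second resolution is crude, but for phases measured in minutes it is enough to tell where the time goes.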
From a very cursory read of the code, it looks like the relevant code is in src/library/tools/R/makeLazyLoad.R, and that file doesn't immediately look like it's doing anything that fundamentally couldn't be parallelized (i.e., running multiple R processes at once, one per library; at a glance the logic looks nicely per-library). A third approach could be to try to parallelize the logic in makeLazyLoad.R itself, though I would expect that to be at best much more difficult.

Anyway, there are lots of things that look like they could in theory be done here. And I know just enough at this point to be dangerous, but not enough to contribute :-) Hence my asking: has anyone thought about this? If not, I assume the best thing for me to do would be to poke at it and try to figure out on my own how this works and what's most feasible. But if anyone has any pointers, that would likely save me a bunch of time. And if this is something that you prefer to keep serial for some reason, that would be good to know too, so I don't spend time on it.

Thanks,
Adam
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel