
Compiling R-devel on older Linux distributions, e.g. RHEL / CentOS 7

On Wed, Feb 8, 2023 at 12:22 PM Iñaki Ucar <iucar at fedoraproject.org> wrote:
We might operate in different environments, but there are lots of labs
that keep the exact same pipeline for years (5-10 years), because "it
works", and because if they change anything, they might have to
re-analyze all their old data to avoid batch effects purely from
different versions of algorithms. I can agree with this strategy too,
especially if your data are huge and staging them back on the compute
environment from cold storage can be a huge task in itself.  Then
there are reasons such as being less savvy, and bad memories from last
time they tried this (e.g. years ago), everything broke, and it took
them weeks and months to sort it out.  I'm not trying to make fun of
anyone here - it's just that on big clusters with many users, the
skill-level spectrum varies a lot.
I'm actually thinking maintenance and support. When you bring in Linux
containers, you basically introduce a bunch of new compute
environments in addition to your host system. So, instead of the
support team (often same as the sysadm) having to understand and
answer questions for a homogeneous environment, they now have to be
up-to-date with different versions of CentOS/Rocky, Ubuntu, Debian,
... and different container images. In R we often have a hard time
even getting users to report their sessionInfo() - now imagine their
container details.  If admins start providing one-off container
images, that becomes an added maintenance load. But, I agree, Linux
containers are great and make it possible for a lot of users to run
analyses that they otherwise would not be able to run on the host
system.
Yes, Singularity/Apptainer is awesome, especially since Docker is
mostly considered a no-no in HPC environments. The minimal, or even
zero use of SUID, these days, is great. That it runs as a regular
process as the user itself, with good default file mounts, is also
neat.  These things get even better with newer Linux kernels, which,
by the way, is another motivation for upgrading the OS.

That said, with Apptainer and the like, the user might run into
conflicts here, similar to what we see when users install software via
Conda, which often generates a parallel software stack to that of the
host system.  Taking R as an example, when a user installs packages, they
end up in R_LIBS_USER=~/R/%p-library/%v (*).  This is the same
directory regardless of running R on the host system, in a Linux
container, and in Linux containers based on different OSes.  So, if
they end up running a little bit here and there, which is not
unreasonable to expect if they work on different projects, then there
will be a mishmash of R package binaries that are not compatible with
each other.  This happens a ton when people use Conda.  Of course, a
savvy user will at some point figure this out, and configure their
R_LIBS_USER to adapt to the environment they run in, but the majority
won't notice this until it's too late.  And, boom, now you're adding
lots of load on the support team, and troubleshooting and undoing
these conflicts wastes a lot of effort.  In the worst case, the user
does not reach out for help, but instead struggles in silence and
might work with something half broken.  From my experience at UCSF
(~2,000 users on two big clusters), this is unfortunately not that
uncommon.

(*) My wish would be that R could also include the OS name and the OS
version in the default R_LIBS_USER, something like
R_LIBS_USER=~/R/%O-%p-library/%v, where %O would be a new
specification that expands to, say, "centos-7" or "ubuntu-22.04".
That would mitigate lots of these issues automatically.
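
Until something like %O exists, a user could approximate it from a
shell startup file.  The following is only a minimal sketch, not a
tested recommendation: os_tag is a hypothetical helper name, and it
assumes the OS provides a standard /etc/os-release file with ID and
VERSION_ID fields.  The %p and %v specifiers are left untouched so that
R expands them at startup as usual.

```shell
#!/bin/sh
# Hypothetical helper: print "<ID>-<VERSION_ID>" from an os-release file,
# e.g. "centos-7" or "ubuntu-22.04".  Defaults to /etc/os-release.
os_tag() {
  _os_release="${1:-/etc/os-release}"
  [ -r "$_os_release" ] || return 1
  # os-release files use shell-compatible assignments; source the file in
  # a subshell so its variables do not leak into the calling shell.
  ( . "$_os_release" && printf '%s-%s\n' "${ID:-linux}" "${VERSION_ID:-unknown}" )
}

# Only override R's default when the OS could be identified.
# %p and %v are passed through literally for R to expand.
if tag="$(os_tag)"; then
  export R_LIBS_USER="$HOME/R/${tag}-%p-library/%v"
fi
```

Note that this has to be evaluated inside each environment (host and
every container); a value that is merely inherited from the host would
still carry the host's OS tag.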

Thanks for the feedback and questions,

Henrik