
R, Nomad, HTCondor, etc... and future

7 messages · David Bellot, George Ostrouchov, Dirk Eddelbuettel +1 more

#
Hi R HPC,

I was wondering if anyone has ever used Nomad from Hashicorp as a backend
engine to run R distributed code.
Moreover, if you use it with *future*, I'd love to hear about your
experience, and if you published a package for it, I'd love to use it too.

If not, which other distribution engine do you use (apart from those
supported in *batchtools*)?
HTCondor, Docker Swarm, BOINC, etc.?

I haven't made a final decision yet on which engine to use, but it has to be
versatile enough (I know that's not a lot of information, but think
"corporate environment" with various needs; my R need is just one among
many others).

Thanks for your help.
David
#
Hi David,

I live in a large HPC world, where distributed computing is inherently batch, so take my advice with that perspective. Large systems are mostly incompatible with the interactive concept of a "backend" and instead support SPMD-style batch programming. SPMD is mostly MPI+X, meaning that the distributed aspect is handled by MPI and the within-node aspects can vary among several options, including MPI, fork, OpenMP, and OpenACC. But even on a medium Slurm-managed cluster (possibly in a corporate environment), for R I would recommend a combination of the pbdR.org distributed packages and the parallel package's mclapply for within-node parallelism.
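As an illustration of the within-node piece George recommends, here is a minimal parallel::mclapply sketch. It is fork-based, so it only parallelizes on Unix-alikes, and the `objective` function is a made-up CPU-heavy stand-in, not anything from this thread:

```r
## Within-node parallelism with parallel::mclapply (fork-based, Unix-alikes).
## 'objective' is a hypothetical CPU-heavy function used for illustration.
library(parallel)

objective <- function(x) {
  ## stand-in for an expensive objective: a long deterministic sum
  sum(sin(seq_len(1e5) * x))
}

grid <- seq(0.1, 1.0, by = 0.1)          # candidate parameter values
ncores <- max(1L, detectCores() - 1L)    # leave one core free
res <- mclapply(grid, objective, mc.cores = ncores)

## mclapply returns a list in the same order as the input
stopifnot(length(res) == length(grid))
```

On Windows, mclapply silently falls back to serial execution unless mc.cores = 1; parLapply over a PSOCK cluster is the portable alternative.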

Best,
George

-----Original Message-----
From: R-sig-hpc <r-sig-hpc-bounces at r-project.org> on behalf of David Bellot <david.bellot at gmail.com>
Date: Thursday, May 21, 2020 at 8:26 PM
To: <r-sig-hpc at r-project.org>
Subject: [R-sig-hpc] R, Nomad, HTCondor, etc... and future

#
Thanks George.
To give more color to what I'm trying to achieve, let me describe the two
opposite use cases. My use case is in R, obviously: I run one-shot jobs
to explore data sets as fast as possible, and I run optimization algorithms in
which the objective function is really CPU-intensive to compute.

At the same time, other people in the same organization want to run
services, written in other languages, on the same cluster of computers.
Those services are very different in nature, but in general the idea is to
have a collection of processes always ready to answer a request when
needed. Ideally, the same cluster should be used by everyone, so as to
maximize its uptime, not waste expensive resources, etc. And ideally,
I don't want many job schedulers/distribution engines to manage at
the same time. Kind of a Holy Grail, I concede.

Hence me looking at things like Nomad, HTCondor, etc...

On Fri, May 22, 2020 at 11:19 AM Ostrouchov, George <georgeost at gmail.com>
wrote:

  
  
#

I don't see from the above how your 'one-shot jobs' are different from your
colleagues' need to send spot requests.

I found Slurm reasonable in the past, and it has only become more widely used
/ available since. It will provide you with access to the compute resources,
will account for 'who does what', and can do scheduling / resource limits (which I never
really needed, and it sounds like you don't either). Plus it will give you an easy
view of what is currently up or down, available, etc.

The devil is, as always, in the details. I'd say experiment a little and
take it from there.

Dirk
#
You're right, I didn't explain it correctly. On one hand, I have experiments to
run.
Think 'foreach %dopar%' loops and things like that. When it's done, I
look at the result, and the work is done. My program has run and I don't
need it anymore.
On the other hand, they have many small services they want to keep waiting
24/7 and run on demand. They don't need to be heavy on
CPU, except maybe for the few seconds when the services are called. In my
use case, I don't need a service to stay up 24/7, but I use the CPU very
intensively.
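As a sketch of that one-shot 'foreach %dopar%' pattern, assuming the CRAN packages foreach and doParallel (the thread only names %dopar%, so the backend choice here is an assumption):

```r
## One-shot parallel loop: spin up workers, compute, tear down.
library(foreach)
library(doParallel)

cl <- makeCluster(2)          # two local worker processes
registerDoParallel(cl)

## each iteration could be a CPU-heavy objective evaluation
res <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2
}

stopCluster(cl)               # the work is done; nothing stays up
res
```

The makeCluster/stopCluster bracket is exactly the one-shot shape described above: workers exist only for the lifetime of the job.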

And describing it like this now, I simply realized that solving these two
different problems with one single solution seems a bit ... huh... silly :-)

I'll give Slurm a try then. You're not the first one to say it's a good
tool.
Thanks Dirk.
#
On 22 May 2020 at 12:44, David Bellot wrote:
| I'll give Slurm a try then. You're not the first one to say it's a good
| tool.

And it works with the older foreach / doSOMETHING world of yore, and in the
newer world of Henrik's future package, and it should work with batchtools too.

Dirk
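A hedged sketch of the future-on-Slurm route Dirk mentions, using the future.batchtools package; the template file name and the resources list are assumptions to adapt to your site, and this only runs against an actual Slurm cluster:

```r
## Sketch: dispatching future code to Slurm via future.batchtools.
## "slurm.tmpl" and the resources values are hypothetical placeholders.
library(future.batchtools)

plan(batchtools_slurm,
     template  = "slurm.tmpl",
     resources = list(ncpus = 1L, walltime = 600L))

## each future becomes a Slurm job; value() blocks until it finishes
f <- future({
  Sys.info()[["nodename"]]
})
value(f)
```

The appeal of this route is that the same future() code runs unchanged under plan(sequential) or plan(multisession) on a laptop, so the Slurm backend is a deployment detail rather than a rewrite.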
#
David,

Slurm is a good tool, especially if you are not doing complicated
scheduling things with it.  It is really designed for HPC, so you
might want to take a quick look at your needs and see whether HPC is
really the thing you want, or whether you might be better off in an
HTC environment like HTCondor.  They are really designed to do
different things in different ways.

Many, if not most, sites seem to end up building HPC clusters, but
many of the users might be better off with HTC instead.  I'd counsel
you to take a scan through the HTCondor documentation and a look at Open
Science Grid, just to get a sense of what the differences are.

For example, with HTCondor, you could configure workstations to be
part of your available resource pool during off hours, or if they are
idle, and it's much harder to do that with something like Slurm.

Anyway, you're buying the shoe, I would just make sure it fits well
before walking a long way with it.

-- bennet