Overnight Cluster (Whitepaper)?
6 messages · ivo welch, Steven Ellis, Ivan Krylov +3 more

We have about 50 different Mac computers, all ARM, distributed across our offices. They range from a few M1s with 8 GB all the way to M4s with 64 GB. (The M4 mini for $600 is an amazing compute engine!) These computers are mostly idle overnight. We have no interest in bitcoin mining, and SETI@home doesn't seem very active any more, either. Alas, it's 2025 now, so maybe there is something better we could do with all this idle compute power for our own statistical analyses. Maybe we could cluster them overnight.

I likely could convince my colleagues to run a cron job (or systemctl -- well, launchctl on macOS) that starts listening at 7pm and stops around 7am, sharing say 80% of their memory and CPU, plus say 32 GB of SSD. I won't be able to actively administer their computers, so the client has to be easy for them to install, turn on, and turn off; it has to accept programs and inputs, cache some of the data, and send back output. (The sharing would only be on the local network, not the entire internet, which should make them more comfortable with it.)

Ideally, we would then have an R frontend (controller) that could run `mclapply` statements on this Franken-computer and be smart about how to distribute the load. For example, an M4 is about 1.5x as fast as an M1 on a single CPU, and it's easy to count up CPUs. If my job is estimated to need 4 GB per core, presumably I wouldn't want to start 50 processes on a computer that has 10 cores and 8 GB. If the frontend estimates that the upload and download will take longer than the savings, it should just forget about distributing the job. And so on: reasonable rules, perhaps indicated by the user and/or assessable from a few local mclapply runs first. It's almost like profiling the job for a few minutes or a few iterations locally, and then deciding whether to send parts of it off to the other compute nodes on this Franken-net.

I am not holding my breath on ChatGPT and artificial intelligence, of course. However, this seems like a hard but feasible engineering problem. Is there a vendor who sells a plug-and-play solution? I am guessing our setup is not unusual, though an upper price bound on the software is of course just the cost of buying one giant homogeneous computer or using Amazon resources.

Pointers appreciated.

/iaw
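As a rough illustration of those "reasonable rules", a frontend might apply checks like the following sketch. All figures here (LAN bandwidth, remote speed-up, per-task time and data sizes) are made-up placeholders, not measurements; the idea is just to cap workers per node by both cores and memory, and to distribute only when a short local profiling run suggests the transfer overhead is worth paying.

## how many workers fit on one node, given cores and memory?
plan_node <- function(cores, mem_gb, mem_per_task_gb) {
  min(cores, floor(mem_gb / mem_per_task_gb))
}

## is shipping the work out faster than just running it locally?
worth_distributing <- function(n_tasks, sec_per_task, mb_per_task,
                               lan_mbps = 900, remote_speedup = 40) {
  local_time    <- n_tasks * sec_per_task                 # run everything here
  transfer_time <- n_tasks * mb_per_task * 8 / lan_mbps   # seconds to move the data
  remote_time   <- local_time / remote_speedup + transfer_time
  remote_time < local_time
}

## profile a few iterations locally first, then decide, e.g.:
## sec_per_task <- system.time(parallel::mclapply(head(inputs, 8), f))[["elapsed"]] / 8
plan_node(cores = 10, mem_gb = 8, mem_per_task_gb = 4)                  # 2 workers, not 10
worth_distributing(n_tasks = 5000, sec_per_task = 2, mb_per_task = 50)  # TRUE here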
Very interesting problem! Have you posted on Hacker News? This is the only such system I have used -- https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/
Dear Ivo Welch,

Sorry for not answering the question you asked (I don't know such a vendor), but here are a few comments that may help.

On Tue, 29 Apr 2025 17:20:25 -0700, ivo welch <ivo.welch at ucla.edu> wrote:
> These computers are mostly idle overnight. We have no interest in bitmining and SETI at home doesn't seem so very active any more, either. Alas, it's 2025 now, so maybe there is something better we could do with all this idle compute power when it comes to our own statistical analyses. Maybe we could cluster them overnight.
The state of the art in volunteer computing is still BOINC, the same system that powers most of the "@home" projects. It lets the user control when to run the jobs and when to stop (e.g. run jobs overnight but only if the system is not under load by something else) and doesn't require the job submitter to be able to log in to the worker nodes or even rely on the nodes being able to accept incoming connections. It's possible to run a BOINC server yourself [1], although the server side will take some work to set up, and the jobs need to be specially packaged. In theory, one could package R as a BOINC app and arrange for it to run jobs serialized into *.rds files, but it's a lot of infrastructure work to place all the moving parts in correct positions (package versions alone are a serious problem with no easy solution).
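To make the "jobs serialized into *.rds files" idea concrete, here is a minimal sketch; the file names and the job layout are invented for illustration, and BOINC's real packaging is considerably more involved.

## submitter side: one unit of work, written out as an .rds file
job <- list(fun  = function(seed, n) { set.seed(seed); mean(rnorm(n)) },
            args = list(seed = 42L, n = 1e6))
saveRDS(job, "job_0042.rds")

## worker side: read the job, run it, write the result back
job    <- readRDS("job_0042.rds")
result <- do.call(job$fun, job$args)
saveRDS(result, "result_0042.rds")

The genuinely hard part mentioned above, keeping R and package versions consistent on every worker, is exactly what this sketch leaves out.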
> Ideally, we would then have a frontend R (controller) that could run `mclapply` statements on this Franken-computer, and be smart enough about how to distribute the load.
One problem with parLapply() is that it expects the cluster object to be a list containing a fixed number of node objects. I've experimented with a similar problem: I needed to distribute jobs between my colleagues' workstations when they could spare some CPU power, letting computers leave and rejoin the cluster at will. In the end, I had to pretend that my 'parallel' cluster always contained an excessive number of nodes (128) and distribute the larger number of smaller sub-tasks dynamically. A general-purpose interface for a volunteer cluster will probably not work as a drop-in replacement for mclapply(). You might be able to achieve part of what you want using 'mirai', telling every worker node to connect to the client node for tasks. BOINC can set memory and CPU core limits, but it might be unable to save you from inefficient job plans. See 'future.batchtools' for an example of an R interface for cluster job submission systems.
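A minimal sketch of the 'mirai' route, under the assumption that the controller's LAN address and port below are placeholders and the task is a toy: the controller listens, each Mac dials in when its overnight window opens, and queued tasks go to whichever daemon is free, so faster machines naturally take more of them.

library(mirai)

## controller: listen on this machine's LAN address (made-up address/port)
daemons(url = "tcp://10.0.0.5:5555")

## each worker Mac runs this single line, e.g. from its 7pm launchctl/cron job:
##   mirai::daemon("tcp://10.0.0.5:5555")

## queue many small tasks; joining daemons pick them up as they become free
tasks   <- lapply(1:200, function(i) mirai(mean(rnorm(n)), n = 1e6))
results <- lapply(tasks, function(m) call_mirai(m)$data)  # wait for each result
daemons(0)                                                # shut the listener down

As noted above, this is not a drop-in mclapply() replacement, but it does tolerate workers joining and leaving while the queue drains.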
Best regards,
Ivan

[1] https://github.com/BOINC/boinc/wiki/BOINC-apps-(introduction)
HTCondor has been around for a long time (originally as "Condor", started in 1988!):
https://htcondor.org/
https://github.com/htcondor/htcondor
https://en.wikipedia.org/wiki/HTCondor
I have no idea how difficult it is to set up, but the developers do offer contract support <https://htcondor.org/uw-support/>.
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
E-mail is sent at my convenience; I don't expect replies outside of working hours.
Aren't most organizations pushing to reduce power consumption at night? Energy costs, thermal wear acceleration, and climate change all point to putting computers to sleep at night unless you have a specific goal in mind. Sounds like a non-problem looking for a solution to me. (I was a BOINC volunteer for several years a couple of decades ago... but got tired of drying-out cpu thermal paste problems.)
Jeff,

There are multiple agendas here. Yes, many devices use power even when not in use, or even when supposedly turned off. A TV that is instant-on because some circuitry stays awake listening for the remote control is another example.

But realistically, turning off many computers every night is not a great idea if they have something to do at night or on weekends. Examples include the housekeeping often scheduled for idle hours, such as checking for updates (and perhaps installing or pre-staging them so you can finish later), defragmenting disks, and other maintenance jobs timed so they do not interfere with the user. Other computers are servers hosting web pages or other resources that may be requested from anywhere in the world at any hour. My PC is set to record TV shows, often in the middle of the night when they are rebroadcast by PBS or some obscure oldies channel, so as not to conflict with prime-time recordings, and so on.

I think, though, that some system resources can be tuned to use less power. Chips can be run at slower speeds. Hard disks can stop spinning until an actual request comes in; in principle, a machine can (almost) shut down and leave a stub that intercepts requests and wakes it up. Services that stay awake to poll for periodic work could instead register with one central resource that listens for anything needing to wake them, which runs more efficiently than hundreds of such services staying awake and asking. A multi-CPU or multi-threaded machine can turn some cores off while the load is low. Monitors, of course, can often be turned off harmlessly, and perhaps even keyboards, mice, printers, and so on.

But realistically, I suspect some devices are designed to just stay on and may even degrade faster if regularly power-cycled. I recall advice from years ago that turning a light on and off repeatedly, as in a bathroom, might cost a nickel each time, so if you are coming back soon enough, consider just leaving it on. The more efficient LED bulbs may be less worth turning off if it turns out that doing so shortens their lives. Who knows?

I note that another user of electricity is the UPS with a backup battery that many of us keep between the computer and the power source. It may protect against crashes or lost work when the power goes out, or bridge a brief outage, but I suspect it uses at least 5% more electricity overall.