[Bioc-devel] BiocParallel: BatchJobs backend (Was: Re: BiocParallel)
This slipped under my radar, sorry. Guess there already is some considerable work going on right now to bring queued clusters closer into the Bioconductor world. If my first attempt at this is still of any help I am happy to share it, of course. Just let me know.

Michael: I tried to install your BiocParallel package but got the following error:

** preparing package for lazy loading
Setting for worker localhost: ncpus=4
Error : objects 'getConfig', 'setConfig' are not exported by 'namespace:BatchJobs'
ERROR: lazy loading failed for package 'BiocParallel'

Checking the source code of the latest BatchJobs version on CRAN, I do not see getConfig or setConfig exported there. Are you relying on a special version of the package, or even a patched one?

Florian
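A minimal sketch of how that export mismatch can be checked from an R session (base-R namespace tools only; the function names are taken from the error message above):

    ## Does the installed BatchJobs export the functions the
    ## BiocParallel backend expects?
    c("getConfig", "setConfig") %in% getNamespaceExports("BatchJobs")
    ## FALSE here means the installed BatchJobs does not export them;
    ## they might still exist unexported (BatchJobs:::getConfig) in a
    ## newer or patched version of the package.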
On 6/25/13 2:17 PM, "Michel Lang" <michellang at gmail.com> wrote:
Hi Henrik,

Sorry for the late response. Suggestions and feedback are always welcome; I had just forgotten to enable the issue tracker (now enabled). For prototyping I usually use Interactive/Multicore, but I regularly test on our local clusters, which run Torque and Slurm respectively.

Michel
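For reference, a minimal sketch of how such a backend is typically selected in BatchJobs, via a .BatchJobs.R configuration file (the template file name is only an example):

    ## .BatchJobs.R -- read by BatchJobs on load
    ## Submit jobs to a Torque/PBS scheduler using a user-supplied
    ## brew template; similar constructors exist for the other systems.
    cluster.functions <- makeClusterFunctionsTorque("torque.tmpl")
    ## For local prototyping one might instead use, e.g.:
    ## cluster.functions <- makeClusterFunctionsMulticore(ncpus = 4)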
2013/6/7 Henrik Bengtsson <hb at biostat.ucsf.edu>:

Great - this looks promising already. What are your test systems, beyond standard SSH and multicore clusters? I'm on a Torque/PBS system. I'm happy to test, give feedback, etc. I don't see an 'Issues' tab on the GitHub page. Michel, how do you prefer to get feedback?

/Henrik

On Thu, Jun 6, 2013 at 5:21 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
And here is the on-going development of the backend: https://github.com/mllg/BiocParallel/tree/batchjobs

Not sure how well it's been tested. Kudos to Michel Lang for making so much progress so quickly.

Michael
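For anyone who wants to try that branch, a minimal sketch of installing it directly from GitHub (assuming a recent devtools is available):

    ## Install the BatchJobs-backend branch of BiocParallel from GitHub.
    library(devtools)
    install_github("mllg/BiocParallel", ref = "batchjobs")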
On Thu, Jun 6, 2013 at 1:59 PM, Dan Tenenbaum <dtenenba at fhcrc.org> wrote:

On Thu, Jun 6, 2013 at 1:56 PM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
Hi, I'd like to pick up the discussion on a BatchJobs backend for BiocParallel where it was left back in Dec 2012 (Bioc-devel thread 'BiocParallel').
Florian, would you mind sharing your BatchJobs backend code? Is it independent of BiocParallel and/or have you tried it with the most recent BiocParallel implementation [https://github.com/Bioconductor/BiocParallel/]?
You should be aware that there is a Google Summer of Code project in progress to address this: http://www.bioconductor.org/developers/gsoc2013/ (towards the bottom).

Dan
/Henrik

On Tue, Dec 4, 2012 at 12:38 PM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
Thanks.

On Tue, Dec 4, 2012 at 3:47 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
I have been booked up, so no chance to deploy yet, but I do have access to SGE and LSF, so I will try it and report ASAP.
...and I'll try it out on PBS (... but I most likely won't have time to do this until the end of the year).

Henrik
On Tue, Dec 4, 2012 at 4:08 AM, Hahne, Florian <florian.hahne at novartis.com> wrote:
Hi Henrik,

I have now come up with a relatively generic version of this SGEcluster approach. It does indeed use BatchJobs under the hood and should thus support all available cluster queues, assuming that the necessary BatchJobs routines are available. I could only test this on our SGE cluster, but Vince wanted to try other queuing systems; not sure how far he got. For now the code is wrapped in a little package called Qcluster with some documentation. If you want, I can send you a version in a separate mail. It would be good to test this on other systems, and I am sure there remain some bugs that need to be ironed out. In particular, the fault tolerance you mentioned needs to be addressed properly: currently the code may leave unwanted garbage if things fail in the wrong places, because all the communication is file-based.

Martin, I'll send you my updated version in case you want to include this in BiocParallel for others to contribute.

Florian
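A minimal sketch of the file-based BatchJobs workflow that such a wrapper drives under the hood (the registry id and toy payload are only illustrative):

    library(BatchJobs)
    ## A file-based registry holds job definitions, logs and results.
    reg <- makeRegistry(id = "qcluster_demo")
    ## One cluster job per input element.
    batchMap(reg, function(x) x^2, 1:10)
    submitJobs(reg)
    waitForJobs(reg)
    ## Collect the results once the scheduler has finished everything.
    res <- reduceResultsList(reg)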
On 12/4/12 5:46 AM, "Henrik Bengtsson" <hb at biostat.ucsf.edu> wrote:

Picking up this thread for lack of other places (= where should BiocParallel be discussed?).

I saw Martin's updates on BiocParallel - great. Florian's SGE scheduler was also mentioned; is that one built on top of BatchJobs? If so, I'd be interested in looking into that/generalizing it to work with any BatchJobs scheduler. I believe there is going to be a new release of BatchJobs rather soon, so it's probably worth waiting until that is available.

The main use case I'm interested in is to launch batch jobs on a PBS/Torque cluster, and then use multicore processing on each compute node. It would be nice to be able to do this using the BiocParallel model, but maybe it is too optimistic to get everything to work under the same model. Also, as Vince hinted, fault tolerance etc. needs to be addressed, and needs to be addressed differently in the different setups.

/Henrik
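A minimal sketch of that nested use case, assuming the cluster functions are configured for Torque (as in the .BatchJobs.R sketch earlier) and using parallel::mclapply on each node; chunk sizes and core counts are illustrative:

    library(BatchJobs)
    library(parallel)
    reg <- makeRegistry(id = "nested_demo")
    ## Outer level: one scheduler job per chunk of 10 inputs.
    chunks <- split(seq_len(100), rep(1:10, each = 10))
    batchMap(reg, function(idx) {
      ## Inner level: use the cores of the allocated compute node.
      mclapply(idx, function(i) sqrt(i), mc.cores = 4)
    }, chunks)
    submitJobs(reg)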
On Tue, Nov 20, 2012 at 6:59 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com> wrote:

On Sat, 17 Nov 2012 13:05:29 -0800, "Ryan C. Thompson" <rct at thompsonclan.org> wrote:
On 11/17/2012 02:39 AM, Ramon Diaz-Uriarte wrote:
In addition to Steve's comment, is it really a good thing that "all code stays the same"? I mean, multiple machines vs. multiple cores are, often, _very_ different things: for instance, shared vs. distributed memory, communication overhead differences, whether or not you can assume packages and objects to be automagically present in the slaves/child processes, etc. So, given they are different situations, I think it sometimes makes sense to want to write different code for each situation (I often do); not to mention Steve's hybrid cases ;-).

Since BiocParallel seems to be a major undertaking, maybe it would be appropriate to provide a flexible approach, instead of hard-wiring the foreach approach.
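As a concrete sketch of that difference: forked multicore workers inherit the parent's objects and loaded packages, while PSOCK-style (multi-machine) workers need them shipped over explicitly (object and package names below are only illustrative):

    library(parallel)
    big <- rnorm(1e6)
    ## Forked workers see `big` automatically:
    mclapply(1:4, function(i) mean(big) + i, mc.cores = 2)
    ## PSOCK workers (the multi-machine model) do not -- export explicitly:
    cl <- makePSOCKcluster(2)
    clusterExport(cl, "big")
    clusterEvalQ(cl, library(stats))   # load required packages on each worker
    parLapply(cl, 1:4, function(i) mean(big) + i)
    stopCluster(cl)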
Of course there are cases where the same code simply can't work for both multicore and multi-machine situations, but those generally don't fall into the category of things that can be done using lapply. Lapply and all of its parallelized buddies like mclapply, parLapply, and foreach are designed for data-parallel operations with no interdependence between results, and these kinds of operations generally parallelize as well across machines as across cores, unless your network is not fast enough (in which case you would choose not to use multi-machine parallelism). If you want a parallel algorithm for something like the disjoin method of GRanges, you might need to write some special-purpose code, and that code might be very different for multicore vs. multi-machine.

So yes, sometimes there is a fundamental reason that you have to change the code to make it run on multiple machines, and neither foreach nor any other parallelization framework will save you from having to rewrite your code. But often there is no fundamental reason that the code has to change, and you end up changing it anyway because of limitations in your parallelization framework. This is the case that foreach saves you from.
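A small sketch of that point with foreach: the same data-parallel loop can be registered against a multicore or a multi-machine backend without touching the loop body (worker counts are illustrative):

    library(foreach)
    library(doParallel)
    library(parallel)
    payload <- function(i) sum(rnorm(1e5, mean = i))
    ## Shared-memory backend on the local machine:
    registerDoParallel(cores = 2)
    res1 <- foreach(i = 1:8) %dopar% payload(i)
    ## The same loop, unchanged, against a PSOCK cluster of workers:
    cl <- makePSOCKcluster(2)
    registerDoParallel(cl)
    res2 <- foreach(i = 1:8) %dopar% payload(i)
    stopCluster(cl)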
Hummm... I guess you are right, and we are talking about "often" or "most of the time", which is where all this would fit. Point taken.

Best,

R.
--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain
Phone: +34-91-497-2412
Email: rdiaz02 at gmail.com
ramon.diaz at iib.uam.es
http://ligarto.org/rdiaz
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel