Hi all, I was wondering whether it would be possible to have proper generics for some of the functions in the parallel package, e.g. parLapply and clusterCall. The reason I am asking is that I want to build an S4 class that essentially looks like an S3 cluster object but knows how to deal with the SGE. That way I can abstract away all the overhead regarding job submission, job status and reducing the results in the parLapply method of that class, and would be able to supply this new cluster object to all of my existing functions that can be processed in parallel using a cluster object as input. I have played around with the BatchJobs package as an abstraction layer to SGE and that works nicely. As a test case I have created the necessary generics myself in order to supply my own SGEcluster object to a function that normally deals with the "regular" parallel package S3 cluster objects, and everything just worked out of the box, but obviously this fails once I am in a namespace and my generic is not found anymore. Of course what we would really want is some proper abstraction of parallelization in R, but for now this seems to be at least a cheap compromise. Any thoughts on this? Florian
[Bioc-devel] parallel package generics
5 days later
On 10/17/2012 05:45 AM, Hahne, Florian wrote:
Hi all, I was wondering whether it would be possible to have proper generics for some of the functions in the parallel package, e.g. parLapply and clusterCall. The reason I am asking is that I want to build an S4 class that essentially looks like an S3 cluster object but knows how to deal with the SGE. That way I can abstract away all the overhead regarding job submission, job status and reducing the results in the parLapply method of that class, and would be able to supply this new cluster object to all of my existing functions that can be processed in parallel using a cluster object as input. I have played around with the BatchJobs package as an abstraction layer to SGE and that works nicely. As a test case I have created the necessary generics myself in order to supply my own SGEcluster object to a function that normally deals with the "regular" parallel package S3 cluster objects, and everything just worked out of the box, but obviously this fails once I am in a namespace and my generic is not found anymore. Of course what we would really want is some proper abstraction of parallelization in R, but for now this seems to be at least a cheap compromise. Any thoughts on this?
Hi Florian -- we talked about this locally, but I guess we didn't actually send any email! Is there an obstacle to promoting these to generics in your own package? The usual motivation for inclusion in BiocGenerics has been to avoid conflicts between packages, but I'm not sure whether this is the case (yet)? This would also add a dependency fairly deep in the hierarchy. What do you think? Martin
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
Hi Martin, I could define the generics in my own package, but that would mean that they would only be available there, or in the global environment assuming that I also export them, or in all additional packages that explicitly import them from my namespace. There already are a whole bunch of packages around that allow for parallelization via a cluster object. Obviously those all import the parLapply function from the parallel package. That means that I can't simply supply my own modified cluster object, because the code that calls parLapply will not know about the generic in my package, even if it is attached. Ideally parLapply would already be a generic function in the parallel package. Not sure who needs to be convinced in order for this to happen, but my gut feeling was that it could be easier to have the generic in BiocGenerics. Maybe I am missing something obvious here, but imo there is no way to override parLapply globally for my own class unless the generic is imported by everyone who wants to make use of the special method. Florian
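The lookup problem described above can be reproduced without building a package. What follows is only a sketch under stated assumptions: `client_ns` is a hypothetical environment standing in for the namespace of a package that imports from parallel, `SGEcluster` and its `queue` slot are illustrative names rather than Florian's actual class, and clusterCall is used instead of parLapply because it is the simpler of the two functions he mentions.

```r
library(methods)
library(parallel)

# Hypothetical stand-in for the namespace of a package that does
# importFrom(parallel, clusterCall): the name is bound to the plain function.
client_ns <- new.env(parent = baseenv())
assign("clusterCall", parallel::clusterCall, envir = client_ns)

# A function "inside" that package; its clusterCall is resolved in client_ns.
client_fun <- function(cl) clusterCall(cl, sqrt, 4)
environment(client_fun) <- client_ns

# A user-level generic derived from parallel::clusterCall, plus an
# illustrative S4 cluster-like class with a method for it.
setGeneric("clusterCall")
setClass("SGEcluster", representation(queue = "character"))
setMethod("clusterCall", "SGEcluster",
          function(cl = NULL, fun, ...) fun(...))

sge <- new("SGEcluster", queue = "all.q")

# Called where the generic is visible, the S4 method is dispatched.
direct <- clusterCall(sge, sqrt, 4)

# Called through the package-like function, the plain parallel::clusterCall
# is found instead, and it rejects the S4 object.
via_client <- tryCatch(client_fun(sge), error = function(e) e)
```

In this sketch `direct` holds the method's result, while `via_client` holds the error condition raised by parallel's own cluster check, which is the situation every importing package would be in.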
In response to a question from yesterday, I pointed someone to the ShortRead `srapply` function and I wondered to myself why it had to necessarily be "buried" in the ShortRead package (aside from it having a `sr` prefix). I had thought it might be a good idea to move that (or something like that) to BiocGenerics (unless implementations aren't allowed there) but also realized that it would add more dependencies where someone might not necessarily need them. But, almost surely, a large majority of the people will be happy to do some form of ||-ization, so in my mind it's not such an onerous thing to add -- on the other hand, this large majority is probably enriched for people who are doing NGS analysis, in which case, keeping it in ShortRead can make some sense. Taking one step back, I recall some chatter last week (or two) about some better ||-ization "primitives" -- something about a pvec doo-dad, and there being ideas to wrap different types of ||-ization behind an easy to use interface (I think this was the convo), and then I took a further step back and often wonder why we just don't bite the bullet and take advantage of the `foreach` infrastructure that is already out there -- in which case, I could imagine a "doSGE" package that might handle the particulars of what Florian is referring to. You could then configure it externally via some `registerDoSGE(some.config.object)` and just have the package code happily run it through `foreach(...) %dopar%` and be done w/ it. ... at least, I thought this is what was being talked about here (and popped up a week or two ago) -- sorry if I completely missed the mark ... -steve
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Hi Steve --
On 10/23/2012 10:20 AM, Vincent Carey wrote:
On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote:
In response to a question from yesterday, I pointed someone to the ShortRead `srapply` function and I wondered to myself why it had to necessarily be "buried" in the ShortRead package (aside from it having a `sr` prefix).
I don't know that srapply necessarily 'got it right'...
I had thought it might be a good idea to move that (or something like that) to BiocGenerics (unless implementations aren't allowed there) but also realized that it would add more dependencies where someone might not necessarily need them. But, almost surely, a large majority of the people will be happy to do some form of ||-ization, so in my mind it's not such an onerous thing to add -- on the other hand, this large majority is probably enriched for people who are doing NGS analysis, in which case, keeping it in ShortRead can make some sense. Taking one step back, I recall some chatter last week (or two) about some better ||-ization "primitives" -- something about a pvec doo-dad, and there being ideas to wrap different types of ||-ization behind an easy to use interface (I think this was the convo), and then I took a further step back and often wonder why we just don't bite the bullet and take advantage of the `foreach` infrastructure that is already out there -- in which case, I could imagine a "doSGE" package that might handle the particulars of what Florian is referring to. You could then configure it externally via some `registerDoSGE(some.config.object)` and just have the package code happily run it through `foreach(...) %dopar%` and be done w/ it.
IMHO it is relevant. I have not looked for other abstractions, and this one seems to work. Florian's objectives might be a good test case for adequacy.
The registerDoDah does seem to be a useful abstraction. I think there's a lot of work to do for some sort of coordinated parallelization that putting parLapply into BiocGenerics might encourage; not good things will happen when everyone in a call stack tries to parallelize independently. But I'm in favor of parLapply in BiocGenerics at least for the moment. Martin
On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:
I agree that it would be fruitful to have parLapply in BiocGenerics. It looks to be a flexible abstraction and its presence in the parallel package makes it ubiquitous. If it hasn't been done already, mclapply (and mcmapply) would be good candidates, as well. The fork-based parallelism is substantively different in terms of the API from the more general parallelism of parLapply. Someone was working on some more robust and convenient wrappers around mclapply. Did that ever see the light of day?
If you are referring to http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660 in which I had offered some small changes to parallel::pvec https://gist.github.com/3757873/ and after which Martin had provided me with a route I have not (yet?) followed toward submitting a patch to R for consideration by R-devel / Simon Urbanek in http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:
In response to a question from yesterday, I pointed someone to the ShortRead `srapply` function and I wondered to myself why it had to necessarily be "buried" in the ShortRead package (aside from it having a `sr` prefix).
I don't know that srapply necessarily 'got it right'...
One thing I like about srapply is its support for a reduce argument.
I had thought it might be a good idea to move that (or something like that) to BiocGenerics (unless implementations aren't allowed there) but also realized that it would add more dependencies where someone might not necessarily need them.
But, almost surely, a large majority of the people will be happy to do some form of ||-ization, so in my mind it's not such an onerous thing to add -- on the other hand, this large majority is probably enriched for people who are doing NGS analysis, in which case, keeping it in ShortRead can make some sense.
I remain confused about the need for putting any of this into BiocGenerics at all. It seems to me that properly construed parallelization primitives ought to 'just work' with any object which supports indexing and length. I would appreciate hearing arguments to the contrary. Florian, in a similar vein, could we not seek to change parallel::makeCluster to be extensible to, say, support an SGE cluster? This seems like the 'right thing to do'. ??? Regardless, I think we have raised some considerations that might inform improvements to parallel, including points about error handling, reducing results, block-level parallelization over List/Vector (in addition to vector), etc. I think perhaps having a Google doc that we can collectively edit to contain the requirements we are trying to achieve might move us forward effectively. Would this help? Or perhaps a page under http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
Taking one step back, I recall some chatter last week (or two) about some better ||-ization "primitives" -- something about a pvec doo-dad, and there being ideas to wrap different types of ||-ization behind an easy to use interface (I think this was the convo), and then I took a further step back and often wonder why we just don't bite the bullet and take advantage of the `foreach` infrastructure that is already out there -- in which case, I could imagine a "doSGE" package that might handle the particulars of what Florian is referring to. You could then configure it externally via some `registerDoSGE(some.config.object)` and just have the package code happily run it through `foreach(...) %dopar%` and be done w/ it.
IMHO it is relevant. I have not looked for other abstractions, and this one seems to work. Florian's objectives might be a good test case for adequacy.
The registerDoDah does seem to be a useful abstraction.
Is this not more-or-less the intention of parallel::setDefaultCluster? --Malcolm
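For comparison, registration with parallel::setDefaultCluster looks roughly like this. It is a minimal sketch that assumes two local worker processes can be started on the machine; the squaring function is only a placeholder workload.

```r
library(parallel)

cl <- makeCluster(2)      # two local worker processes
setDefaultCluster(cl)     # register the cluster once, up front

# With cl = NULL the snow-family functions fall back to the registered cluster.
squares <- parLapply(NULL, 1:4, function(i) i^2)

stopCluster(cl)
setDefaultCluster(NULL)   # make sure nothing stays registered
```

Registration selects which cluster is used, but the registered object still has to pass parallel's own cluster check, so on its own it does not give dispatch on a new class of cluster-like objects.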
Hi,
With Florian's use case, there seems to be a strong/immediate need for
dispatching on the cluster-like object passed as the 1st argument to
parLapply() and all the other functions in the parallel package that
belong to the "snow family" (14 functions in total, all documented in
?parallel::parLapply). So we've just added those 14 generics to
BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
mclapply(), mcmapply(), and pvec()) for now.
Note that the 14 new generics dispatch at least on their 1st argument
('cl'), but also on their 2nd argument when this argument is 'x', 'X'
or 'seq' (expected to be a vector-like or matrix-like object). This
opens the door to defining methods that take advantage of the
implementation of particular vector-like or matrix-like objects.
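A rough illustration of what dispatching on both arguments makes possible is sketched below. It does not use BiocGenerics: the generic is defined locally, and the `Blocks` class with its `blocks` slot is a hypothetical container invented for the example.

```r
library(methods)

# Local stand-in for a parLapply generic that dispatches on 'cl' and 'X'.
setGeneric("parLapply",
           function(cl = NULL, X, fun, ...) standardGeneric("parLapply"),
           signature = c("cl", "X"))

setClass("FakeCluster", representation(nnodes = "integer"))
setClass("Blocks", representation(blocks = "list"))  # hypothetical container

# Fallback for any vector-like X: apply fun element by element.
setMethod("parLapply", c("FakeCluster", "ANY"),
          function(cl = NULL, X, fun, ...) lapply(X, fun, ...))

# Container-aware method: hand each pre-built block to fun as a whole.
setMethod("parLapply", c("FakeCluster", "Blocks"),
          function(cl = NULL, X, fun, ...) lapply(X@blocks, fun, ...))

cl <- new("FakeCluster", nnodes = 2L)
by_element <- parLapply(cl, 1:4, sum)
by_block <- parLapply(cl, new("Blocks", blocks = list(1:2, 3:4)), sum)
```

Here the same call works per element for a plain vector and per block for the container, with the choice made by the class of the second argument.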
Also note that, even if some of the 14 functions in the "snow family"
are simple convenience wrappers to other functions in the family, we've
made all of them generics. For example clusterEvalQ() is a simple
wrapper to clusterCall():
> clusterEvalQ
function (cl = NULL, expr)
clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
<environment: namespace:parallel>
And it seems (at least intuitively) that implementing a "clusterCall"
method for my cluster-like objects should be enough to have
clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
this is not the case:
  setClass("FakeCluster", representation(nnodes="integer"))
  setMethod("clusterCall", "FakeCluster",
      function (cl=NULL, fun, ...) fun(...)
  )
Then:
> mycluster <- new("FakeCluster", nnodes=10L)
> clusterCall(mycluster, print, 1:6)
[1] 1 2 3 4 5 6
> clusterEvalQ(mycluster, print(1:6))
Error in checkCluster(cl) : not a valid cluster
This is because the "clusterEvalQ" default method is calling
parallel::clusterCall() (which is *not* the generic), instead of
calling BiocGenerics::clusterCall() (which *is* the generic).
This would be avoided if clusterCall() was a generic defined in
the parallel package itself (or in a package that parallel depends
on). And this would of course be a better solution than having those
generics in BiocGenerics. Is someone willing to bring that case to
R-devel?
In the meantime I need to define a "clusterEvalQ" method:
  setMethod("clusterEvalQ", "FakeCluster",
      function (cl=NULL, expr)
          clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
  )
And then:
> clusterEvalQ(mycluster, print(1:6))
[1] 1 2 3 4 5 6
Finally note that this method I defined for my objects could be made the
default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we
could put it in BiocGenerics. Or, since there is apparently nothing to
win by having clusterEvalQ() being a generic in the first place, we
could redefine clusterEvalQ() as an ordinary function in BiocGenerics.
This function would be implemented *exactly* like
parallel::clusterEvalQ() (and it would mask it), except that now
it would call BiocGenerics::clusterCall() internally.
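The "ordinary function" option could look something like the following sketch. It is self-contained and does not load BiocGenerics: the local `setGeneric("clusterCall")` is only a stand-in for BiocGenerics::clusterCall, and `sum(1:6)` replaces the `print(1:6)` call so that the result can be inspected.

```r
library(methods)
library(parallel)

# Local stand-in for the clusterCall generic (derived from parallel's function).
setGeneric("clusterCall")

setClass("FakeCluster", representation(nnodes = "integer"))
setMethod("clusterCall", "FakeCluster",
          function(cl = NULL, fun, ...) fun(...))

# Ordinary function with the same body as parallel::clusterEvalQ. Because it
# is defined next to the generic, its clusterCall() resolves to the generic.
clusterEvalQ <- function(cl = NULL, expr)
    clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)

mycluster <- new("FakeCluster", nnodes = 10L)
res <- clusterEvalQ(mycluster, sum(1:6))
```

With this arrangement a class only needs a "clusterCall" method for clusterEvalQ() to work on it, at the cost of masking parallel::clusterEvalQ() for anyone who attaches the package that defines the replacement.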
What should we do?
H.
On 10/24/2012 09:07 AM, Cook, Malcolm wrote:
On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:
I agree that it would fruitful to have parLapply in BiocGenerics. It looks to be a flexible abstraction and its presence in the parallel package makes it ubiquitous. If it hasn't been done already, mclapply (and mcmapply) would be good candidates, as well. The fork-based parallelism is substantively different in terms of the API from the more general parallelism of parLapply. Someone was working on some more robust and convenient wrappers around mclapply. Did that ever see the light of day?
If you are referring to http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660 in which I had offered some small changes to parallel::pvec https://gist.github.com/3757873/ and after which Martin had provided me with a route I have not (yet?) followed toward submitting a patch to R for consideration by R-devel / Simon Urbanek in http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-oth er-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com**> wrote: In response to a question from yesterday, I pointed someone to the
ShortRead `srapply` function and I wondered to myself why it had to necessarily by "burried" in the ShortRead package (aside from it having a `sr` prefix).
I don't know that srapply necessarily 'got it right'...
One thing I like about srapply is its support for a reduce argument.
I had thought it might be a good idea to move that (or something like that) to BiocGenerics (unless implementations aren't allowed there) but also realized that it would add more dependencies where someone might not necessarily need them.
But, almost surely, a large majority of the people will be happy to do some form of ||-ization, so in my mind it's not such an onerous thing to add -- on the other hand, this large majority is probably enriched for people who are doing NGS analysis, in which case, keeping it in ShortRead can make some sense.
I remain confused about the need for putting any of this into BiocGenerics at all. It seems to me that properly construed parallelization primitives ought to 'just work' with any object which supports indexing and length. I would appreciate hearing arguments to the contrary. Florian, in a similar vein, could we not seek to change parallel::makeCluster to be extensible to, say, support an SGE cluster? This seems like the 'right thing to do'. ??? Regardless, I think we have raised some considerations that might inform improvements to parallel, including points about error handling, reducing results, block-level parallelization over List/Vector (in addition to vector), etc. I think perhaps having a google doc that we can collectively edit to contain the requirements we are trying to achieve might move us forward effectively. Would this help? Or perhaps a page under http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
Taking one step back, I recall some chatter last week (or two) about some better ||-ization "primitives" -- something about a pvec doo-dad, and there being ideas to wrap different types of ||-ization behind an easy to use interface (I think this was the convo), and then I took a further step back and often wonder why we just don't bite the bullet and take advantage of the `foreach` infrastructure that is already out there -- in which case, I could imagine a "doSGE" package that might handle the particulars of what Florian is referring to. You could then configure it externally via some `registerDoSGE(some.config.object)` and just have the package code happily run it through `foreach(...) %dopar%` and be done w/ it.
IMHO it is relevant. I have not looked for other abstractions, and this one seems to work. Florian's objectives might be a good test case for adequacy.
The registerDoDah does seem to be a useful abstraction.
Is this not more-or-less the intention of parallel::setDefaultCluster? --Malcolm
I think there's a lot of work to do for some sort of coordinated parallelization that putting parLapply into BiocGenerics might encourage; not good things will happen when everyone in a call stack tries to parallelize independently. But I'm in favor of parLapply in BiocGenerics at least for the moment. Martin
... at least, I thought this is what was being talked about here (and popped up a week or two ago) -- sorry if I completely missed the mark ... -steve
On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian <florian.hahne at novartis.com> wrote:
Hi Martin, I could define the generics in my own package, but that would mean that those will only be available there, or in the global environment assuming that I also export them, or in all additional packages that explicitly import them from my namespace. Now there already are a whole bunch of packages around that all allow for parallelization via a cluster object. Obviously those all import the parLapply function from the parallel package. That means that I can't simply supply my own modified cluster object, because the code that calls parLapply will not know about the generic in my package, even if it is attached. Ideally parLapply would be a generic function already in the parallel package. Not sure who needs to be convinced in order for this to happen, but my gut feeling was that it could be easier to have the generic in BiocGenerics. Maybe I am missing something obvious here, but imo there is no way to override parLapply globally for my own class unless the generic is imported by everyone who wants to make use of the special method. Florian --
On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote: On 10/17/2012 05:45 AM, Hahne, Florian wrote:
Hi all, I was wondering whether it would be possible to have proper generics for some of the functions in the parallel package, e.g. parLapply and clusterCall. The reason I am asking is because I want to build an S4 class that essentially looks like an S3 cluster object but knows how to deal with the SGE. That way I can abstract away all the overhead regarding job submission, job status and reducing the results in the parLapply method of that class, and would be able to supply this new cluster object to all of my existing functions that can be processed in parallel using a cluster object as input. I have played around with the BatchJobs package as an abstraction layer to SGE and that works nicely. As a test case I have created the necessary generics myself in order to supply my own SGEcluster object to a function that normally deals with the "regular" parallel package S3 cluster objects and everything just worked out of the box, but obviously this fails once I am in a namespace and my generic is not found anymore. Of course what we would really want is some proper abstraction of parallelization in R, but for now this seems to be at least a cheap compromise. Any thoughts on this?
Hi Florian -- we talked about this locally, but I guess we didn't actually send any email! Is there an obstacle to promoting these to generics in your own package? The usual motivation for inclusion in BiocGenerics has been to avoid conflicts between packages, but I'm not sure whether this is the case (yet)? This would also add a dependency fairly deep in the hierarchy. What do you think? Martin Florian
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
--
On 10/24/12 6:07 PM, "Cook, Malcolm" <MEC at stowers.org> wrote:
I remain confused about the need for putting any of this into BiocGenerics at all. It seems to me that properly construed parallelization primitives ought to 'just work' with any object which supports indexing and length.
That may well be true, but currently we have a whole lot of legacy code where folks opted to use one of the snow-family functions as the one and only mode for parallel processing. My argument here is simply that we could make all of these implementations much more flexible by allowing for arbitrary cluster objects in there, at very little cost for the package maintainers. Obviously we would ideally have a proper parallel abstraction layer that everybody agrees to use, but currently we are still far, far away from that. Plus the problem of legacy code remains.
I would appreciate hearing arguments to the contrary. Florian, in a similar vein, could we not seek to change parallel::makeCluster to be extensible to, say, support an SGE cluster? This seems like the 'right thing to do'. ???
Not necessarily. I can't see how this could easily be achieved without knowing all the possible cluster subtypes in advance. We could not turn it into a generic and dispatch to different methods because the signatures are not necessarily different. It seems to me that the paradigm, at least in Bioconductor, for these cases is to have a class that encapsulates all the necessary setting parameters, and a constructor function for this class. Again, if we had something more generic in place we certainly would not need all of this.
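To make the paradigm described above concrete, here is a minimal sketch of a class that encapsulates cluster settings together with a constructor function. The class name SGEcluster comes from this thread, but the slots `queue` and `nnodes`, their defaults, and the validity rule are hypothetical choices made for illustration only; they are not an existing API.

```r
library(methods)

## Hypothetical settings class: 'queue' and 'nnodes' are illustrative slots.
setClass("SGEcluster",
    representation(queue = "character", nnodes = "integer"),
    validity = function(object) {
        n <- object@nnodes
        if (length(n) != 1L || is.na(n) || n < 1L)
            return("'nnodes' must be a single positive integer")
        TRUE
    }
)

## Constructor function: the one place where defaults and coercion live.
SGEcluster <- function(queue = "all.q", nnodes = 1L)
    new("SGEcluster", queue = queue, nnodes = as.integer(nnodes))

## Let the object answer length() like a regular cluster object would.
setMethod("length", "SGEcluster", function(x) x@nnodes)

cl <- SGEcluster(queue = "long.q", nnodes = 4)
length(cl)
```

Because all settings travel with the object, methods for generics such as parLapply could dispatch on the class alone, without makeCluster having to know every cluster subtype in advance.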
On 10/24/12 5:08 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
Hi,
With Florian's use case, there seems to be a strong/immediate need for
dispatching on the cluster-like object passed as the 1st argument to
parLapply() and all the other functions in the parallel package that
belong to the "snow family" (14 functions in total, all documented in
?parallel::parLapply). So we've just added those 14 generics to
BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
mclapply(), mcmapply(), and pvec()) for now.
Note that the 14 new generics dispatch at least on their 1st argument
('cl'), but also on their 2nd argument when this argument is 'x', 'X'
or 'seq' (expected to be a vector-like or matrix-like object). This
opens the door to defining methods that take advantage of the
implementation of particular vector-like or matrix-like objects.
Also note that, even if some of the 14 functions in the "snow family"
are simple convenience wrappers to other functions in the family, we've
made all of them generics. For example clusterEvalQ() is a simple
wrapper to clusterCall():
> clusterEvalQ
function (cl = NULL, expr)
clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
<environment: namespace:parallel>
And it seems (at least intuitively) that implementing a "clusterCall"
method for my cluster-like objects should be enough to have
clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
this is not the case:
setClass("FakeCluster", representation(nnodes="integer"))
setMethod("clusterCall", "FakeCluster",
    function (cl=NULL, fun, ...) fun(...)
)
Then:
> mycluster <- new("FakeCluster", nnodes=10L)
> clusterCall(mycluster, print, 1:6)
[1] 1 2 3 4 5 6
> clusterEvalQ(mycluster, print(1:6))
Error in checkCluster(cl) : not a valid cluster
This is because the "clusterEvalQ" default method is calling
parallel::clusterCall() (which is *not* the generic), instead of
calling BiocGenerics::clusterCall() (which *is* the generic).
This would be avoided if clusterCall() was a generic defined in
the parallel package itself (or in a package that parallel depends
on). And this would of course be a better solution than having those
generics in BiocGenerics. Is someone willing to bring that case to
R-devel?
In the mean time I need to define a "clusterEvalQ" method:
setMethod("clusterEvalQ", "FakeCluster",
    function (cl=NULL, expr)
        clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
)
And then:
> clusterEvalQ(mycluster, print(1:6))
[1] 1 2 3 4 5 6
Finally note that this method I defined for my objects could be made the default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we could put it in BiocGenerics. Or, since there is apparently nothing to win by having clusterEvalQ() being a generic in the first place, we could redefine clusterEvalQ() as an ordinary function in BiocGenerics. This function would be implemented *exactly* like parallel::clusterEvalQ() (and it would mask it), except that now it would call BiocGenerics::clusterCall() internally. What should we do?
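The second option above (an ordinary clusterEvalQ() that calls the generic clusterCall() internally) could look roughly like the sketch below. To stay self-contained it defines a stand-in clusterCall generic locally instead of loading BiocGenerics, so the generic's exact signature here is an assumption; FakeCluster is the toy class from this message.

```r
library(methods)

## Stand-in for the clusterCall generic (assumed signature; dispatches on 'cl').
setGeneric("clusterCall",
    function(cl = NULL, fun, ...) standardGeneric("clusterCall"),
    signature = "cl")

setClass("FakeCluster", representation(nnodes = "integer"))

## Toy method: just evaluate the call locally.
setMethod("clusterCall", "FakeCluster",
    function(cl = NULL, fun, ...) fun(...)
)

## Ordinary function with the same body as parallel::clusterEvalQ().
## Because it is defined next to the generic, 'clusterCall' resolves
## to the generic rather than to parallel::clusterCall().
clusterEvalQ <- function(cl = NULL, expr)
    clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)

mycluster <- new("FakeCluster", nnodes = 10L)
res <- clusterEvalQ(mycluster, (1:6) * 2L)
```

With this arrangement only a clusterCall method is needed per cluster class; the convenience wrapper picks it up without a clusterEvalQ method of its own.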
We have the identical problem already when we try to use parallel::mclapply on a BioC List (e.g. a GRangesList). Witness:

The casual user (ehrm, myself at least) expects that since I can `lapply` over a BioC GRangesList (or any other List), I should be able to mclapply over it. Sadly the casual user is wrong, and gets an error. Why? Because parallel::mclapply(X, ...) calls as.list on X, which yields 'Error in as.list.default : no method for coercing this S4 class to a vector'.

But, you say, IRanges defines as.list for Lists, as can be demonstrated by calling as.list(myGRL) on a GRangesList. Here I yield the floor to someone who can explain why this is so, for I have not studied enough how namespaces/packages/symbol tables/whatever work in R. Anyone?

Regardless, one BAD workaround I found to work is to snarf (tm) the source for mclapply and evaluate it in the global environment, after prefixing all parallel internal functions with 'parallel:::'. After doing this, the modified mclapply works as one might expect. So there is at least an issue regarding how method dispatch works across namespaces. Again I yield the floor, but expect that it can be fixed.

BUT, FURTHERMORE, MCLAPPLY SHOULD NOT COERCE X TO LIST ANYWAY

Why? Because calling `as.list` incurs the overhead of (needlessly!?!) coercing this nice tight GRangesList into a base::list. There is NO REASON for it to be coercing X to a list at all. By my lights, mclapply only needs `length` and `seq_along` defined on X, which ARE ALREADY available to a GRangesList from Vector. Indeed, comment out the X <- as.list(X) coercion in mclapply and, lo, it still works on a GRangesList as hoped, and on a 1000-element GRangesList takes ~18x less user time to mclapply(myGRL, length). (It is even shorter just to use elementLengths, but that is not the point.)

In this case the solution appears to be to FIX the upstream package so that method dispatch works correctly (I expect that length and seq_along are only visible to my snarfed mclapply and would suffer from a similar error without addressing the package issue).

Indeed, similarly, in my proposed changes to parallel::pvec, I found a simple change that made it work with Vector as well as vector, since Vector implements `[` and `length`.

I still think the solution to getting an SGE (et al.) parallel back-end is to seek to improve the upstream package to make it 'pluggable' for different parallel backends. I don't think I'm the right person to represent this to R-devel, as obviously I am not schooled (yet!?!?) in the workings of S3/S4/signatures/methods/etc.

Hervé, I have a hunch that your 'In the mean time' solution is a workaround that has the potential to invite further confusion.

Anyone, as a perhaps related issue, and as an opportunity to educate me, can you explain why untrace does NOT completely work on `lapply` (with BiocGenerics loaded)? Viz:

trace(lapply)
untrace(lapply)
IRanges(1,2)
IRanges of length 1
trace: lapply(dots, methods:::.class1)
....

--Malcolm
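As a rough illustration of the point that only `length` and element extraction are needed, the sketch below iterates over indices instead of coercing the container to a list. This is not the patch discussed above, and it avoids depending on IRanges: ToyRangesList and index_lapply are hypothetical names, and the toy class deliberately has no as.list method. mc.cores defaults to 1 so the code also runs where forking is unavailable.

```r
library(methods)

## Hypothetical list-like S4 container with only length() and [[ defined.
setClass("ToyRangesList", representation(starts = "integer", ends = "integer"))
setMethod("length", "ToyRangesList", function(x) length(x@starts))
setMethod("[[", "ToyRangesList",
    function(x, i, j, ...) seq(x@starts[i], x@ends[i]))

## Hand mclapply a plain integer vector of indices; each task extracts
## its own element, so X itself is never passed through as.list().
index_lapply <- function(X, FUN, ..., mc.cores = 1L) {
    parallel::mclapply(seq_len(length(X)),
                       function(i) FUN(X[[i]], ...),
                       mc.cores = mc.cores)
}

x <- new("ToyRangesList", starts = c(1L, 5L, 10L), ends = c(3L, 9L, 10L))
widths <- index_lapply(x, length)
```

On a Unix-alike, mc.cores can be raised above 1 to fork workers; the container is then shared with the children by the fork rather than converted up front.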
H.
-- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
For me, the cleanest option with the least impact would be to have this fixed directly in the parallel package. However, I think that somebody with more influence should suggest that to R-devel. If they will not do it, the other options all seem more or less equivalent to me. Florian
On 10/25/12 12:08 AM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>Hi,
>
>With Florian's use case, there seems to be a strong/immediate need for
>dispatching on the cluster-like object passed as the 1st argument to
>parLapply() and all the other functions in the parallel package that
>belong to the "snow family" (14 functions in total, all documented in
>?parallel::parLapply). So we've just added those 14 generics to
>BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
>mclapply(), mcmapply(), and pvec()) for now.
>
>Note that the 14 new generics dispatch at least on their 1st argument
>('cl'), but also on their 2nd argument when this argument is 'x', 'X'
>or 'seq' (expected to be a vector-like or matrix-like object). This
>opens the door to defining methods that take advantage of the
>implementation of particular vector-like or matrix-like objects.
>
>Also note that, even if some of the 14 functions in the "snow family"
>are simple convenience wrappers to other functions in the family, we've
>made all of them generics. For example clusterEvalQ() is a simple
>wrapper to clusterCall():
>
> > clusterEvalQ
> function (cl = NULL, expr)
> clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
> <environment: namespace:parallel>
>
>And it seems (at least intuitively) that implementing a "clusterCall"
>method for my cluster-like objects should be enough to have
>clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
>this is not the case:
>
> setClass("FakeCluster", representation(nnodes="integer"))
>
> setMethod("clusterCall", "FakeCluster",
>     function (cl=NULL, fun, ...) fun(...)
> )
>
>Then:
>
> > mycluster <- new("FakeCluster", nnodes=10L)
> > clusterCall(mycluster, print, 1:6)
> [1] 1 2 3 4 5 6
> > clusterEvalQ(mycluster, print(1:6))
> Error in checkCluster(cl) : not a valid cluster
>
>This is because the "clusterEvalQ" default method is calling
>parallel::clusterCall() (which is *not* the generic), instead of
>calling BiocGenerics::clusterCall() (which *is* the generic).
>
>This would be avoided if clusterCall() was a generic defined in
>the parallel package itself (or in a package that parallel depends
>on). And this would of course be a better solution than having those
>generics in BiocGenerics. Is someone willing to bring that case to
>R-devel?
>
>In the meantime I need to define a "clusterEvalQ" method:
>
> setMethod("clusterEvalQ", "FakeCluster",
>     function (cl=NULL, expr)
>         clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
> )
>
>And then:
>
> > clusterEvalQ(mycluster, print(1:6))
> [1] 1 2 3 4 5 6
>
>Finally note that this method I defined for my objects could be made the
>default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we
>could put it in BiocGenerics. Or, since there is apparently nothing to
>win by having clusterEvalQ() being a generic in the first place, we
>could redefine clusterEvalQ() as an ordinary function in BiocGenerics.
>This function would be implemented *exactly* like
>parallel::clusterEvalQ() (and it would mask it), except that now
>it would call BiocGenerics::clusterCall() internally.
>
>What should we do?
>
>H.
>
>
>On 10/24/2012 09:07 AM, Cook, Malcolm wrote:
>> On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:
>>
>>> I agree that it would be fruitful to have parLapply in BiocGenerics. It
>>> looks to be a flexible abstraction and its presence in the parallel
>>> package makes it ubiquitous. If it hasn't been done already, mclapply
>>> (and mcmapply) would be good candidates, as well. The fork-based
>>> parallelism is substantively different in terms of the API from the more
>>> general parallelism of parLapply.
>>>
>>> Someone was working on some more robust and convenient wrappers around
>>> mclapply. Did that ever see the light of day?
>>
>>
>> If you are referring to
>>
>>http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660
>>
>> in which I had offered some small changes to parallel::pvec
>>
>> https://gist.github.com/3757873/
>>
>> and after which Martin had provided me with a route I have not (yet?)
>> followed toward submitting a patch to R for consideration by R-devel /
>> Simon Urbanek in
>>
>>
>>http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
>>
>>
>>
>>
>>>>> On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou <
>>>>> mailinglist.honeypot at gmail.com> wrote:
>>>>>
>>>>>> In response to a question from yesterday, I pointed someone to the
>>>>>> ShortRead `srapply` function and I wondered to myself why it had to
>>>>>> necessarily be "buried" in the ShortRead package (aside from it
>>>>>> having a `sr` prefix).
>>>>>>
>>>>>
>>>> I don't know that srapply necessarily 'got it right'...
>>
>>
>> One thing I like about srapply is its support for a reduce argument.
>>
>>>>>> I had thought it might be a good idea to move that (or something like
>>>>>> that) to BiocGenerics (unless implementations aren't allowed there)
>>>>>> but also realized that it would add more dependencies where someone
>>>>>> might not necessarily need them.
>>>>>>
>>>>>> But, almost surely, a large majority of the people will be happy to do
>>>>>> some form of ||-ization, so in my mind it's not such an onerous thing
>>>>>> to add -- on the other hand, this large majority is probably enriched
>>>>>> for people who are doing NGS analysis, in which case, keeping it in
>>>>>> ShortRead can make some sense.
>>
>> I remain confused about the need for putting any of this into
>> BiocGenerics at all. It seems to me that properly construed
>> parallelization primitives ought to 'just work' with any object which
>> supports indexing and length.
>>
>> I would appreciate hearing arguments to the contrary.
>>
>> Florian, in a similar vein, could we not seek to change
>> parallel::makeCluster to be extensible to, say, support an SGE cluster?
>> This seems like the 'right thing to do'. ???
>>
>>
>> Regardless, I think we have raised some considerations that might inform
>> improvements to parallel, including points about error handling, reducing
>> results, block-level parallelization over List/Vector (in addition to
>> vector), etc.
>>
>> I think perhaps having a google doc that we can collectively edit to
>> contain the requirements we are trying to achieve might move us forward
>> effectively. Would this help? Or perhaps a page under
>> http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
>>
>>
>>>>>> Taking one step back, I recall some chatter last week (or two) about
>>>>>> some better ||-ization "primitives" -- something about a pvec doo-dad,
>>>>>> and there being ideas to wrap different types of ||-ization behind an
>>>>>> easy to use interface (I think this was the convo), and then I took a
>>>>>> further step back and often wonder why we just don't bite the bullet
>>>>>> and take advantage of the `foreach` infrastructure that is already out
>>>>>> there -- in which case, I could imagine a "doSGE" package that might
>>>>>> handle the particulars of what Florian is referring to. You could then
>>>>>> configure it externally via some `registerDoSGE(some.config.object)`
>>>>>> and just have the package code happily run it through `foreach(...)
>>>>>> %dopar%` and be done w/ it.
>>>>>>
>>>>> IMHO it is relevant. I have not looked for other abstractions, and this
>>>>> one seems to work. Florian's objectives might be a good test case for
>>>>> adequacy.
>>>>>
>>>>
>>>> The registerDoDah does seem to be a useful abstraction.
>>
>> Is this not more-or-less the intention of parallel::setDefaultCluster?
>>
>> --Malcolm
>>
>>
>>
>>>>
>>>> I think there's a lot of work to do for some sort of coordinated
>>>> parallelization that putting parLapply into BiocGenerics might
>>>> encourage; not good things will happen when everyone in a call stack
>>>> tries to parallelize independently. But I'm in favor of parLapply in
>>>> BiocGenerics at least for the moment.
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>>>
>>>>>> ... at least, I thought this is what was being talked about here (and
>>>>>> popped up a week or two ago) -- sorry if I completely missed the mark
>>>>>> ...
>>>>>>
>>>>>> -steve
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
>>>>>> <florian.hahne at novartis.com> wrote:
>>>>>>
>>>>>>> Hi Martin,
>>>>>>> I could define the generics in my own package, but that would mean
>>>>>>> that those will only be available there, or in the global environment
>>>>>>> assuming that I also export them, or in all additional packages that
>>>>>>> explicitly import them from my name space. Now there already are a
>>>>>>> whole bunch of packages around that all allow for parallelization via
>>>>>>> a cluster object. Obviously those all import the parLapply function
>>>>>>> from the parallel package. That means that I can't simply supply my
>>>>>>> own modified cluster object, because the code that calls parLapply
>>>>>>> will not know about the generic in my package, even if it is
>>>>>>> attached. Ideally parLapply would be a generic function already in
>>>>>>> the parallel package. Not sure who needs to be convinced in order for
>>>>>>> this to happen, but my gut feeling was that it could be easier to
>>>>>>> have the generic in BiocGenerics.
>>>>>>> Maybe I am missing something obvious here, but imo there is no way to
>>>>>>> overwrite parLapply globally for my own class unless the generic is
>>>>>>> imported by everyone who wants to make use of the special method.
>>>>>>> Florian
>
>--
>Hervé Pagès
>
>Program in Computational Biology
>Division of Public Health Sciences
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N, M1-B514
>P.O. Box 19024
>Seattle, WA 98109-1024
>
>E-mail: hpages at fhcrc.org
>Phone: (206) 667-5791
>Fax: (206) 667-1319
--
On 10/25/12 6:44 PM, "Cook, Malcolm" <MEC at stowers.org> wrote:
On 10/24/12 5:08 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
Hi,

With Florian's use case, there seems to be a strong/immediate need for
dispatching on the cluster-like object passed as the 1st argument to
parLapply() and all the other functions in the parallel package that
belong to the "snow family" (14 functions in total, all documented in
?parallel::parLapply). So we've just added those 14 generics to
BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
mclapply(), mcmapply(), and pvec()) for now.

Note that the 14 new generics dispatch at least on their 1st argument
('cl'), but also on their 2nd argument when this argument is 'x', 'X'
or 'seq' (expected to be a vector-like or matrix-like object). This
opens the door to defining methods that take advantage of the
implementation of particular vector-like or matrix-like objects.

Also note that, even if some of the 14 functions in the "snow family"
are simple convenience wrappers to other functions in the family, we've
made all of them generics. For example clusterEvalQ() is a simple
wrapper to clusterCall():

  > clusterEvalQ
  function (cl = NULL, expr)
  clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
  <environment: namespace:parallel>

And it seems (at least intuitively) that implementing a "clusterCall"
method for my cluster-like objects should be enough to have
clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
this is not the case:

  setClass("FakeCluster", representation(nnodes="integer"))

  setMethod("clusterCall", "FakeCluster",
      function (cl=NULL, fun, ...) fun(...)
  )

Then:

  > mycluster <- new("FakeCluster", nnodes=10L)
  > clusterCall(mycluster, print, 1:6)
  [1] 1 2 3 4 5 6
  > clusterEvalQ(mycluster, print(1:6))
  Error in checkCluster(cl) : not a valid cluster

This is because the "clusterEvalQ" default method is calling
parallel::clusterCall() (which is *not* the generic), instead of
calling BiocGenerics::clusterCall() (which *is* the generic).

This would be avoided if clusterCall() was a generic defined in
the parallel package itself (or in a package that parallel depends
on). And this would of course be a better solution than having those
generics in BiocGenerics. Is someone willing to bring that case to
R-devel?

In the meantime I need to define a "clusterEvalQ" method:

  setMethod("clusterEvalQ", "FakeCluster",
      function (cl=NULL, expr)
          clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
  )

And then:

  > clusterEvalQ(mycluster, print(1:6))
  [1] 1 2 3 4 5 6

Finally note that this method I defined for my objects could be made the
default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we
could put it in BiocGenerics. Or, since there is apparently nothing to
win by having clusterEvalQ() being a generic in the first place, we
could redefine clusterEvalQ() as an ordinary function in BiocGenerics.
This function would be implemented *exactly* like
parallel::clusterEvalQ() (and it would mask it), except that now
it would call BiocGenerics::clusterCall() internally.

What should we do?
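The behaviour described above can be reproduced with nothing but base R and the parallel package. What follows is only a minimal sketch, assuming the generic is created locally with setGeneric() as a stand-in for the one BiocGenerics provides; FakeCluster is the toy class from the message, and the names direct and wrapped are purely illustrative.

```r
library(methods)
library(parallel)

# Assumption: promote parallel::clusterCall to an S4 generic in the current
# environment (a stand-in for the generic that BiocGenerics provides).
setGeneric("clusterCall")

# Toy cluster-like class from the message; it does not talk to any workers.
setClass("FakeCluster", representation(nnodes = "integer"))
setMethod("clusterCall", "FakeCluster",
    function(cl = NULL, fun, ...) fun(...)
)

mycluster <- new("FakeCluster", nnodes = 10L)

# Calling the generic directly dispatches to the FakeCluster method.
direct <- clusterCall(mycluster, sum, 1:6)

# parallel::clusterEvalQ() looks up clusterCall inside the parallel name
# space, so it reaches the original non-generic function, which rejects
# the FakeCluster object.
wrapped <- tryCatch(
    parallel::clusterEvalQ(mycluster, sum(1:6)),
    error = function(e) e
)

print(direct)
print(inherits(wrapped, "error"))
```

The point of the sketch is that the wrapper never sees the new generic, no matter where the generic lives, as long as it is not the function bound in the parallel name space.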
We have the identical problem already when we try to use parallel mcmapply on a BioC List (i.e. GRangesList). Witness:

The casual user (ehrm, myself at least) expects that since I can 'lapply' on a BioC GRangesList (or any other List), I should be able to mclapply on it. Sadly the casual user is wrong, and gets an error.

Why? Because parallel::mclapply(X, ...) calls as.list on X, which yields 'Error in as.list.default : no method for coercing this S4 class to a vector'.

But, you say, IRanges defines as.list for Lists, as can be demonstrated by calling as.list(myGRL) on a GRangesList.

Here I yield the floor to someone who can explain why this is so, for I have not studied enough how namespaces/packages/symboltables/whatever work in R. Anyone?
Because parallel:::mclapply does not know about the as.list that is defined in IRanges. If you hijack the function and execute it in the global environment, then it will see the as.list S4 generic that is defined in the IRanges package and correctly dispatch to the IRanges method. However, in the parallel name space only the base S3 as.list generic is visible, so there is no dispatch to the S4 method.
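The lookup rule behind this can be illustrated without any packages. This is just a sketch under simplifying assumptions: two plain environments stand in for the parallel name space and for the user's workspace, ordinary functions stand in for the S3 and S4 generics, and as_items, apply_it, ns and workspace are made-up names.

```r
# Hypothetical stand-in for a package name space.
ns <- new.env(parent = baseenv())
evalq({
    as_items <- function(x) "namespace version"
    # apply_it resolves as_items through its own enclosing environment (ns).
    apply_it <- function(x) as_items(x)
}, ns)

# Hypothetical stand-in for the user's workspace, with its own as_items.
workspace <- new.env(parent = baseenv())
evalq(as_items <- function(x) "workspace version", workspace)

# Called as-is, apply_it only ever sees the definition inside ns.
inside <- ns$apply_it(1)

# "Hijacking" the function: same body, but re-homed into the workspace,
# so the lookup of as_items now starts there.
hijacked <- ns$apply_it
environment(hijacked) <- workspace
outside <- hijacked(1)

print(inside)
print(outside)
```

Nothing about the function body changes between the two calls; only the environment in which the free variable as_items is resolved.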
Regardless, one BAD workaround I found that works is to snarf (tm) the source for mclapply and evaluate it in the global namespace, after prefixing all parallel internal functions with 'parallel:::'. After doing this, the modified mclapply works as one might expect.

So, there is at least an issue regarding how method dispatch works across namespaces. Again I yield the floor, but expect that it can be fixed.

BUT, FURTHERMORE, MCLAPPLY SHOULD NOT COERCE X TO LIST ANYWAY

Why? Because calling `as.list` incurs the overhead of (needlessly!?!) coercing this nice tight GRangesList into a base::list.
I fully agree, and I think I pointed this out a couple of weeks ago in another thread. One suggestion by Vince was to use an index vector as the parallel argument to mclapply, and do the indexing in the function, e.g.

    mclapply(seq_along(granges), function(i, gr) dosomething(gr[[i]]), gr = granges)

as opposed to the more logical

    mclapply(granges, function(x) dosomething(x))

Obviously that fixes the issue, but is also certain to cause a lot of confusion to the naïve user.
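The index-vector pattern can be tried out without Bioconductor installed. The following is a minimal sketch, assuming a toy S4 class (ToyList, a made-up stand-in for GRangesList) that provides only length and [[ methods, and using lapply in place of mclapply so that it runs on any platform.

```r
library(methods)

# Hypothetical list-like S4 class; it is not the real GRangesList.
setClass("ToyList", representation(items = "list"))
setMethod("length", "ToyList", function(x) length(x@items))
setMethod("[[", "ToyList", function(x, i, j, ...) x@items[[i]])

toy <- new("ToyList", items = list(1:3, 4:9, 10L))

# Iterate over positions and do the indexing inside the function, so the
# object itself is never coerced with as.list().
sizes <- lapply(seq_len(length(toy)),
                function(i, obj) length(obj[[i]]),
                obj = toy)

print(unlist(sizes))
```

Swapping lapply for parallel::mclapply should leave the call otherwise unchanged, since the extra argument is simply passed through to the function.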
There is NO REASON for it to be coercing X to a list at all. By my lights, mclapply only needs `length` and `seq_along` defined on X, which ARE ALREADY available to a GRangesList from Vector.

Indeed, commenting out the X<-as.list(X) coercion in mclapply and, lo, it still works on a GRangesList as hoped, and on a 1000 element GRanges list takes ~18x less user time to mclapply(myGRL,length) (and it is even shorter just to use elementLengths, but that is not the point).

In this case the solution appears to be to FIX the upstream package so that method dispatch works correctly (I expect that length and seq_along are only visible to my snarfed mclapply and would suffer from a similar error without addressing the package issue).

Indeed, similarly, in my proposed change to parallel::pvec, I found a simple change that made it work with Vector as well as vector, since Vector implements `[` and `length`.

I still think the solution to getting an SGE (et al.) parallel back-end is to seek to improve the upstream package to make it 'pluggable' for different parallel backends.

I don't think I'm the right person to represent this to R-devel as obviously I am not schooled (yet!?!?) in the workings of S3/S4/signatures/methods/etc.

Herve, I have a hunch that your 'In the mean time' solution is a workaround that has the potential to invite further confusion.

Anyone, as a perhaps related issue, and as an opportunity to educate me, can you explain why untrace does NOT completely work on `lapply` (with BiocGenerics loaded)? Viz:

    trace(lapply)
    untrace(lapply)
    IRanges(1,2)
    IRanges of length 1
    trace: lapply(dots, methods:::.class1)
    ....

--Malcolm
Hi Vince, I tend to agree with your pragmatic approach, but at the same time I have a nagging feeling that this essential part of the language (at least for today's biomedical applications) should stay as close to the R source as possible. Once we embark on our own parallel package and ask package maintainers to switch over to it, we may have a lot of trouble reverting to a fixed 'official' parallel package (or whatever the global solution will be). Obviously anybody can pose the issue to R-core; my only concern is that some voices may carry more weight there than others. This seems to have been recognized as a pressing issue by the Bioconductor community, so maybe we should communicate our views as such. Florian
On 10/25/12 7:36 PM, "Vincent Carey" <stvjc at channing.harvard.edu> wrote:
>if R-core (who afaics maintain parallel) are unwilling to adopt/maintain
>these suggestions, why not write a biocParallel and/or cookParallel
>package
>that does? it seems to me that any interested party can pose the issue to
>r-devel. if no answer is given, we can all learn from the experimental
>alternate package.
>
>On Thu, Oct 25, 2012 at 12:44 PM, Cook, Malcolm <MEC at stowers.org> wrote:
>
>>
>>
>> On 10/24/12 5:08 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>>
>> >Hi,
>> >
>> >With Florian's use case, there seems to be a strong/immediate need for
>> >dispatching on the cluster-like object passed as the 1st argument to
>> >parLapply() and all the other functions in the parallel package that
>> >belong to the "snow family" (14 functions in total, all documented in
>> >?parallel::parLapply). So we've just added those 14 generics to
>> >BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
>> >mclapply(), mcmapply(), and pvec()) for now.
>> >
>> >Note that the 14 new generics dispatch at least on their 1st argument
>> >('cl'), but also on their 2nd argument when this argument is 'x', 'X'
>> >or 'seq' (expected to be a vector-like or matrix-like object). This
>> >opens the door to defining methods that take advantage of the
>> >implementation of particular vector-like or matrix-like objects.
>> >
>> >Also note that, even if some of the 14 functions in the "snow family"
>> >are simple convenience wrappers to other functions in the family, we've
>> >made all of them generics. For example clusterEvalQ() is a simple
>> >wrapper to clusterCall():
>> >
>> > > clusterEvalQ
>> > function (cl = NULL, expr)
>> > clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
>> > <environment: namespace:parallel>
>> >
>> >And it seems (at least intuitively) that implementing a "clusterCall"
>> >method for my cluster-like objects should be enough to have
>> >clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
>> >this is not the case:
>> >
>> > setClass("FakeCluster", representation(nnodes="integer"))
>> >
>> > setMethod("clusterCall", "FakeCluster",
>> >     function (cl=NULL, fun, ...) fun(...)
>> > )
>> >
>> >Then:
>> >
>> > > mycluster <- new("FakeCluster", nnodes=10L)
>> > > clusterCall(mycluster, print, 1:6)
>> > [1] 1 2 3 4 5 6
>> > > clusterEvalQ(mycluster, print(1:6))
>> > Error in checkCluster(cl) : not a valid cluster
>> >
>> >This is because the "clusterEvalQ" default method is calling
>> >parallel::clusterCall() (which is *not* the generic), instead of
>> >calling BiocGenerics::clusterCall() (which *is* the generic).
>> >
>> >This would be avoided if clusterCall() was a generic defined in
>> >the parallel package itself (or in a package that parallel depends
>> >on). And this would of course be a better solution than having those
>> >generics in BiocGenerics. Is someone willing to bring that case to
>> >R-devel?
>> >
>> >In the mean time I need to define a "clusterEvalQ" method:
>> >
>> > setMethod("clusterEvalQ", "FakeCluster",
>> >     function (cl=NULL, expr)
>> >         clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
>> > )
>> >
>> >And then:
>> >
>> > > clusterEvalQ(mycluster, print(1:6))
>> > [1] 1 2 3 4 5 6
>> >
>> >Finally note that this method I defined for my objects could be made the
>> >default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we
>> >could put it in BiocGenerics. Or, since there is apparently nothing to
>> >win by having clusterEvalQ() being a generic in the first place, we
>> >could redefine clusterEvalQ() as an ordinary function in BiocGenerics.
>> >This function would be implemented *exactly* like
>> >parallel::clusterEvalQ() (and it would mask it), except that now
>> >it would call BiocGenerics::clusterCall() internally.
>> >
>> >What should we do?
>>
>> We have the identical problem already when we try to use parallel
>> mcmapply on a BioC List (e.g. GRangesList).
>>
>> Witness:
>>
>> The casual user (ehrm, myself at least) expects that since I can 'lapply'
>> on a BioC GRangesList (or any other List) that I should be able to
>> mclapply on it.
>>
>> Sadly the casual user is wrong, and gets an error.
>>
>> Why?
>>
>> Because parallel::mclapply(X... calls as.list on X.
>>
>> Which yields 'Error in as.list.default : no method for coercing this S4
>> class to a vector'
>>
>> But, you say, IRanges defines as.list for Lists, as can be demonstrated
>> by calling as.list(myGRL) on a GRangesList.
>>
>> Here I yield the floor to someone who can explain why this is so, for I
>> have not studied enough how namespaces/packages/symboltables/whatever
>> work in R.
>>
>> Anyone?
>>
>> Regardless, one BAD workaround I found works is to snarf (tm) the source
>> for mclapply, evaluate it in the global environment, after prefixing all
>> parallel internal functions with 'parallel:::'.
>>
>> After doing this, the modified mclapply works as one might expect.
>>
>> So, there is at least an issue regarding how method dispatch works across
>> namespaces. Again I yield the floor, but expect that it can be fixed.
>>
>> BUT, FURTHERMORE, MCLAPPLY SHOULD NOT COERCE X TO LIST ANYWAY
>>
>> Why? Because calling `as.list` incurs the overhead of (needlessly!?!)
>> coercing this nice tight GRangesList into a base::list.
>>
>> There is NO REASON for it to be coercing X to a list at all. By my
>> lights, mclapply only needs `length` and `seq_along` defined on X, which
>> ARE ALREADY available to a GRangesList from Vector. Indeed, commenting
>> out the X<-as.list(X) coercion in mclapply and, lo, it still works on a
>> GRangesList as hoped, and on a 1000-element GRangesList takes ~18x less
>> user time to mclapply(myGRL,length) (and even less just to use
>> elementLengths, but that is not the point).
>>
>> In this case the solution appears to be to FIX the upstream package so
>> that method dispatch works correctly (I expect that length and seq_along
>> are only visible to my snarfed mclapply and would suffer from a similar
>> error without addressing the package issue).
>>
>> Indeed, similarly, in my proposed change to parallel::pvec, I found a
>> simple change that made it work with Vector as well as vector, since
>> Vector implements `[` and `length`.
>>
>> I still think the solution to getting an SGE (et al.) parallel back-end
>> is to seek to improve the upstream package to make it 'pluggable' for
>> different parallel backends.
>>
>> I don't think I'm the right person to represent this to R-devel as
>> obviously I am not schooled (yet!?!?) in the workings of
>> S3/S4/signatures/methods/etc.
>>
>> Herve, I have a hunch that your 'In the mean time' solution is a
>> workaround that has the potential to invite further confusion.
>>
>> Anyone, as a perhaps related issue, and as an opportunity to educate me,
>> can you explain why untrace does NOT completely work on `lapply` (with
>> BiocGenerics loaded). Viz:
>>
>> trace(lapply)
>> untrace(lapply)
>> IRanges(1,2)
>> IRanges of length 1
>> trace: lapply(dots, methods:::.class1)
>> ....
>>
>>
>> --Malcolm
>>
>>
>>
>>
>>
>>
>> >
>> >H.
>> >
>> >
>> >On 10/24/2012 09:07 AM, Cook, Malcolm wrote:
>> >> On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:
>> >>
>> >>> I agree that it would be fruitful to have parLapply in BiocGenerics. It
>> >>> looks to be a flexible abstraction and its presence in the parallel
>> >>> package makes it ubiquitous. If it hasn't been done already, mclapply
>> >>> (and mcmapply) would be good candidates, as well. The fork-based
>> >>> parallelism is substantively different in terms of the API from the
>> >>> more general parallelism of parLapply.
>> >>>
>> >>> Someone was working on some more robust and convenient wrappers around
>> >>> mclapply. Did that ever see the light of day?
>> >>
>> >>
>> >> If you are referring to
>> >>
>> >> http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660
>> >>
>> >> in which I had offered some small changes to parallel::pvec
>> >>
>> >> https://gist.github.com/3757873/
>> >>
>> >> and after which Martin had provided me with a route I have not (yet?)
>> >> followed toward submitting a patch to R for consideration by R-devel /
>> >> Simon Urbanek in
>> >>
>> >> http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
>> >>
>> >>
>> >>
>> >>
>> >>>>> On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou
>> >>>>> <mailinglist.honeypot at gmail.com> wrote:
>> >>>>>
>> >>>>>> In response to a question from yesterday, I pointed someone to the
>> >>>>>> ShortRead `srapply` function and I wondered to myself why it had to
>> >>>>>> necessarily be "buried" in the ShortRead package (aside from it
>> >>>>>> having a `sr` prefix).
>> >>>>>>
>> >>>>>
>> >>>> I don't know that srapply necessarily 'got it right'...
>> >>
>> >>
>> >> One thing I like about srapply is its support for a reduce argument.
>> >>
>> >>>>>> I had thought it might be a good idea to move that (or something like
>> >>>>>> that) to BiocGenerics (unless implementations aren't allowed there)
>> >>>>>> but also realized that it would add more dependencies where someone
>> >>>>>> might not necessarily need them.
>> >>>>>>
>> >>>>>> But, almost surely, a large majority of the people will be happy to do
>> >>>>>> some form of ||-ization, so in my mind it's not such an onerous thing
>> >>>>>> to add -- on the other hand, this large majority is probably enriched
>> >>>>>> for people who are doing NGS analysis, in which case, keeping it in
>> >>>>>> ShortRead can make some sense.
>> >>
>> >> I remain confused about the need for putting any of this into
>> >> BiocGenerics at all. It seems to me that properly construed
>> >> parallelization primitives ought to 'just work' with any object which
>> >> supports indexing and length.
>> >>
>> >> I would appreciate hearing arguments to the contrary.
>> >>
>> >> Florian, in a similar vein, could we not seek to change
>> >> parallel::makeCluster to be extensible to, say, support an SGE cluster?
>> >> This seems like the 'right thing to do'. ???
>> >>
>> >> Regardless, I think we have raised some considerations that might inform
>> >> improvements to parallel, including points about error handling, reducing
>> >> results, block-level parallelization over List/Vector (in addition to
>> >> vector), etc.
>> >>
>> >> I think perhaps having a google doc that we can collectively edit to
>> >> contain the requirements we are trying to achieve might move us forward
>> >> effectively. Would this help? Or perhaps a page under
>> >> http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
>> >>
>> >>
>> >>>>>> Taking one step back, I recall some chatter last week (or two) about
>> >>>>>> some better ||-ization "primitives" -- something about a pvec doo-dad,
>> >>>>>> and there being ideas to wrap different types of ||-ization behind an
>> >>>>>> easy to use interface (I think this was the convo), and then I took a
>> >>>>>> further step back and often wonder why we just don't bite the bullet
>> >>>>>> and take advantage of the `foreach` infrastructure that is already out
>> >>>>>> there -- in which case, I could imagine a "doSGE" package that might
>> >>>>>> handle the particulars of what Florian is referring to. You could then
>> >>>>>> configure it externally via some `registerDoSGE(some.config.object)`
>> >>>>>> and just have the package code happily run it through `foreach(...)
>> >>>>>> %dopar%` and be done w/ it.
>> >>>>>>
>> >>>>>>
>> >>>>> IMHO it is relevant. I have not looked for other abstractions, and
>> >>>>> this one seems to work. Florian's objectives might be a good test
>> >>>>> case for adequacy.
>> >>>>>
>> >>>>
>> >>>> The registerDoDah does seem to be a useful abstraction.
>> >>
>> >> Is this not more-or-less the intention of parallel::setDefaultCluster?
>> >>
>> >> --Malcolm
>> >>
>> >>
>> >>
>> >>>>
>> >>>> I think there's a lot of work to do for some sort of coordinated
>> >>>> parallelization that putting parLapply into BiocGenerics might
>> >>>> encourage; not good things will happen when everyone in a call stack
>> >>>> tries to parallelize independently. But I'm in favor of parLapply in
>> >>>> BiocGenerics at least for the moment.
>> >>>>
>> >>>> Martin
>> >>>>
>> >>>>
>> >>>>
>> >>>>>
>> >>>>>> ... at least, I thought this is what was being talked about here (and
>> >>>>>> popped up a week or two ago) -- sorry if I completely missed the mark
>> >>>>>> ...
>> >>>>>>
>> >>>>>> -steve
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
>> >>>>>> <florian.hahne at novartis.com> wrote:
>> >>>>>>
>> >>>>>>> Hi Martin,
>> >>>>>>> I could define the generics in my own package, but that would mean
>> >>>>>>> that those will only be available there, or in the global environment
>> >>>>>>> assuming that I also export them, or in all additional packages that
>> >>>>>>> explicitly import them from my name space. Now there already are a
>> >>>>>>> whole bunch of packages around that all allow for parallelization via
>> >>>>>>> a cluster object. Obviously those all import the parLapply function
>> >>>>>>> from the parallel package. That means that I can't simply supply my
>> >>>>>>> own modified cluster object, because the code that calls parLapply
>> >>>>>>> will not know about the generic in my package, even if it is attached.
>> >>>>>>> Ideally parLapply would be a generic function already in the parallel
>> >>>>>>> package. Not sure who needs to be convinced in order for this to
>> >>>>>>> happen, but my gut feeling was that it could be easier to have the
>> >>>>>>> generic in BiocGenerics.
>> >>>>>>> Maybe I am missing something obvious here, but imo there is no way to
>> >>>>>>> overwrite parLapply globally for my own class unless the generic is
>> >>>>>>> imported by everyone who wants to make use of the special method.
>> >>>>>>> Florian
On 10/25/2012 10:36 AM, Vincent Carey wrote:
if R-core (who afaics maintain parallel) are unwilling to adopt/maintain these suggestions, why not write a biocParallel and/or cookParallel package that does? it seems to me that any interested party can pose the issue to r-devel. if no answer is given, we can all learn from the experimental alternate package.
This is certainly worth thinking about. IMO it helps to make the analogy with the DBI world, where RMySQL, RSQLite, RPostgreSQL etc. are plugins that implement DBI-compliant back-ends. With this analogy, BiocParallel (or whatever we call it, cookParallel?) would be the analog of DBI but for cluster back-ends. It could provide built-in support for SNOW clusters but would also make it easy for people to write a BiocParallel-compliant package that implements a specific back-end. That being said, it feels like the parallel package itself should be that BiocParallel package. That is, it should provide the clean parallel abstraction layer that we are aiming for and provide built-in support for SNOW clusters (which it currently does). And the only thing R-core would need to do to make this happen is turn some of their functions into generics. H.
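To make the DBI analogy a bit more concrete, here is a minimal sketch of what such an abstraction layer might look like. It is only an illustration under stated assumptions: the generic name backendLapply and the class ToyBackend are hypothetical and do not exist in parallel, BiocGenerics, or any released package. The toy back-end runs its chunks serially, where a real plug-in would submit each chunk as a job.

```r
library(methods)
library(parallel)

# Hypothetical generic (assumed name): the single entry point that
# back-end packages would write methods for.
setGeneric("backendLapply",
    function(backend, X, FUN, ...) standardGeneric("backendLapply"))

# Built-in back-end: register the S3 cluster classes with S4 and
# delegate SNOW-style clusters to parallel::parLapply().
setOldClass(c("SOCKcluster", "cluster"))
setMethod("backendLapply", "cluster",
    function(backend, X, FUN, ...) parLapply(backend, X, FUN, ...))

# Plug-in back-end (hypothetical class): splits the work into chunks,
# processes each chunk, then reduces the per-chunk results.
setClass("ToyBackend", representation(nchunks = "integer"))
setMethod("backendLapply", "ToyBackend",
    function(backend, X, FUN, ...) {
        chunks <- splitIndices(length(X), backend@nchunks)
        results <- lapply(chunks, function(i) lapply(X[i], FUN, ...))
        do.call(c, results)   # concatenate the per-chunk lists in order
    })

res <- backendLapply(new("ToyBackend", nchunks = 2L), 1:3, function(i) i^2)
```

Client code written against backendLapply() would then accept either kind of object without change, which is the property the DBI comparison is after.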
On Thu, Oct 25, 2012 at 12:44 PM, Cook, Malcolm <MEC at stowers.org> wrote:
On 10/24/12 5:08 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>Hi,
>
>With Florian's use case, there seems to be a strong/immediate need for
>dispatching on the cluster-like object passed as the 1st argument to
>parLapply() and all the other functions in the parallel package that
>belong to the "snow family" (14 functions in total, all documented in
>?parallel::parLapply). So we've just added those 14 generics to
>BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
>mclapply(), mcmapply(), and pvec()) for now.
>
>Note that the 14 new generics dispatch at least on their 1st argument
>('cl'), but also on their 2nd argument when this argument is 'x', 'X'
>or 'seq' (expected to be a vector-like or matrix-like object). This
>opens the door to defining methods that take advantage of the
>implementation of particular vector-like or matrix-like objects.
>
>Also note that, even if some of the 14 functions in the "snow family"
>are simple convenience wrappers to other functions in the family, we've
>made all of them generics. For example clusterEvalQ() is a simple
>wrapper to clusterCall():
>
> > clusterEvalQ
> function (cl = NULL, expr)
> clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
> <environment: namespace:parallel>
>
>And it seems (at least intuitively) that implementing a "clusterCall"
>method for my cluster-like objects should be enough to have
>clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
>this is not the case:
>
> setClass("FakeCluster", representation(nnodes="integer"))
>
> setMethod("clusterCall", "FakeCluster",
>     function (cl=NULL, fun, ...) fun(...)
> )
>
>Then:
>
> > mycluster <- new("FakeCluster", nnodes=10L)
> > clusterCall(mycluster, print, 1:6)
> [1] 1 2 3 4 5 6
> > clusterEvalQ(mycluster, print(1:6))
> Error in checkCluster(cl) : not a valid cluster
>
>This is because the "clusterEvalQ" default method is calling
>parallel::clusterCall() (which is *not* the generic), instead of
>calling BiocGenerics::clusterCall() (which *is* the generic).
>
>This would be avoided if clusterCall() was a generic defined in
>the parallel package itself (or in a package that parallel depends
>on). And this would of course be a better solution than having those
>generics in BiocGenerics. Is someone willing to bring that case to
>R-devel?
>
>In the mean time I need to define a "clusterEvalQ" method:
>
> setMethod("clusterEvalQ", "FakeCluster",
>     function (cl=NULL, expr)
>         clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
> )
>
>And then:
>
> > clusterEvalQ(mycluster, print(1:6))
> [1] 1 2 3 4 5 6
>
>Finally note that this method I defined for my objects could be made the
>default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we
>could put it in BiocGenerics. Or, since there is apparently nothing to
>win by having clusterEvalQ() being a generic in the first place, we
>could redefine clusterEvalQ() as an ordinary function in BiocGenerics.
>This function would be implemented *exactly* like
>parallel::clusterEvalQ() (and it would mask it), except that now
>it would call BiocGenerics::clusterCall() internally.
>
>What should we do?
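The reason clusterEvalQ() keeps reaching the non-generic clusterCall() can be reproduced without any packages. The sketch below uses a plain environment as a stand-in for the parallel namespace. All names in it (pkg, call_it, eval_q) are made up for illustration. It only demonstrates R's lexical lookup of free variables, which is presumably what is at work in the case described above.

```r
# Stand-in for a package namespace holding a function and its wrapper.
pkg <- new.env()
pkg$call_it <- function(cl) "original, non-generic call_it"
pkg$eval_q  <- function(cl) call_it(cl)
environment(pkg$eval_q) <- pkg    # free variables now resolve in 'pkg' first

# Stand-in for a generic of the same name defined later, elsewhere.
call_it <- function(cl) "generic call_it"

# The wrapper inside 'pkg' never sees the new definition ...
from_pkg <- pkg$eval_q("x")

# ... whereas a re-implemented wrapper defined outside 'pkg' does.
eval_q <- function(cl) call_it(cl)
from_global <- eval_q("x")
```

This mirrors the two remedies mentioned above: either the generic lives where the wrapper looks first (inside parallel itself), or the wrapper is re-implemented next to the generic.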
We have the identical problem already when we try to use parallel mcmapply
on a BioC List (e.g. GRangesList).
Witness:
The casual user (ehrm, myself at least) expects that since I can 'lapply'
on a BioC GRangesList (or any other List) that I should be able to
mclapply on it.
Sadly the casual user is wrong, and gets an error.
Why?
Because parallel::mclapply(X... calls as.list on X.
Which yields 'Error in as.list.default : no method for coercing this S4
class to a vector'
But, you say, IRanges defines as.list for Lists, as can be demonstrated by
calling as.list(myGRL) on a GRangesList.
Here I yield the floor to someone who can explain why this is so, for I
have not studied enough how namespaces/packages/symboltables/whatever work
in R.
Anyone?
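A likely explanation, sketched here on a toy class rather than on GRangesList: an S4 method for as.list is attached to an S4 generic that lives where it was created, while code inside a package namespace finds base::as.list first, and that S3 generic does not consult S4 methods. The class name ToyList is an assumption made purely for illustration, and the S3 method at the end is shown as one commonly suggested remedy, not as what IRanges actually does.

```r
library(methods)

setClass("ToyList", representation(items = "list"))

# Creates an S4 generic in the global environment; base::as.list itself
# is left untouched.
setGeneric("as.list")
setMethod("as.list", "ToyList", function(x, ...) x@items)

x <- new("ToyList", items = list(1, 2))
via_s4 <- as.list(x)                  # finds the S4 generic and its method

# Code that resolves as.list in base, as a package namespace would,
# falls through to as.list.default and fails.
via_base <- tryCatch(base::as.list(x), error = function(e) "error")

# An S3 method is visible to the UseMethod() dispatch of the base generic.
as.list.ToyList <- function(x, ...) x@items
via_s3 <- base::as.list(x)
```

If this reading is right, it would also explain why the copy of mclapply evaluated at top level works: there, as.list resolves to the S4 generic.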
Regardless, one BAD workaround I found works is to snarf (tm) the source
for mclapply, evaluate it in the global environment, after prefixing all
parallel internal functions with 'parallel:::'.
After doing this, the modified mclapply works as one might expect.
So, there is at least an issue regarding how method dispatch works across
namespaces. Again I yield the floor, but expect that it can be fixed.
BUT, FURTHERMORE, MCLAPPLY SHOULD NOT COERCE X TO LIST ANYWAY
Why? Because calling `as.list` incurs the overhead of (needlessly!?!)
coercing this nice tight GRangesList into a base::list.
There is NO REASON for it to be coercing X to a list at all. By my
lights, mclapply only needs `length` and `seq_along` defined on X, which
ARE ALREADY available to a GRangesList from Vector. Indeed, commenting
out the X<-as.list(X) coercion in mclapply and, lo, it still works on a
GRangesList as hoped, and on a 1000-element GRangesList takes ~18x less
user time to mclapply(myGRL,length) (and even less just to use
elementLengths, but that is not the point).
In this case the solution appears to be to FIX the upstream package so
that method dispatch works correctly (I expect that length and seq_along
are only visible to my snarfed mclapply and would suffer from a similar
error without addressing the package issue).
Indeed, similarly, in my proposed change to parallel::pvec, I found a
simple change that made it work with Vector as well as vector, since
Vector implements `[` and `length`.
I still think the solution to getting an SGE (et al.) parallel back-end
is to seek to improve the upstream package to make it 'pluggable' for
different parallel backends.
I don't think I'm the right person to represent this to R-devel as
obviously I am not schooled (yet!?!?) in the workings of
S3/S4/signatures/methods/etc.
Herve, I have a hunch that your 'In the mean time' solution is a
workaround that has the potential to invite further confusion.
Anyone, as a perhaps related issue, and as an opportunity to educate me,
can you explain why untrace does NOT completely work on `lapply` (with
BiocGenerics loaded). Viz:
trace(lapply)
untrace(lapply)
IRanges(1,2)
IRanges of length 1
trace: lapply(dots, methods:::.class1)
....
--Malcolm
>
>H.
>
>
>On 10/24/2012 09:07 AM, Cook, Malcolm wrote:
>> On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:
>>
>>> I agree that it would be fruitful to have parLapply in BiocGenerics. It
>>> looks to be a flexible abstraction and its presence in the parallel
>>> package makes it ubiquitous. If it hasn't been done already, mclapply
>>> (and mcmapply) would be good candidates, as well. The fork-based
>>> parallelism is substantively different in terms of the API from the
>>> more general parallelism of parLapply.
>>>
>>> Someone was working on some more robust and convenient wrappers around
>>> mclapply. Did that ever see the light of day?
>>
>>
>> If you are referring to
>>
>>http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660
>>
>> in which I had offered some small changes to parallel::pvec
>>
>> https://gist.github.com/3757873/
>>
>> and after which Martin had provided me with a route I have not (yet?)
>> followed toward submitting a patch to R for consideration by R-devel /
>> Simon Urbanek in
>>
>> http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
>>
>>
>>
>>
>>>>> On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou
>>>>> <mailinglist.honeypot at gmail.com> wrote:
>>>>>
>>>>>> In response to a question from yesterday, I pointed someone to the
>>>>>> ShortRead `srapply` function and I wondered to myself why it had to
>>>>>> necessarily be "buried" in the ShortRead package (aside from it
>>>>>> having a `sr` prefix).
>>>>>>
>>>>>
>>>> I don't know that srapply necessarily 'got it right'...
>>
>>
>> One thing I like about srapply is its support for a reduce argument.
>>
>>>>>> I had thought it might be a good idea to move that (or something like
>>>>>> that) to BiocGenerics (unless implementations aren't allowed there)
>>>>>> but also realized that it would add more dependencies where someone
>>>>>> might not necessarily need them.
>>>>>>
>>>>>> But, almost surely, a large majority of the people will be happy to do
>>>>>> some form of ||-ization, so in my mind it's not such an onerous thing
>>>>>> to add -- on the other hand, this large majority is probably enriched
>>>>>> for people who are doing NGS analysis, in which case, keeping it in
>>>>>> ShortRead can make some sense.
>>
>> I remain confused about the need for putting any of this into
>> BiocGenerics at all. It seems to me that properly construed
>> parallelization primitives ought to 'just work' with any object which
>> supports indexing and length.
>>
>> I would appreciate hearing arguments to the contrary.
>>
>> Florian, in a similar vein, could we not seek to change
>> parallel::makeCluster to be extensible to, say, support an SGE cluster?
>> This seems like the 'right thing to do'. ???
>>
>> Regardless, I think we have raised some considerations that might inform
>> improvements to parallel, including points about error handling, reducing
>> results, block-level parallelization over List/Vector (in addition to
>> vector), etc.
>>
>> I think perhaps having a google doc that we can collectively edit to
>> contain the requirements we are trying to achieve might move us forward
>> effectively. Would this help? Or perhaps a page under
>> http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
>>
>>
>>>>>> Taking one step back, I recall some chatter last week (or two) about
>>>>>> some better ||-ization "primitives" -- something about a pvec doo-dad,
>>>>>> and there being ideas to wrap different types of ||-ization behind an
>>>>>> easy to use interface (I think this was the convo), and then I took a
>>>>>> further step back and often wonder why we just don't bite the bullet
>>>>>> and take advantage of the `foreach` infrastructure that is already out
>>>>>> there -- in which case, I could imagine a "doSGE" package that might
>>>>>> handle the particulars of what Florian is referring to. You could then
>>>>>> configure it externally via some `registerDoSGE(some.config.object)`
>>>>>> and just have the package code happily run it through `foreach(...)
>>>>>> %dopar%` and be done w/ it.
>>>>>>
>>>>>>
>>>>> IMHO it is relevant. I have not looked for other abstractions, and
>>>>> this one seems to work. Florian's objectives might be a good test
>>>>> case for adequacy.
>>>>>
>>>>
>>>> The registerDoDah does seem to be a useful abstraction.
>>
>> Is this not more-or-less the intention of parallel::setDefaultCluster?
>>
>> --Malcolm
>>
>>
>>
>>>>
>>>> I think there's a lot of work to do for some sort of coordinated
>>>> parallelization that putting parLapply into BiocGenerics might
>>>> encourage;
>>>> not good things will happen when everyone in a call stack tries to
>>>> parallelize independently. But I'm in favor of parLapply in
>>>> BiocGenerics at
>>>> least for the moment.
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>>>
>>>>>> ... at least, I thought this is what was being talked about here
>>>>>> (and popped up a week or two ago) -- sorry if I completely missed
>>>>>> the mark ...
>>>>>>
>>>>>> -steve
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
>>>>>> <florian.hahne at novartis.com> wrote:
>>>>>>
>>>>>>> Hi Martin,
>>>>>>> I could define the generics in my own package, but that would mean
>>>>>>> that those will only be available there, or in the global
>>>>>>> environment assuming that I also export them, or in all additional
>>>>>>> packages that explicitly import them from my name space. Now there
>>>>>>> already are a whole bunch of packages around that all allow for
>>>>>>> parallelization via a cluster object. Obviously those all import
>>>>>>> the parLapply function from the parallel package. That means that I
>>>>>>> can't simply supply my own modified cluster object, because the
>>>>>>> code that calls parLapply will not know about the generic in my
>>>>>>> package, even if it is attached. Ideally parLapply would be a
>>>>>>> generic function already in the parallel package. Not sure who
>>>>>>> needs to be convinced in order for this to happen, but my gut
>>>>>>> feeling was that it could be easier to have the generic in
>>>>>>> BiocGenerics.
>>>>>>> Maybe I am missing something obvious here, but imo there is no way
>>>>>>> to overwrite parLapply globally for my own class unless the generic
>>>>>>> is imported by everyone who wants to make use of the special
>>>>>>> method.
>>>>>>> Florian
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
>>>>>>>
>>>>>>> On 10/17/2012 05:45 AM, Hahne, Florian wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I was wondering whether it would be possible to have proper
>>>>>>>>> generics for some of the functions in the parallel package, e.g.
>>>>>>>>> parLapply and clusterCall. The reason I am asking is because I
>>>>>>>>> want to build an S4 class that essentially looks like an S3
>>>>>>>>> cluster object but knows how to deal with the SGE. That way I can
>>>>>>>>> abstract away all the overhead regarding job submission, job
>>>>>>>>> status and reducing the results in the parLapply method of that
>>>>>>>>> class, and would be able to supply this new cluster object to all
>>>>>>>>> of my existing functions that can be processed in parallel using
>>>>>>>>> a cluster object as input. I have played around with the
>>>>>>>>> BatchJobs package as an abstraction layer to SGE and that works
>>>>>>>>> nicely. As a test case I have created the necessary generics
>>>>>>>>> myself in order to supply my own SGEcluster object to a function
>>>>>>>>> that normally deals with the "regular" parallel package S3
>>>>>>>>> cluster objects and everything just worked out of the box, but
>>>>>>>>> obviously this fails once I am in a name space and my generic is
>>>>>>>>> not found anymore. Of course what we would really want is some
>>>>>>>>> proper abstraction of parallelization in R, but for now this
>>>>>>>>> seems to be at least a cheap compromise. Any thoughts on this?
>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Florian -- we talked about this locally, but I guess we didn't
>>>>>>>> actually send any email!
>>>>>>>>
>>>>>>>> Is there an obstacle to promoting these to generics in your own
>>>>>>>> package? The usual motivation for inclusion in BiocGenerics has
>>>>>>>> been to avoid conflicts between packages, but I'm not sure whether
>>>>>>>> this is the case (yet)? This would also add a dependency fairly
>>>>>>>> deep in the hierarchy.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>> Florian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>
>>>>>>>> Location: Arnold Building M1 B861
>>>>>>>> Phone: (206) 667-2793
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Steve Lianoglou
>>>>>> Graduate Student: Computational Systems Biology
>>>>>> | Memorial Sloan-Kettering Cancer Center
>>>>>> | Weill Medical College of Cornell University
>>>>>> Contact Info:
>>>>>>
>>>>>> http://cbio.mskcc.org/~lianos/contact
>>>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>
>--
>Hervé Pagès
>
>Program in Computational Biology
>Division of Public Health Sciences
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N, M1-B514
>P.O. Box 19024
>Seattle, WA 98109-1024
>
>E-mail: hpages at fhcrc.org
>Phone: (206) 667-5791
>Fax: (206) 667-1319
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
I might be willing with a little more ammo. But I need some guidance/education first. Let's see if the following question helps me get it from you (you know who you are), or gets my head bitten off in this forum.
Question: Why do we have BiocGenerics at all?
I notice for instance that the definition of Reduce that it provides is different in only one respect from base::Reduce. Namely, base::Reduce coerces X to a list using as.list if it is an object whereas BiocGenerics::Reduce does not. Otherwise they are identical. As a result, base::Reduce(myGRangesList) fails with "no method for coercing this S4 class to a vector".
But, similarly to my argument regarding pvec and mclapply, is not the "right thing to do" to FIX base::Reduce to NOT do this coercion instead of introducing a new generic? If it didn't do this coercion then it would work with anything which implements `[[` and `seq` (including Vector).
Why introduce generics for things that are defined in terms of sequence access primitives? Is there a good reason I am missing?
If you agree, then there are (at least) two upstream packages to 'fix': (1) Functional and (2) parallel.
Do you agree, or can you educate me otherwise?
--Malcolm
From: Tim Triche <tim.triche at gmail.com>
Reply-To: "ttriche at usc.edu" <ttriche at usc.edu>
Date: Thu, 25 Oct 2012 13:21:46 -0500
To: Florian Hahne <florian.hahne at novartis.com>
Cc: Hervé Pagès <hpages at fhcrc.org>, Malcolm Cook <mec at stowers.org>, Michael Lawrence <lawrence.michael at gene.com>, "bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] parallel package generics
+1 although this time around I would prefer if someone else would stick their neck out :-)
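Malcolm's argument about sequence-access primitives can be made concrete with a small sketch. This is not the code of base::Reduce or BiocGenerics::Reduce; the class `Blocks` and the function `reduce_by_index` are hypothetical names invented here, and the only assumption is that the container provides `length` and `[[` methods.

```r
library(methods)

# Hypothetical container: wraps a list but is not itself a list.
setClass("Blocks", representation(elts = "list"))
setMethod("length", "Blocks", function(x) length(x@elts))
setMethod("[[", "Blocks", function(x, i, j, ...) x@elts[[i]])

# Left fold that touches 'x' only through length() and `[[`.
# It never calls as.list(), so no coercion method is required.
reduce_by_index <- function(f, x) {
  n <- length(x)
  if (n == 0L) return(NULL)
  acc <- x[[1L]]
  for (i in seq_len(n)[-1L]) acc <- f(acc, x[[i]])
  acc
}

b <- new("Blocks", elts = list(1:3, 4:6, 7:9))
total <- reduce_by_index(`+`, b)   # element-wise sum of the three blocks
```

The same function also accepts an ordinary list, which is the sense in which such a primitive would "just work" with anything that supports indexing and length.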
On Thu, Oct 25, 2012 at 11:12 AM, Hahne, Florian <florian.hahne at novartis.com> wrote:
For me the cleanest option with the least impact would be to have this fixed directly in the parallel package. However I think that somebody with more influence should suggest that to R-devel. If they will not do it, the other options seem all more or less equivalent to me. Florian --
On 10/25/12 12:08 AM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
Hi,
With Florian's use case, there seems to be a strong/immediate need for
dispatching on the cluster-like object passed as the 1st argument to
parLapply() and all the other functions in the parallel package that
belong to the "snow family" (14 functions in total, all documented in
?parallel::parLapply). So we've just added those 14 generics to
BiocGenerics 0.5.1. We're postponing the "multicore family" (i.e.
mclapply(), mcmapply(), and pvec()) for now.
Note that the 14 new generics dispatch at least on their 1st argument
('cl'), but also on their 2nd argument when this argument is 'x', 'X'
or 'seq' (expected to be a vector-like or matrix-like object). This
opens the door to defining methods that take advantage of the
implementation of particular vector-like or matrix-like objects.
Also note that, even if some of the 14 functions in the "snow family"
are simple convenience wrappers to other functions in the family, we've
made all of them generics. For example clusterEvalQ() is a simple
wrapper to clusterCall():
> clusterEvalQ
function (cl = NULL, expr)
clusterCall(cl, eval, substitute(expr), env = .GlobalEnv)
<environment: namespace:parallel>
And it seems (at least intuitively) that implementing a "clusterCall"
method for my cluster-like objects should be enough to have
clusterEvalQ() work out-of-the-box on those objects. But, sadly enough,
this is not the case:
setClass("FakeCluster", representation(nnodes="integer"))
setMethod("clusterCall", "FakeCluster",
    function (cl=NULL, fun, ...) fun(...)
)
Then:
> mycluster <- new("FakeCluster", nnodes=10L)
> clusterCall(mycluster, print, 1:6)
[1] 1 2 3 4 5 6
> clusterEvalQ(mycluster, print(1:6))
Error in checkCluster(cl) : not a valid cluster
This is because the "clusterEvalQ" default method is calling
parallel::clusterCall() (which is *not* the generic), instead of
calling BiocGenerics::clusterCall() (which *is* the generic).
This would be avoided if clusterCall() was a generic defined in
the parallel package itself (or in a package that parallel depends
on). And this would of course be a better solution than having those
generics in BiocGenerics. Is someone willing to bring that case to
R-devel?
In the meantime I need to define a "clusterEvalQ" method:
setMethod("clusterEvalQ", "FakeCluster",
    function (cl=NULL, expr)
        clusterCall(cl, eval, substitute(expr), env=.GlobalEnv)
)
And then:
> clusterEvalQ(mycluster, print(1:6))
[1] 1 2 3 4 5 6
Finally note that this method I defined for my objects could be made the default "clusterEvalQ" method (i.e. the clusterEvalQ,ANY method) and we could put it in BiocGenerics. Or, since there is apparently nothing to win by having clusterEvalQ() being a generic in the first place, we could redefine clusterEvalQ() as an ordinary function in BiocGenerics. This function would be implemented *exactly* like parallel::clusterEvalQ() (and it would mask it), except that now it would call BiocGenerics::clusterCall() internally.
What should we do?
H.
On 10/24/2012 09:07 AM, Cook, Malcolm wrote:
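The lookup problem described above can be reproduced without the parallel package. The following is only a sketch: the environment `ns` is a hypothetical stand-in for a package namespace, and the two functions inside it are simplified imitations, not the real parallel::clusterCall() and parallel::clusterEvalQ().

```r
library(methods)

# 'ns' plays the role of a package namespace (hypothetical stand-in).
# Functions created inside it resolve free variables in 'ns' first.
ns <- new.env()
local({
  clusterCall  <- function(cl, fun, ...) stop("not a valid cluster")
  clusterEvalQ <- function(cl, expr) clusterCall(cl, eval, substitute(expr))
}, envir = ns)

# A generic with the same name, defined outside 'ns', plus a method.
setClass("FakeCluster", representation(nnodes = "integer"))
setGeneric("clusterCall", function(cl, fun, ...) standardGeneric("clusterCall"))
setMethod("clusterCall", "FakeCluster", function(cl, fun, ...) fun(...))

mycluster <- new("FakeCluster", nnodes = 10L)

# Calling the generic directly dispatches to the FakeCluster method.
direct <- clusterCall(mycluster, sum, 1:6)

# The wrapper inside 'ns' still sees the non-generic clusterCall.
wrapped <- tryCatch(ns$clusterEvalQ(mycluster, sum(1:6)),
                    error = function(e) conditionMessage(e))

# A wrapper defined next to the generic calls the generic instead.
clusterEvalQ <- function(cl, expr)
  clusterCall(cl, eval, substitute(expr), envir = globalenv())
fixed <- clusterEvalQ(mycluster, sum(1:6))
```

The last definition corresponds to the option of reimplementing clusterEvalQ() as an ordinary function that sits alongside the generic.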
On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:
I agree that it would be fruitful to have parLapply in BiocGenerics. It looks to be a flexible abstraction and its presence in the parallel package makes it ubiquitous. If it hasn't been done already, mclapply (and mcmapply) would be good candidates, as well. The fork-based parallelism is substantively different in terms of the API from the more general parallelism of parLapply. Someone was working on some more robust and convenient wrappers around mclapply. Did that ever see the light of day?
If you are referring to http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660 in which I had offered some small changes to parallel::pvec https://gist.github.com/3757873/ and after which Martin had provided me with a route I have not (yet?) followed toward submitting a patch to R for consideration by R-devel / Simon Urbanek in http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c
On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:
In response to a question from yesterday, I pointed someone to the ShortRead `srapply` function and I wondered to myself why it had to necessarily be "buried" in the ShortRead package (aside from it having a `sr` prefix).
I don't know that srapply necessarily 'got it right'...
One thing I like about srapply is its support for a reduce argument.
I had thought it might be a good idea to move that (or something like that) to BiocGenerics (unless implementations aren't allowed there) but also realized that it would add more dependencies where someone might not necessarily need them.
But, almost surely, a large majority of the people will be happy to do some form of ||-ization, so in my mind it's not such an onerous thing to add -- on the other hand, this large majority is probably enriched for people who are doing NGS analysis, in which case, keeping it in ShortRead can make some sense.
I remain confused about the need for putting any of this into BiocGenerics at all. It seems to me that properly construed parallelization primitives ought to 'just work' with any object which supports indexing and length. I would appreciate hearing arguments to the contrary.
Florian, in a similar vein, could we not seek to change parallel::makeCluster to be extensible to, say, support SGE cluster? This seems like the 'right thing to do'. ???
Regardless, I think we have raised some considerations that might inform improvements to parallel, including points about error handling, reducing results, block-level parallelization over List/Vector (in addition to vector), etc.
I think perhaps having a google doc that we can collectively edit to contain the requirements we are trying to achieve might move us forward effectively. Would this help? Or perhaps a page under http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ???
Taking one step back, I recall some chatter last week (or two) about some better ||-ization "primitives" -- something about a pvec doo-dad, and there being ideas to wrap different types of ||-ization behind an easy to use interface (I think this was the convo), and then I took a further step back and often wonder why we just don't bite the bullet and take advantage of the `foreach` infrastructure that is already out there -- in which case, I could imagine a "doSGE" package that might handle the particulars of what Florian is referring to. You could then configure it externally via some `registerDoSGE(some.config.object)` and just have the package code happily run it through `foreach(...) %dopar%` and be done w/ it.
IMHO it is relevant. I have not looked for other abstractions, and this one seems to work. Florian's objectives might be a good test case for adequacy.
The registerDoDah does seem to be a useful abstraction.
Is this not more-or-less the intention of parallel::setDefaultCluster? --Malcolm
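The "register a backend once, let package code stay backend-agnostic" idea discussed here can be sketched in a few lines of base R. Everything below is hypothetical: `register_backend` and `par_apply` are names invented for this illustration and are not part of foreach, parallel, or BiocGenerics.

```r
# Hypothetical registry holding the lapply-like function used for work.
.backend <- new.env()
assign("apply_fun", lapply, envir = .backend)   # sequential fallback

# Called once by the user, outside of any package code.
register_backend <- function(apply_fun) {
  stopifnot(is.function(apply_fun))
  assign("apply_fun", apply_fun, envir = .backend)
  invisible(NULL)
}

# Package code calls this and never names a concrete backend.
par_apply <- function(X, FUN, ...) {
  get("apply_fun", envir = .backend)(X, FUN, ...)
}

squares_seq <- par_apply(1:4, function(i) i^2)

# Any function with lapply's calling convention can be registered,
# for example one that submits jobs to SGE. Here: a toy backend that
# only counts how often it is used.
calls <- 0L
register_backend(function(X, FUN, ...) {
  calls <<- calls + 1L
  lapply(X, FUN, ...)
})
squares_reg <- par_apply(1:4, function(i) i^2)
```

Because the registry is a single global slot, nested calls to `par_apply` would all use the same backend, which touches on the concern about everyone in a call stack parallelizing independently.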
I think there's a lot of work to do for some sort of coordinated parallelization that putting parLapply into BiocGenerics might encourage; not good things will happen when everyone in a call stack tries to parallelize independently. But I'm in favor of parLapply in BiocGenerics at least for the moment. Martin
... at least, I thought this is what was being talked about here (and popped up a week or two ago) -- sorry if I completely missed the mark ...
-steve
On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian <florian.hahne at novartis.com> wrote:
Hi Martin,
I could define the generics in my own package, but that would mean that those will only be available there, or in the global environment assuming that I also export them, or in all additional packages that explicitly import them from my name space. Now there already are a whole bunch of packages around that all allow for parallelization via a cluster object. Obviously those all import the parLapply function from the parallel package. That means that I can't simply supply my own modified cluster object, because the code that calls parLapply will not know about the generic in my package, even if it is attached. Ideally parLapply would be a generic function already in the parallel package. Not sure who needs to be convinced in order for this to happen, but my gut feeling was that it could be easier to have the generic in BiocGenerics.
Maybe I am missing something obvious here, but imo there is no way to overwrite parLapply globally for my own class unless the generic is imported by everyone who wants to make use of the special method.
Florian
--
On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
On 10/17/2012 05:45 AM, Hahne, Florian wrote:
Hi all, I was wondering whether it would be possible to have proper generics for some of the functions in the parallel package, e.g. parLapply and clusterCall. The reason I am asking is because I want to build an S4 class that essentially looks like an S3 cluster object but knows how to deal with the SGE. That way I can abstract away all the overhead regarding job submission, job status and reducing the results in the parLapply method of that class, and would be able to supply this new cluster object to all of my existing functions that can be processed in parallel using a cluster object as input. I have played around with the BatchJobs package as an abstraction layer to SGE and that works nicely. As a test case I have created the necessary generics myself in order to supply my own SGEcluster object to a function that normally deals with the "regular" parallel package S3 cluster objects and everything just worked out of the box, but obviously this fails once I am in a name space and my generic is not found anymore. Of course what we would really want is some proper abstraction of parallelization in R, but for now this seems to be at least a cheap compromise. Any thoughts on this?
Hi Florian -- we talked about this locally, but I guess we didn't actually send any email! Is there an obstacle to promoting these to generics in your own package? The usual motivation for inclusion in BiocGenerics has been to avoid conflicts between packages, but I'm not sure whether this is the case (yet)? This would also add a dependency fairly deep in the hierarchy. What do you think? Martin Florian
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
--
A model is a lie that helps you see the truth.
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
On 10/25/2012 07:53 PM, Cook, Malcolm wrote:
Namely, base::Reduce coerces X to a list using as.list if it is an object whereas BiocGenerics::Reduce does not. Otherwise they are identical. As a result, base::Reduce(myGRangesList) fails with "no method for coercing this S4 class to a vector".
Well, so when you put it this way... I was wondering why there is no
as.list.List
i.e., an S3 method for List on the generic as.list? This seems to be consistent
with the recommendation on ?Methods under 'Methods for S3 Generic Functions'
(and this little hack seems to allow both Reduce and mclapply to 'work').
Index: NAMESPACE
===================================================================
--- NAMESPACE (revision 70700)
+++ NAMESPACE (working copy)
@@ -332,3 +332,4 @@
expand
)
+S3method(as.list, List)
Index: R/List-class.R
===================================================================
--- R/List-class.R (revision 70700)
+++ R/List-class.R (working copy)
@@ -442,10 +442,11 @@
### Coercion.
###
+as.list.List <-
+ function(x, ...) lapply(x, identity)
+
setAs("List", "list", function(from) as.list(from))
-setMethod("as.list", "List", function(x, ...) lapply(x, identity))
-
setGeneric("as.env", function(x, ...) standardGeneric("as.env"))
setMethod("as.env", "List",
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
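The effect of the patch above can be tried outside IRanges with a toy class. This is a sketch under stated assumptions: `MyList` is a hypothetical stand-in for the List class, and `registerS3method()` is used as the script-level counterpart of the `S3method(as.list, List)` line that the patch adds to NAMESPACE.

```r
library(methods)

# Hypothetical stand-in for an S4 List-like class.
setClass("MyList", representation(elts = "list"))
setMethod("length", "MyList", function(x) length(x@elts))

obj <- new("MyList", elts = list(1, 2, 3))

# Without an S3 as.list method, base::Reduce cannot coerce the object.
before <- tryCatch(Reduce(`+`, obj), error = function(e) "error")

# S3 method on the as.list generic, registered so that code inside the
# base namespace (such as Reduce) finds it.
as.list.MyList <- function(x, ...) x@elts
registerS3method("as.list", "MyList", as.list.MyList)

after <- Reduce(`+`, obj)   # Reduce now coerces via as.list.MyList
```

Note that this route still builds a full list before reducing, which is the inefficiency raised in the follow-up message.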
On 10/25/2012 07:53 PM, Cook, Malcolm wrote:
Question: Why do we have BiocGenerics at all? I notice for instance that the definition of Reduce that it provides is different in only one respect from base::Reduce.
For most of the generics the motivation was different -- different packages would independently implement S4 generics and methods on them, the generics from different packages would mask one another, and the user would be confused when the wrong method was chosen. This could still be a case of 'fix it upstream'. The upstream fix is to make S4 generics of common / all functions. This has costs, in terms of performance and perhaps other issues. Also, the stats4 package is an attempt at an 'upstream' fix where common statistical functions are made into S4 generics. It could be the case that some methods could be avoided by appropriate (re)-definition of the default. Martin
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
Martin,
Great. Nice, and makes sense, and should be added to R/List-class.
But, but...
It is 'a good thing' that it makes Reduce and mclapply 'work'. But they work in an inefficient manner. Modifying base::Reduce and parallel::mclapply as I suggest (to NOT coerce to list at all) makes them VERY MUCH faster, and does away with the need for BioC Generics.
Do you agree that these changes should be made in 'upstream' packages?
--Malcolm