
Message-ID: <C980CE99-CB7E-43FD-869B-08BD4B1702D6@gmail.com>
Date: 2011-10-10T20:14:49Z
From: Joshua Wiley
Subject: multicore by(), like mclapply?
In-Reply-To: <CAJ55+dL_=oMbszz8KbxS3P=nhDYZR1432nbNhgqadvg3rWT39w@mail.gmail.com>

I could be way off base here, but my concern about presplitting the data is that you end up holding your data plus a second copy of it, something like a list where each element contains the portion of the data for that split. Good speed-wise, bad memory-wise. My hope with the technique I showed (again, I may not have accomplished it) was to hold at any one time only the original data and a copy of the particular elements being worked on. Of course, this is not an issue if you have plenty of memory.
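
One possible middle ground, as a rough sketch only: split the row indices rather than the data itself. That is still a single pass over the grouping variable, but the extra object is just a list of integer vectors, and each worker subsets only its own piece. The toy data frame `d`, its columns `g` and `x`, and the use of mean() are made up for illustration. mclapply() is loaded here from the parallel package, which is where it lives in current R; at the time of this thread it was in the multicore package. I have not benchmarked this against your data.

```r
library(parallel)  # provides mclapply() in current R (formerly in 'multicore')

# Hypothetical toy data standing in for the real, much longer data
set.seed(1)
d <- data.frame(g = sample(letters[1:4], 1000, replace = TRUE),
                x = rnorm(1000))

# One pass over the grouping variable, splitting row indices instead of data
idx <- split(seq_len(nrow(d)), d$g)

# Forking is not available on Windows, so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each task subsets only the rows it needs, then applies the function
res <- mclapply(idx, function(i) mean(d$x[i]), mc.cores = cores)
```

The result is a named list with one element per group, in the same order as the factor levels, so unlist(res) should line up with what tapply(d$x, d$g, mean) returns.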

On Oct 10, 2011, at 12:19, Thomas Lumley <tlumley at uw.edu> wrote:

> On Tue, Oct 11, 2011 at 7:54 AM, ivo welch <ivo.welch at gmail.com> wrote:
>> hi josh---thx.  I had a different version of this, and discarded it
>> because I think it was very slow.  the reason is that on each
>> application, your version has to scan my (very long) data vector.  (I
>> have many thousand different cases, too.)  I presume that by() has one
>> scan through the vector that makes all splits.
> 
> by.data.frame() is basically a wrapper for tapply(), and the key line
> in tapply() is
>   ans <- lapply(split(X, group), FUN, ...)
> which should be easy to adapt for mclapply.
> 
> -- 
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland