dear r experts---Is there a multicore equivalent of by(), just like mclapply() is the multicore equivalent of lapply()? If not, is there a fast way to convert a data.table into a list, based on a column, that lapply() and mclapply() can consume? advice appreciated... as always. regards, /iaw ---- Ivo Welch (ivo.welch at gmail.com)
multicore by(), like mclapply?
8 messages · ivo welch, Joshua Wiley, Matt Dowle +2 more
Hi Ivo,
My suggestion would be to pass lapply() (or mclapply()) only the indices.
That should be fast, subsetting with data.table should also be fast,
and then you can do whatever computations you need. For example:
require(data.table)
DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
setkey(DT, x)
lapply(as.character(unique(DT[,x])), function(i) DT[i])
the DT[i] object is the subset of the data table you want. You can
pass this to whatever function you need for your computations.
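A parallel variant of that sketch might look like the following (a sketch, not the poster's code: mclapply() is shown from package parallel, where it lives in current R — at the time of this thread it was in package multicore — and the mean(y) summary and mc.cores value are just placeholders):

```r
library(data.table)
library(parallel)  # mclapply() was in package multicore when this thread was written

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
setkey(DT, x)

# each fork binary-searches the keyed table for its own group only
res <- mclapply(as.character(unique(DT[, x])),
                function(i) DT[i][, mean(y)],
                mc.cores = 2)
```

Because DT is keyed on x, the DT[i] subset is a binary search rather than a vector scan.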
Hope this helps,
Josh
On Mon, Oct 10, 2011 at 10:41 AM, ivo welch <ivo.welch at gmail.com> wrote:
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Joshua Wiley Ph.D. Student, Health Psychology Programmer Analyst II, ATS Statistical Consulting Group University of California, Los Angeles https://joshuawiley.com/
Package plyr has .parallel. Searching datatable-help for "multicore", say on Nabble here, http://r.789695.n4.nabble.com/datatable-help-f2315188.html yields three relevant posts and examples. Please check the wiki do's and don'ts to make sure you didn't fall into one of those traps, though (we don't know your data or task, so this is just guessing): http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table HTH, Matthew
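As a sketch of the plyr route (assumptions flagged: this needs the foreach package with a parallel backend such as doMC registered first, and the data frame and mean summary below are invented placeholders):

```r
library(plyr)
library(doMC)           # foreach backend that forks via multicore
registerDoMC(cores = 2)

df <- data.frame(g = rep(letters[1:3], each = 3), y = 1:9)

# .parallel = TRUE hands each per-group piece to the registered foreach backend
res <- ddply(df, .(g), summarise, m = mean(y), .parallel = TRUE)
```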
hi josh---thx. I had a different version of this, and discarded it because I think it was very slow. the reason is that on each application, your version has to scan my (very long) data vector. (I have many thousand different cases, too.) I presume that by() has one scan through the vector that makes all splits. regards, /iaw ---- Ivo Welch (ivo.welch at gmail.com)
On Tue, Oct 11, 2011 at 7:54 AM, ivo welch <ivo.welch at gmail.com> wrote:
by.data.frame() is basically a wrapper for tapply(), and the key line in tapply() is ans <- lapply(split(X, group), FUN, ...), which should be easy to adapt for mclapply().
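That one-line adaptation could be sketched as follows (the data and grouping vector are invented for illustration; the point is that split() scans the grouping vector once, so the forks never rescan the full data):

```r
library(parallel)  # home of mclapply() in current R; it was in package multicore in 2011

X <- data.frame(y = 1:9)
group <- rep(c("a", "b", "c"), each = 3)

# split() partitions X by group in a single pass;
# mclapply() then runs FUN on the pieces in parallel, as in tapply()'s key line
ans <- mclapply(split(X, group), function(d) mean(d$y), mc.cores = 2)
```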
Thomas Lumley Professor of Biostatistics University of Auckland
I could be way off base here, but my concern about presplitting the data is that you will have your data, plus a second copy of your data as something like a list where each element contains the portion of the data for that split. Good speed-wise, bad memory-wise. My hope with the technique I showed (again, I may not have accomplished it) was to only have, at any one time, the original data and a copy of the particular elements being worked with. Of course this is not an issue if you have plenty of memory.
This is the sort of thing that should be measured, rather than
speculated about, but if you're using multicore all those subsets can
be made at the same time, not sequentially, so they add up to a copy
of the whole data. Using data.table rather than a data.frame would
help, of course.
I would guess that splitting, garbage collecting, and then forking
would be most efficient -- reducing the chance that all the separate
processes end up separately garbage collecting the results of the
split.
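A sketch of that ordering (the data and per-group sum are placeholders; the point is only that split() and gc() happen once, in the parent, before any forks are created):

```r
library(parallel)

df <- data.frame(g = rep(c("a", "b"), each = 4), y = 1:8)

sp <- split(df, df$g)   # one pass over the data, in the parent process
gc()                    # collect the split's garbage before forking, so the
                        # children don't each garbage-collect it separately
res <- mclapply(sp, function(d) sum(d$y), mc.cores = 2)
```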
It's a pity that forking messes up the profilers; makes it harder to
measure these things.
-thomas
On Mon, Oct 10, 2011 at 4:14 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
That's exactly what plyr does behind the scenes. Hadley
Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/