Hi all, I'm using mclapply() of the multicore package for processing chunks of data in parallel -- and it works great. But when I want to collect all the processed elements of the returned list into one big data frame, it takes ages. The elements are all data frames with identical column names, and I'm using a simple rbind() inside a loop to do that. But I guess it performs some expensive checks at each iteration, since it gets slower and slower as it goes. Writing the individual pieces out to disk, concatenating them with the system and reading the resulting file back in is actually faster... Is there a magic argument to rbind() that I'm missing, or is there any other solution to collect the results of parallel processing efficiently? Thanks, Vincent
multicore package: collecting results
8 messages · Vincent Aubanel, Ben Bolker, Mike Lawrence, Simon Urbanek
On 06/29/2011 02:34 PM, Vincent Aubanel wrote:
Why do you have to write to disk? Can you collect the results as a list L and then do.call(rbind,L) in one go?
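[To make the difference concrete, here is a minimal sketch on hypothetical data: growing a data frame with rbind() inside a loop copies the accumulated result at every iteration (quadratic cost overall), while a single do.call(rbind, ...) over the collected list binds all the pieces in one pass.]

    ## Synthetic stand-in for the mclapply() result: a list of data frames
    chunks <- lapply(1:100, function(i)
      data.frame(id = i, x = rnorm(10), y = runif(10)))

    ## Slow: each rbind() re-copies everything accumulated so far
    res <- chunks[[1]]
    for (d in chunks[-1]) res <- rbind(res, d)

    ## Fast: one rbind over the whole list
    res2 <- do.call(rbind, chunks)
    identical(dim(res), dim(res2))   # TRUE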
_______________________________________________ R-SIG-Mac mailing list R-SIG-Mac at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-mac
Is the slowdown happening while mclapply runs, or while you're doing the rbind? If the latter, I wonder if the code below is more efficient than using rbind inside a loop:

    my_df = do.call(rbind, my_list_from_mclapply)
On Wed, Jun 29, 2011 at 3:34 PM, Vincent Aubanel <v.aubanel at laslab.org> wrote:
On Jun 29, 2011, at 2:59 PM, Mike Lawrence wrote:
Another potential issue is that data frames do many sanity checks, mostly due to row.names handling etc. If you don't use row.names *and* know in advance that the concatenation is benign *and* your data types are compatible, you can usually speed things up immensely by operating on lists instead and converting to a data frame at the very end, by declaring that the resulting list conforms to the data.frame class. Again, this only works if you really know what you're doing, but the speed-up can be very big (usually orders of magnitude). This is general advice, not specific to rbind. Whether it would work for you or not is easy to test -- something like:

    l = my_list_from_mclapply
    all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
    names(all) = names(l[[1]])
    attr(all, "row.names") = c(NA, -length(all[[1]]))
    class(all) = "data.frame"

Again, make sure all the assumptions above are satisfied before using it. Cheers, Simon
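[A self-contained, runnable version of this column-wise trick on synthetic data -- `my_list_from_mclapply` here is just a stand-in for an actual mclapply() result, and the same caveats apply: no meaningful row.names, identical column names and compatible types across all pieces.]

    ## Synthetic list of uniform data frames
    my_list_from_mclapply <- lapply(1:50, function(i)
      data.frame(a = rnorm(100), b = i + 1:100))

    l <- my_list_from_mclapply
    ## Concatenate column by column, staying in plain-list land
    all <- lapply(seq_along(l[[1]]), function(i)
      do.call(c, lapply(l, function(x) x[[i]])))
    names(all) <- names(l[[1]])
    ## c(NA, -n) is the compact internal form of row.names 1..n
    attr(all, "row.names") <- c(NA, -length(all[[1]]))
    class(all) <- "data.frame"

    stopifnot(nrow(all) == 50 * 100, identical(names(all), c("a", "b")))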
Thanks for this, it's now dead fast, as one could conceivably expect. Simon's solution is astonishingly fast; however, I had to reconstruct the factors and their levels, which were (expectedly) lost during the c() operation. Unfortunately this eats up a fair amount of CPU, but on a 14-column, ~2-million-row data frame it is still 2x faster than the elegant one-line solution. Some performance figures:
    t <- proc.time()
    dl <- mclapply(lsessions, mcfun, mc.cores=cores)
    print(proc.time()-t)
    utilisateur système écoulé   (user / system / elapsed)
        171.894     47.696  28.713
    l <- dl
    all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
    names(all) = names(l[[1]])
    #attr(all, "row.names") = seq.int(all[[1]])
    attr(all, "row.names") = c(NA, -length(all[[1]]))
    class(all) = "data.frame"
    utilisateur système écoulé
          0.412    0.280   0.708
all$factor <- factor(all$factor); levels(all$factor) <- c("A","B")
...
    utilisateur système écoulé
          4.852    2.349   7.038
my_df = do.call(rbind, dl)
    utilisateur système écoulé
          9.791    5.411  15.039
Thanks to both of you!
Vincent
On Jun 30, 2011, at 7:28 AM, Vincent Aubanel wrote:
Thanks for this, it's now dead fast, as one could conceivably expect. Simon's solution is astonishingly fast, however I had to reconstruct the factors and their levels which were (expectedly) lost during the c() operation.
One way to avoid it is to use as.character() on factors inside the parallel function, so the pieces don't contain factors. You can create the factor at the end, and it should be faster, because factor() calls as.character() anyway, so that part will be a no-op by that point. Cheers, S
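[A minimal sketch of this pattern, assuming a Unix-alike (mclapply with mc.cores > 1 does not work on Windows); `mcfun` here is a hypothetical worker, not the one from the thread. The workers return character columns, and the factor is built exactly once on the assembled result.]

    library(parallel)  # mclapply lives here in modern R (successor of multicore)

    ## Hypothetical worker: return character, not factor, columns
    mcfun <- function(i)
      data.frame(g = sample(c("A", "B"), 100, replace = TRUE),
                 v = rnorm(100),
                 stringsAsFactors = FALSE)

    dl  <- mclapply(1:8, mcfun, mc.cores = 2)
    big <- do.call(rbind, dl)

    ## Single factor construction at the very end -- no per-chunk
    ## conversion cost, and factor() sees a character vector already.
    big$g <- factor(big$g)
    stopifnot(is.factor(big$g), nlevels(big$g) <= 2)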
On 30 June 2011 at 15:36, Simon Urbanek wrote:
One way to avoid it is to use as.character() on factors inside the parallel function, so the pieces don't have factors. You can create a factor at the end and it should be faster, because factor() calls as.character() anyway so it will be a no-op by that point.
It is faster, thanks! Slightly faster for the parallel loop (because the unnecessary as.character() calls are removed), and the total time for converting to factors is down to about 3 s. I thought that keeping data as factors was somewhat more economical and faster than as character... Vincent
On Jun 30, 2011, at 11:19 AM, Vincent Aubanel wrote:
It is faster, thanks! Slightly for the parallel loop (because of removal of unnecessary as.character() operations) and down to about 3 s for the total time of converting into factors. I thought that maintaining data as factors was somewhat more economical and faster than as characters...
Not really -- ever since R started hashing strings, there is almost no difference on the storage side (it's an int vs. a void* for the actual vector, so on 64-bit machines the character vector even carries a penalty; on the other hand, you have direct access to the string elements). There are some differences in handling, so depending on the task, one may be better than the other. Cheers, S
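[A quick way to see this for yourself -- a hedged illustration, not part of the thread: because R interns strings in a shared cache, a character vector stores one pointer per element (8 bytes on a 64-bit build) while a factor stores one 4-byte integer code plus a small levels attribute.]

    x_chr <- sample(c("A", "B"), 1e6, replace = TRUE)
    x_fac <- factor(x_chr)

    object.size(x_chr)  # ~8 MB: pointers into the shared string cache
    object.size(x_fac)  # ~4 MB: integer codes, plus the tiny levels vector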