multicore package: collecting results

8 messages · Ben Bolker, Michael Lawrence, Vincent Aubanel +1 more

#
Hi all,

I'm using mclapply() from the multicore package for processing chunks of data in parallel -- and it works great.

But when I want to collect all the processed elements of the returned list into one big data frame, it takes ages.

The elements are all data frames with identical column names, and I'm using a simple rbind() inside a loop to combine them. But I guess it performs some expensive checks at each iteration, because it gets slower and slower as it goes. Writing individual files to disk, concatenating them with the shell, and reading the resulting file back is actually faster...

Is there a magic argument to rbind() that I'm missing, or is there any other solution to collect the results of parallel processing efficiently?

Thanks,
Vincent
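
For concreteness, the incremental pattern described here looks something like the sketch below (hypothetical names and toy data). Each rbind() call copies everything accumulated so far, so the total cost grows quadratically with the number of chunks:

```r
# Hypothetical stand-in for the list mclapply() would return
pieces <- lapply(1:100, function(i) data.frame(id = i, value = rnorm(1)))

# The slow pattern: each iteration re-copies the growing frame (O(n^2) overall)
big <- pieces[[1]]
for (p in pieces[-1]) big <- rbind(big, p)
```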
#
On 06/29/2011 02:34 PM, Vincent Aubanel wrote:
Why do you have to write to disk?  Can you collect the results as a
list L and then do.call(rbind,L)  in one go?
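
A minimal sketch of that approach, with toy data standing in for the mclapply() result (names are hypothetical):

```r
# A list of data frames with identical columns, as mclapply() would return
L <- lapply(1:3, function(i) data.frame(id = i, value = i * 10))

# Bind them all in one go instead of rbind-ing inside a loop
big <- do.call(rbind, L)
```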
#
Is the slowdown happening while mclapply runs or while you're doing
the rbind? If the latter, I wonder if the code below is more efficient
than using rbind inside a loop:

my_df <- do.call(rbind, my_list_from_mclapply)
On Wed, Jun 29, 2011 at 3:34 PM, Vincent Aubanel <v.aubanel at laslab.org> wrote:
#
On Jun 29, 2011, at 2:59 PM, Mike Lawrence wrote:

Another potential issue is that data frames perform many sanity checks, due to row.names handling etc. If you don't use row.names *and* know in advance that the concatenation is benign *and* your data types are compatible, you can usually speed things up immensely by operating on lists instead, converting to a data frame only at the very end by declaring that the resulting list conforms to the data.frame class. Again, this only works if you really know what you're doing, but the speed-up can be very big (usually orders of magnitude). This is general advice, not specific to rbind. Whether it works for you is easy to test - something like

l <- my_list_from_mclapply
# for each column, concatenate that column across all chunks
all <- lapply(seq_along(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
names(all) <- names(l[[1]])
# compact row.names form: c(NA, -n) stands for 1:n
attr(all, "row.names") <- c(NA, -length(all[[1]]))
class(all) <- "data.frame"

Again, make sure all the assumptions above are satisfied before using.
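A self-contained illustration of the trick above, with toy chunks standing in for the mclapply() output (assumes identical column names and compatible types across chunks, and no meaningful row names):

```r
# Hypothetical chunks as they might come back from mclapply()
l <- list(data.frame(a = 1:2, b = c(10, 20)),
          data.frame(a = 3:4, b = c(30, 40)))

# Concatenate each column across chunks, then stamp on the data.frame class
all <- lapply(seq_along(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
names(all) <- names(l[[1]])
attr(all, "row.names") <- c(NA, -length(all[[1]]))  # compact form meaning 1:n
class(all) <- "data.frame"
```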

Cheers,
Simon
#
Thanks for this -- it's now dead fast, as one could conceivably expect.
Simon's solution is astonishingly fast; however, I had to reconstruct the factors and their levels, which were (expectedly) lost during the c() operation. Unfortunately this eats up a fair amount of CPU, but on a data frame with 14 columns and ~2 million rows it is still 2x faster than the elegant one-line solution.

Some performance figures (R timings in a French locale: user / system / elapsed, in seconds):

utilisateur     système      écoulé 
    171.894      47.696      28.713
utilisateur     système      écoulé 
      0.412       0.280       0.708
...
utilisateur     système      écoulé 
      4.852       2.349       7.038
utilisateur     système      écoulé 
      9.791       5.411      15.039 

Thanks to both of you!

Vincent


On 29 June 2011, at 21:48, Simon Urbanek wrote:
#
On Jun 30, 2011, at 7:28 AM, Vincent Aubanel wrote:

One way to avoid it is to call as.character() on factor columns inside the parallel function, so the pieces don't contain factors. You can create the factors at the end, and it should be faster, because factor() calls as.character() anyway, so by that point it is a no-op.

Cheers,
S
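
A rough sketch of that idea (hypothetical worker and column names, with toy data):

```r
# Hypothetical worker: return text columns as character, not factor
process_chunk <- function(chunk) {
  chunk$label <- as.character(chunk$label)
  chunk
}

chunks <- lapply(list(data.frame(label = c("a", "b")),
                      data.frame(label = c("a", "c"))),
                 process_chunk)
big <- do.call(rbind, chunks)

# Convert to factor once, at the very end
big$label <- factor(big$label)
```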
#
On 30 June 2011, at 15:36, Simon Urbanek wrote:
It is faster, thanks! Slightly faster for the parallel loop (because the unnecessary as.character() operations are removed), and down to about 3 s total for converting into factors. I thought that keeping data as factors was somewhat more economical and faster than keeping it as character...

Vincent
#
On Jun 30, 2011, at 11:19 AM, Vincent Aubanel wrote:

Not really - ever since R started hashing strings there is almost no difference on the storage side (it's int vs. void* for the actual vector, so on 64-bit machines there is a penalty, but on the other hand you get direct access to the string elements). There are some differences in handling, so depending on the task one may be better than the other.

Cheers,
S