Matrix multiplication
On other machines, I might use a multithreaded BLAS like GotoBLAS so that I have some flexibility (though apparently unlike Claudia, I rarely change it in practice).
:-) Yes, I do change it in practice, because I have steps where I use explicit parallelization via multicore or snow, and I switch between the three different parallel computation types.

Our server has 2 hex-core CPUs but only 8 GB RAM. The spectroscopic data analysis I do usually isn't really hard computationally, but the data sets are often uncomfortably large for the server. With explicit parallelization, RAM often restricts me to 2 or 3 threads.

Here's what I observe and why I switch back and forth:

If the calculation is implicitly parallel with the optimized BLAS, that's the way to go: easiest on RAM, fast, no coding effort whatsoever. Just lean back and enjoy seeing all cores hard at work. Functions like %*% and (t)crossprod() use all 12 cores (or however many I restrict NUM_GOTO_THREADS to).

Other functions, e.g. loess(), never seem to use more than the 6 cores of one CPU. For those, I'm better off with explicit parallelization: 2 snow nodes, each with NUM_GOTO_THREADS = 6 (I have to execute taskset on each node). However, snow (and multicore) need more RAM, as the data must be loaded on each node. That would mean e.g. NUM_GOTO_THREADS = 11 (leaving an "alibi core" for my colleague) in the main R session, and e.g. 2 nodes with NUM_GOTO_THREADS = 6, or 3 nodes with NUM_GOTO_THREADS = 4.

Multicore doesn't make use of the implicit parallelization of the BLAS, but it is easier to use than snow: no cluster setup required, no hassle with exporting all variables, etc. So if the function doesn't have any implicit parallelization anyway, I just change lapply to mclapply, and that's it.

Best,
Claudia
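For the implicit route, nothing in the R code has to change; only the environment does. A minimal sketch (the environment-variable name is the one used in this thread; other BLAS builds spell it differently, e.g. OPENBLAS_NUM_THREADS):

```r
## Sketch: implicit BLAS parallelism. With a threaded BLAS linked into
## R, ordinary linear algebra parallelizes by itself -- no parallel
## code at all. The thread count is set in the shell *before* R starts:
##   export NUM_GOTO_THREADS=12   # name as used in this thread
n <- 500
x <- matrix(rnorm(n * n), n, n)

y  <- x %*% x        # BLAS matrix product: runs on all allowed threads
xx <- crossprod(x)   # t(x) %*% x in a single (threaded) BLAS call

stopifnot(all.equal(xx, t(x) %*% x))
```

crossprod(x) is preferable to t(x) %*% x here because it avoids materializing the transpose and maps to one BLAS routine.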
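The explicit snow route might look like the sketch below. It uses base R's parallel package (which absorbed snow's socket-cluster interface); the object name big_data is illustrative, and the per-node taskset/thread pinning is only indicated in a comment:

```r
## Sketch: explicit parallelization over 2 socket nodes, as in the
## 2-node setup described above.
library(parallel)

cl <- makeCluster(2)   # the cluster setup step snow-style clusters need

## Workers are fresh R sessions: every object they use must be exported,
## and each node then holds its own copy of the data -- exactly the
## extra RAM cost mentioned in the post.
big_data <- replicate(6, matrix(rnorm(1e4), nrow = 100), simplify = FALSE)
clusterExport(cl, "big_data")

## On each node one could additionally set the BLAS thread count and pin
## the node to one CPU, e.g. via Sys.setenv() and system("taskset ...").
res <- parLapply(cl, seq_along(big_data),
                 function(i) crossprod(big_data[[i]]))
stopCluster(cl)
```

The export step is the "hassle with exporting all variables" mentioned above, and also the reason RAM, not cores, limits the node count.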
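The multicore route really is a one-line change, as described. multicore was later absorbed into base R's parallel package, so mclapply() is sketched from there (forking is POSIX-only; on Windows mc.cores must stay 1):

```r
## Sketch: the multicore-style drop-in. mclapply() forks the running
## session, so there is no cluster setup and no exporting of variables;
## workers see the parent's objects via copy-on-write.
library(parallel)

f <- function(i) i^2 + 1   # stand-in for the real per-item work

serial_res <- lapply(1:8, f)
mc_res     <- mclapply(1:8, f, mc.cores = 2)   # the only change vs. lapply

stopifnot(identical(serial_res, mc_res))
```

Because the workers are forks, the big data set is not duplicated up front as with snow, though copy-on-write means memory use can still grow if the workers modify the data.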