[Rcpp-devel] examples of using cula matrix multiplication in Rcpp

On 5/18/2015 15:12, Dale Smith wrote:
I also think the focus on the high-level approach is often the right 
choice, at least initially.

Using either CUDA or OpenCL directly adds a lot of repetitive 
boilerplate code, often with no performance benefit over the 
higher-level solutions unless you actively use the fine-tuning it 
allows. (This really shouldn't need restating, but I still occasionally 
encounter folks expecting "lower-level" (read: longer) code to be 
somehow automagically faster.) At the same time, having to deal with 
the lower-level details makes the whole experience more error-prone, 
e.g., due to manual resource management, which, again, will not make 
your code any faster unless you're explicitly fine-tuning it yourself.

Personally, I've had a good experience with C++ AMP (hardware-vendor 
independent; note: the last time I used it, it was more polished on 
MSFT platforms, although an open-source Linux implementation is 
available) and Thrust (CUDA / NVIDIA hardware): http://thrust.github.io/
SYCL looks (I have yet to try it out) like an OpenCL equivalent of 
Thrust, and its parallel STL implementation looks quite promising: 
https://github.com/KhronosGroup/SyclParallelSTL
// OpenCL-based Boost.Compute has been recently accepted to Boost: 
https://github.com/boostorg/compute
(The flip side of the OpenCL-based options is that NVIDIA hasn't 
historically kept the OpenCL drivers for its cards very up to date... 
perhaps this will change with the improvements necessary for CUDA 7, 
as well as the requirements for implementing the Vulkan API.)

In other words, instead of starting directly with CUDA, I'd suggest 
starting with Thrust -- analogously, instead of jumping straight to raw 
OpenCL, I'd probably start with SYCL Parallel STL (or Boost.Compute?).
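
To illustrate what "high-level" means here, a minimal sketch in plain 
standard C++ (CPU only, no GPU toolkit required): a SAXPY expressed as a 
single std::transform call. Thrust provides a thrust::transform algorithm 
with the same iterator-based shape, operating on its own containers such 
as thrust::device_vector, so code in this style is roughly what one would 
write there too. The helper name saxpy and the choice of operation are 
assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of any library): returns a * x + y,
// computed element-wise.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    // One algorithm call replaces the explicit loop. There is no manual
    // buffer allocation, kernel launch, or cleanup code to get wrong.
    std::transform(x.begin(), x.end(), y.begin(), out.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
    return out;
}
```

Calling saxpy(2.0f, x, y) returns a new vector and leaves both inputs 
untouched; all resource management is handled by std::vector.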

There are plenty of high-level GPGPU solutions available for C++; here 
are some good overviews:
http://www.soa-world.de/echelon/2014/04/c-accelerator-libraries.html // 
multiple reviews: http://www.soa-world.de/echelon/
http://arxiv.org/abs/1212.6326

What I haven't seen is any study of integrating these with R (I've only 
used standalone C++ code for GPGPU); that could be interesting.
In my experience, the "best" use case (in the sense of being the 
lowest-hanging fruit) would be an embarrassingly parallel problem; for 
examples, see:
http://en.wikipedia.org/wiki/Embarrassingly_parallel
Naturally, the larger the workload, the higher the chance of the 
speed-up exceeding the data transfer costs.
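
The trade-off between speed-up and transfer cost can be sketched with a 
deliberately simple linear cost model. Everything in it is an assumption 
for illustration: the struct OffloadCosts, its fields, and the idea that 
each cost scales linearly with the number of elements are hypothetical, 
not measurements of any real device.

```cpp
#include <cassert>

// Hypothetical cost parameters, all in the same arbitrary time unit.
struct OffloadCosts {
    double cpu_per_item;       // time per element on the CPU
    double gpu_per_item;       // time per element on the GPU
    double transfer_per_item;  // time per element copied host <-> device
    double fixed_overhead;     // per-call setup / launch latency
};

// Time to process n elements entirely on the CPU.
double cpu_time(const OffloadCosts& c, double n) {
    return n * c.cpu_per_item;
}

// Time to process n elements on the GPU, including data transfer.
double gpu_time(const OffloadCosts& c, double n) {
    return c.fixed_overhead + n * (c.gpu_per_item + c.transfer_per_item);
}

// True when offloading is faster than staying on the CPU for n elements.
bool offload_pays_off(const OffloadCosts& c, double n) {
    assert(n >= 0.0);
    return gpu_time(c, n) < cpu_time(c, n);
}
```

Under this model, offloading wins only once n exceeds fixed_overhead / 
(cpu_per_item - gpu_per_item - transfer_per_item), and never if that 
denominator is not positive, i.e. if copying an element costs more than 
the GPU saves on it.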

Best,

Matt
