
[Rcpp-devel] examples of using cula matrix multiplication in Rcpp

10 messages · Sean O'Riordain, Dirk Eddelbuettel, Yue Li +5 more

#
Dear List,

I wonder if anyone worked on incorporating CULA tools library functionality into Rcpp. How much speed gain on top of Rcpp do we expect on basic operation like matrix multiplication?

In particular, I'm currently using RcppArmadillo to seamlessly perform matrix multiplication, but the speed gain over my plain R implementation is only 5-10x, if that.

I'm wondering if there is an equivalent easy-to-use library for doing GPU-enabled matrix multiplication. A complete simple example would be greatly appreciated.

Thanks in advance,
Yue
#
On 16 May 2015 at 11:46, Yue Li wrote:
| I wonder if anyone worked on incorporating CULA tools library functionality into Rcpp. How much speed gain on top of Rcpp do we expect on basic operation like matrix multiplication?
| 
| In particular, I'm currently using RcppArmadillo to seamlessly perform matrix multiplication, but the speed gain over my plain R implementation is only 5-10x, if that.
| 
| I'm wondering if there is an equivalent easy-to-use library for doing GPU-enabled matrix multiplication. A complete simple example would be greatly appreciated.

A few years ago I did some work on the 'gcbd' package to time and benchmark
precisely these types of things: because results depend on the hardware used
for the gpu, the hardware used for the cpu, the compiler, the BLAS/LAPACK
library, the OS, etc., I worked out a framework to benchmark these things
and compare them.

So have a look at this package and its vignette: it at least times several
BLAS libraries against the gpu card I had (have).

In general, I think its conclusion stands. You "waste" so much time copying
data over to the gpu that any computation gain is dwarfed until you get to
truly enormous (and unusual) matrix sizes.  So gpus are still good for things
like a limited (maybe one-time) transfer followed by lots of iterations: some
finance applications with Monte Carlo pricing come to mind, anything MCMC and
of course the whole 'deep learning' complex.

And with that: no, as far as I know nobody has tightly integrated Rcpp and
gpu computing as it simply is not that clearly a match.

That's my $0.02. More comments welcome, particularly with benchmarks.

Dirk
#
Some students I have been working with managed to get Rcpp to work with
Cuda for a simple use case - calculating a big log-likelihood for MCMC -
and they got a bit of a speedup compared with Rcpp - but it needs more
work.  They promised they would write up a note for the gallery once their
exams are over in a couple of weeks.

Sean
On 16 May 2015 at 16:56, Dirk Eddelbuettel <edd at debian.org> wrote:
#
On 16 May 2015 at 17:05, Sean O'Riordain wrote:
| Some students I have been working with managed to get Rcpp to work with Cuda
| for a simple use case - calculating a big log-likelihood for MCMC - and they
| got a bit of a speedup compared with Rcpp - but it needs more work.  They
| promised they would write up a note for the gallery once their exams are over
| in a couple of weeks.

That is splendid news!

I better make sure I can compile with CUDA then or else building the article
may be tricky.

Dirk
#
Thanks for the quick insightful replies! I will look into the solutions and keep the list posted on any progress on this end.

Yue
#
I've been playing around with Rcpp and CUDA (cuBLAS and MAGMA in particular) for quite a while now and definitely find it useful for improving performance. My interest is mostly in spatial models and Gaussian processes, where the rate-limiting step is usually an O(n^3) matrix decomposition with n between 1,000 and 5,000.

For these types of tasks I routinely see ~2x improvements over RcppArmadillo & OpenBLAS using a $100 consumer-grade card, which isn't huge but makes a big difference when the overall runtime is around 80 hours per model.

If anyone is interested in looking at some code, I have the early stages of a package up on github: https://github.com/rundel/RcppGP. In particular the gpu_mat class has a reasonably mature interface for moving data between Armadillo and cuBLAS.

-Colin

-----

Colin Rundel
Assistant Professor of the Practice
Duke University, Department of Statistical Science
www.stat.duke.edu/~cr173/
1 day later
#
I'm not a big fan of GPU computing, for many of the reasons Dirk mentions below and something else I discovered while taking a Coursera class last winter.

CUDA requires significant effort to keep your skills up unless you use it semi-regularly or more often. It has a very steep learning curve, and I can't climb that curve at this point in my working life. An occasional user may want to skip CUDA and investigate OpenACC or something related. Do what works best for you. I'll investigate rCUDA, PyCUDA, OpenACC, etc., and leave the lower-level stuff to others.

I'd like to reiterate that by far the most difficult thing about working with GPU technology is efficiently moving data on and off the card. Do you have a rigorously established use case for using GPU technology?

I'm skeptical that tying Rcpp to CUDA is something lots of people should do, but give it a try if you have the expertise and can make the use case. Moving data on and off the card is a third layer between you and the computations.

Dale Smith, Ph.D.
Data Scientist

d. 404.495.7220 x 4008   f. 404.795.7221
Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA 30305


#
I am actually working on a general-purpose GPU library for R using Rcpp and
RcppArmadillo, but it is still under heavy development.  During these very
early stages I have had an 'older' card (an AMD Radeon HD 5700 Series), so I
have been working primarily with OpenCL and the clBLAS library (which must
be installed).  The idea has been to create the easiest possible (at least
from my perspective) interface for leveraging GPUs.  I will be receiving a
newer NVIDIA card (a GeForce GTX 970), so I will begin adding in CUDA as well
(I fully intend to have a hybrid system similar to the arrayfire
library).  You can view it on my github as well:
https://github.com/cdeterman/gpuR

As I said, keep in mind that it is still under heavy development, many more
functions need to be added, and it is only available for Linux at the
moment.  It is designed to provide a new class structure similar to the
'Matrix' package.  You can see an example of vector addition on the github
page, but an example of matrix multiplication would be:

# convert matrix to gpuMatrix object
gpuA <- gpuMatrix(A)
gpuB <- gpuMatrix(B)

# matrix multiplication
gpuC <- gpuA %*% gpuB


Also, if a user is looking into GPGPU, they are likely dealing with 'big
data', so this package is also intended to be used in concert with the
'bigmemory' package via the 'gpuBigMatrix' function, where the idea is to
provide a full interface when matrices exceed local memory size (obviously
slower, but useful for those without access to expensive hardware).  There
is also support for 'integer', 'float', and 'double' data types if the
default R 'double' precision is not required (for an additional speed-up).

With my older card, and using code that could likely be optimized further,
Dirk is correct that the data transfer time is very significant.  My initial
benchmarks can exceed R's native BLAS (not much to celebrate) but are
clearly bested by just using OpenBLAS.  Also, as Dirk mentions, as the size
of the matrix increases the performance gap shrinks until the GPU wins out.
Again, my initial benchmarks show that the gpuMatrix multiplication does
eventually beat OpenBLAS consistently once matrices approach sizes of
2048x2048.  I am optimistic, however, with the use of a newer card and
beginning to apply some CUDA.  Once I have more functions and the CUDA
interface set up, I intend to submit it to CRAN.  I am always open to
comments, concerns, and/or contributions :)

Regards,
Charles
On Sat, May 16, 2015 at 2:58 PM, Colin Rundel <rundel at gmail.com> wrote:
#
I have played with CUDA for some time; here are my brief comments.

(1) The simplest way to use CUDA with R/Armadillo is to use nvblas. You can
see the demo on page 21 of [1].

(2) The speedup may not be as good as expected sometimes (at least in my own
experiments).

Best wishes,

KK

[1]
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
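For reference, nvblas works as a drop-in interception layer for BLAS3 calls: you point it at a CPU BLAS for fallback via a config file and preload it before starting R. The paths and library names below are illustrative assumptions and will differ per system; see the CUDA documentation linked above for the full set of options.

```shell
# nvblas.conf -- minimal example (paths are illustrative assumptions):
#   NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so   # CPU BLAS fallback for small/unsupported calls
#   NVBLAS_GPU_LIST ALL                           # use every visible GPU
#   NVBLAS_AUTOPIN_MEM_ENABLED                    # pin host memory for faster transfers

# Then start R with nvblas preloaded so its dgemm etc. are intercepted:
NVBLAS_CONFIG_FILE=/path/to/nvblas.conf LD_PRELOAD=libnvblas.so R
```

The appeal is that no R or C++ code changes at all; plain `%*%` calls get routed to the GPU.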
On Sat, May 16, 2015 at 11:46 AM, Yue Li <gorillayue at gmail.com> wrote:
#
On 5/18/2015 15:12, Dale Smith wrote:
I also think the focus on the high-level approach is often the right 
choice, at least initially.

Using either CUDA or OpenCL directly adds a lot of repetitive (and
redundant) boilerplate code -- oftentimes (unless you actually make active
use of the fine-tuning this allows) with no performance benefit compared to
the higher-level solutions (this really shouldn't need (re)stating, but I
still occasionally encounter folks expecting "lower level" -- read: longer
-- code to be somehow automagically faster). At the same time, having to
deal with the lower-level details can also make the whole experience more
error-prone (e.g., due to manual resource management -- which, again,
unless you're explicitly fine-tuning it yourself, will not make your code
automagically faster).

Personally, I've had a good experience with C++ AMP (hardware-vendor
independent; note: the last time I used it, it was more polished on MSFT
platforms, although an open-source Linux implementation is available)
and Thrust (CUDA / NVIDIA hardware): http://thrust.github.io/
SYCL looks (I've yet to try it out) like an OpenCL equivalent of Thrust
-- and its parallel STL implementation looks quite promising:
https://github.com/KhronosGroup/SyclParallelSTL
// OpenCL-based Boost.Compute has been recently accepted to Boost: 
https://github.com/boostorg/compute
(The flip side being that NVIDIA hasn't historically kept OpenCL drivers 
for its cards very much up-to-date... perhaps this will change with 
improvements necessary for CUDA 7, as well as requirements needed to 
implement Vulkan API.)

In other words, instead of starting directly with CUDA, I'd suggest 
starting with Thrust -- analogously, instead of jumping straight to raw 
OpenCL, I'd probably start with SYCL Parallel STL (or Boost.Compute?).

There's plenty of high-level GPGPU solutions available for C++, here are 
some good overviews:
http://www.soa-world.de/echelon/2014/04/c-accelerator-libraries.html // 
multiple reviews: http://www.soa-world.de/echelon/
http://arxiv.org/abs/1212.6326

What I haven't seen is any study of integrating these with R (I've only 
used standalone C++ code for GPGPU), could be interesting.
In my experience, the "best" use case (in terms of being the 
lowest-hanging-fruit) would be an embarrassingly parallel problem; for 
examples, see:
http://en.wikipedia.org/wiki/Embarrassingly_parallel
Naturally, the larger the workload, the higher the chance of the 
speed-up exceeding the data transfer costs.

Best,

Matt
