MKL Acceleration encouraging; need adjust package builds?

4 messages · David Smith, Dirk Eddelbuettel, Paul Johnson

#
Dear R-devel:

The cluster administrators at KU got enthusiastic about testing
R-3.2.2 with Intel MKL when I asked for some BLAS integration.  Below
I forward a performance report, which is encouraging, and thought you
would like to know the numbers.  To my untrained eye, there appear to
be some extraordinary speedups on Cholesky decomposition, determinants,
and matrix inversion.

They had difficulty getting R to compile with R's shared BLAS (I don't
know what went wrong there), so they went the other direction and built
R without the shared BLAS library.

In his message to me, the technician says that I should consider
adjusting the compilation flags on the packages that use BLAS.

1. Do you think that is needed? R is compiled with non-shared BLAS
libraries; won't packages know where to look for BLAS headers?

2. If I need to do that, I wonder how to do it and which packages need
attention: the Eigen and Armadillo packages, and possibly the ones that
depend on them, such as lme4 and anything flowing through Rcpp.

Here's the build for some packages. Are they finding MKL BLAS?  How
would I know?

* installing *source* package 'RcppArmadillo' ...
** package 'RcppArmadillo' successfully unpacked and MD5 sums checked
* checking LAPACK_LIBS: divide-and-conquer complex SVD available via
system LAPACK
** libs
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
 -I../inst/include -fpic  -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic    -c RcppArmadillo.cpp -o RcppArmadillo.o
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
 -I../inst/include -fpic  -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic    -c RcppExports.cpp -o RcppExports.o
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
 -I../inst/include -fpic  -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic    -c fastLm.cpp -o fastLm.o
g++ -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib
-L/usr/local/lib64 -o RcppArmadillo.so RcppArmadillo.o RcppExports.o
fastLm.o -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread
-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm
-L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR
installing to /panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/RcppArmadillo/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (RcppArmadillo)

* installing *source* package 'RcppEigen' ...
** package 'RcppEigen' successfully unpacked and MD5 sums checked
** libs
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
 -I../inst/include -fpic  -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic    -c RcppEigen.cpp -o RcppEigen.o
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
 -I../inst/include -fpic  -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic    -c RcppExports.cpp -o RcppExports.o
g++ -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include
-I"/panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/Rcpp/include"
 -I../inst/include -fpic  -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic    -c fastLm.cpp -o fastLm.o
g++ -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib
-L/usr/local/lib64 -o RcppEigen.so RcppEigen.o RcppExports.o fastLm.o
-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread
-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm
-L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR
installing to /panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/RcppEigen/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (RcppEigen)

* installing *source* package 'MatrixModels' ...
** package 'MatrixModels' successfully unpacked and MD5 sums checked
** R
** preparing package for lazy loading
Creating a generic function for 'resid' from package 'stats' in
package 'MatrixModels'
Creating a generic function for 'fitted.values' from package 'stats'
in package 'MatrixModels'
Creating a generic function for 'coefficients' from package 'stats' in
package 'MatrixModels'
Creating a generic function for 'formula' from package 'stats' in
package 'MatrixModels'
Creating a generic function for 'coef' from package 'stats' in package
'MatrixModels'
Creating a generic function for 'fitted' from package 'stats' in
package 'MatrixModels'
Creating a generic function for 'residuals' from package 'stats' in
package 'MatrixModels'
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (MatrixModels)
* installing *source* package 'quantreg' ...
** package 'quantreg' successfully unpacked and MD5 sums checked
** libs
gfortran   -fpic  -g -O2  -c akj.f -o akj.o
gfortran   -fpic  -g -O2  -c boot.f -o boot.o
gfortran   -fpic  -g -O2  -c brute.f -o brute.o
gcc -std=gnu99 -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include    -fpic
-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64  -c chlfct.c -o
chlfct.o
gfortran   -fpic  -g -O2  -c cholesky.f -o cholesky.o
gfortran   -fpic  -g -O2  -c combos.f -o combos.o
gfortran   -fpic  -g -O2  -c crq.f -o crq.o
gfortran   -fpic  -g -O2  -c crqfnb.f -o crqfnb.o
gfortran   -fpic  -g -O2  -c dsel05.f -o dsel05.o
gfortran   -fpic  -g -O2  -c etime.f -o etime.o
gfortran   -fpic  -g -O2  -c extract.f -o extract.o
gfortran   -fpic  -g -O2  -c idmin.f -o idmin.o
gfortran   -fpic  -g -O2  -c iswap.f -o iswap.o
gfortran   -fpic  -g -O2  -c kuantile.f -o kuantile.o
gcc -std=gnu99 -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include    -fpic
-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64  -c mcmb.c -o
mcmb.o
gfortran   -fpic  -g -O2  -c penalty.f -o penalty.o
gfortran   -fpic  -g -O2  -c powell.f -o powell.o
gfortran   -fpic  -g -O2  -c rls.f -o rls.o
gfortran   -fpic  -g -O2  -c rq0.f -o rq0.o
gfortran   -fpic  -g -O2  -c rq1.f -o rq1.o
gfortran   -fpic  -g -O2  -c rqbr.f -o rqbr.o
gfortran   -fpic  -g -O2  -c rqfn.f -o rqfn.o
gfortran   -fpic  -g -O2  -c rqfnb.f -o rqfnb.o
gfortran   -fpic  -g -O2  -c rqfnc.f -o rqfnc.o
gfortran   -fpic  -g -O2  -c rqs.f -o rqs.o
gfortran   -fpic  -g -O2  -c sparskit2.f -o sparskit2.o
gcc -std=gnu99 -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include    -fpic
-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64  -c srqfn.c -o
srqfn.o
gcc -std=gnu99 -I/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/include
-I/usr/local/include    -fpic
-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64  -c srqfnc.c -o
srqfnc.o
gfortran   -fpic  -g -O2  -c srtpai.f -o srtpai.o
gcc -std=gnu99 -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib
-L/usr/local/lib64 -o quantreg.so akj.o boot.o brute.o chlfct.o
cholesky.o combos.o crq.o crqfnb.o dsel05.o etime.o extract.o idmin.o
iswap.o kuantile.o mcmb.o penalty.o powell.o rls.o rq0.o rq1.o rqbr.o
rqfn.o rqfnb.o rqfnc.o rqs.o sparskit2.o srqfn.o srqfnc.o srtpai.o
-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread
-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm
-lgfortran -lm -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR
installing to /panfs/pfs.acf.ku.edu/crmda/tools/lib64/R/3.2/site-library/quantreg/libs
** R
** data
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (quantreg)
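
One rough way to answer "are they finding MKL BLAS?" from logs like the ones above is to look at the final link step of each package and list the -l libraries it names. The sketch below does that in Python for the quantreg link command copied from the log (with the long object-file list shortened). The helper names linked_libraries and mkl_libraries are made up for this illustration. A link line only shows what the linker was asked for, so inspecting the installed .so with a tool such as ldd would be a sensible second check.

```python
import shlex


def linked_libraries(link_command):
    """Return the -l library names, in order, from a compiler link command."""
    return [tok[2:] for tok in shlex.split(link_command) if tok.startswith("-l")]


def mkl_libraries(link_command):
    """Return only the libraries whose name starts with 'mkl'."""
    return [lib for lib in linked_libraries(link_command) if lib.startswith("mkl")]


# Link step for quantreg.so as shown in the log above
# (the list of object files is shortened to two entries here).
quantreg_link = (
    "gcc -std=gnu99 -shared -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib "
    "-L/usr/local/lib64 -o quantreg.so akj.o boot.o "
    "-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64 "
    "-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread "
    "-lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm -lgfortran -lm "
    "-lgfortran -lm -L/tools/cluster/6.2/R/3.2.2_mkl/lib64/R/lib -lR"
)

print(mkl_libraries(quantreg_link))
```

The same three MKL libraries appear in the RcppArmadillo and RcppEigen link steps, which suggests the packages picked up the BLAS settings recorded when R was configured.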


pj



Hi PJ,

We're still running the benchmarks to quantify the performance increase.

The R benchmarks for the MKL version are promising. The performance increase
varies from test to test, but there isn't any degradation in performance from
using the MKL version. You can expect a 2x to 10x performance increase
depending on the matrix calculations you are performing. Here are the
compilation arguments we used for compiling R with MKL:

--disable-BLAS-shlib
--with-blas="-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core
-Wl,--end-group -fopenmp -ldl -lpthread -lm" --with-lapack

You may want to include these while recompiling R packages which use BLAS.


Here are the results of the benchmark for the standard R 3.2.2:

R Benchmark 2.5
===============
Number of times each test is run__________________________: 3

I. Matrix calculation
---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec): 2.69466666666667
2400x2400 normal distributed random matrix ^1000____ (sec): 1.42433333333333
Sorting of 7,000,000 random values__________________ (sec): 2.34466666666667
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 33.187
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 14.52
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 4.51008013606039

II. Matrix functions
--------------------
FFT over 2,400,000 random values____________________ (sec): 1.203
Eigenvalues of a 640x640 random matrix______________ (sec): 1.60599999999999
Determinant of a 2500x2500 random matrix____________ (sec): 7.64266666666667
Cholesky decomposition of a 3000x3000 matrix________ (sec): 8.05900000000001
Inverse of a 1600x1600 random matrix________________ (sec): 8.64166666666667
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 4.62477425061321

III. Programmation
------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec): 1.25633333333335
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 0.894999999999982
Grand common divisors of 400,000 pairs (recursion)__ (sec): 1.714
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 1.4013333333333
Escoufier's method on a 45x45 matrix (mixed)________ (sec): 2.041
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.44505946077978


Total time for all 15 tests_________________________ (sec): 88.6306666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec): 3.11209972260597
--- End of test ---


Here are the results for the MKL version:

R Benchmark 2.5
===============
Number of times each test is run__________________________: 3

I. Matrix calculation
---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec): 2.88466666666667
2400x2400 normal distributed random matrix ^1000____ (sec): 1.45933333333333
Sorting of 7,000,000 random values__________________ (sec): 2.35166666666667
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 3.37233333333333
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 1.68666666666666
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 2.25337542617509

II. Matrix functions
--------------------
FFT over 2,400,000 random values____________________ (sec): 1.232
Eigenvalues of a 640x640 random matrix______________ (sec): 0.823333333333333
Determinant of a 2500x2500 random matrix____________ (sec): 1.752
Cholesky decomposition of a 3000x3000 matrix________ (sec): 1.417
Inverse of a 1600x1600 random matrix________________ (sec): 1.33833333333334
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.32693082905282

III. Programmation
------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec): 1.28600000000001
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 1.00833333333334
Grand common divisors of 400,000 pairs (recursion)__ (sec): 1.82266666666666
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 1.40533333333334
Escoufier's method on a 45x45 matrix (mixed)________ (sec): 1.91199999999998
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.48790723568791


Total time for all 15 tests_________________________ (sec): 25.7516666666667
Overall mean (sum of I, II and III trimmed means/3)_ (sec): 1.64469699141649
--- End of test ---
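
To make the two reports easier to compare, the per-test speedup can be computed as the standard-R time divided by the MKL time. The sketch below does this in Python for a selection of tests, with the timings copied from the reports above. The shortened test labels and the speedup helper are just for this illustration.

```python
# Timings in seconds copied from the two R Benchmark 2.5 reports above.
# Keys are shortened test names; values are (standard R 3.2.2, MKL build).
timings = {
    "cross-product 2800x2800": (33.187, 3.37233333333333),
    "linear regression 3000x3000": (14.52, 1.68666666666666),
    "eigenvalues 640x640": (1.60599999999999, 0.823333333333333),
    "determinant 2500x2500": (7.64266666666667, 1.752),
    "Cholesky 3000x3000": (8.05900000000001, 1.417),
    "inverse 1600x1600": (8.64166666666667, 1.33833333333334),
    "FFT 2,400,000 values": (1.203, 1.232),
    "sorting 7,000,000 values": (2.34466666666667, 2.35166666666667),
    "total, all 15 tests": (88.6306666666667, 25.7516666666667),
}


def speedup(reference_sec, mkl_sec):
    """Return how many times faster the MKL run was (values > 1 mean faster)."""
    return reference_sec / mkl_sec


speedups = {name: speedup(ref, mkl) for name, (ref, mkl) in timings.items()}

# Print the tests from largest to smallest speedup.
for name, ratio in sorted(speedups.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:30s} {ratio:5.2f}x")
```

On these numbers the dense linear algebra tests land between roughly 2x and 10x, the total run time improves by about 3.4x, and tests such as FFT and sorting stay close to 1x.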
#
Hi Paul,

We've been through this process ourselves for the Revolution R Open project. There are a number of pitfalls to avoid, but you can take a look at how we achieved it in the build scripts at:

https://github.com/RevolutionAnalytics/RRO

There are also some very useful notes in the R Installation and Administration manual:
https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS

Most packages do benefit from MKL (or any multi-threaded BLAS) to some degree, although the actual benefit depends on the R functions they call. Some packages (and some built-in R functions) don't call into BLAS endpoints, so you won't see benefits in all cases.

# David Smith
#
We said it before, but it bears repeating: BLAS is an interface.

So unless you use a static library build, these libraries can be switched
after compilation and at essentially any point in time.  My (unfinished)
package gcbd shows how in its vignette by comparing a number of
BLAS implementations.  See the (now dated) chart on page 9 of
  https://cran.rstudio.com/web/packages/gcbd/vignettes/gcbd.pdf
or this (old) blog post
  http://dirk.eddelbuettel.com/blog/2010/10/03/

While the charts could do with an update, they do show how, e.g., reference BLAS
is clearly outperformed by ATLAS or GotoBLAS (the predecessor to OpenBLAS).

Hope this helps,  Dirk
1 day later
#
On Mon, Nov 23, 2015 at 11:39 AM, David Smith <davidsmi at microsoft.com> wrote:
Dear David

I'm in the situation mentioned here in the docs, since BLAS is not shared.

"Note that under Unix (but not under Windows) if R is compiled against
a non-default BLAS and --enable-BLAS-shlib is not used, then all
BLAS-using packages must also be. So if R is re-built to use an
enhanced BLAS then packages such as quantreg will need to be
re-installed."

I am building all of the modules from scratch, so if the default build
is sufficient, then I'll be done. When I asked the other day, I was
worried that packages would find the wrong shared library. As far as I
can tell now, I should not have been so worried.

Today, while browsing the R installation, I found the Makeconf file,
which has all the information a package should need.  I've verified
that the quantreg package detects this information, and we'll just
hope the others do too :)

In case anybody else comes along later and wonders how R can be
configured to make this go, here's the top of our Makeconf from the
installed R, which has the configure line as well as BLAS_LIBS, which,
so far as I can tell, is making all of this go.

Makeconf content

# etc/Makeconf.  Generated from Makeconf.in by configure.
#
# ${R_HOME}/etc/Makeconf
# R was configured using the following call
# (not including env. vars and site configuration)
# configure  '--prefix=/tools/cluster/6.2/R/3.2.2_mkl' '--with-tcltk'
'--enable-R-shlib' '--enable-shared' '--with-pic'
'--disable-BLAS-shlib'
'--with-blas=-L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread
-lmkl_core  -Wl,--end-group -fopenmp  -ldl -lpthread -lm'
'--with-lapack'
'CFLAGS=-I/panfs/pfs.acf.ku.edu/cluster/system/pkg/R/curl7.45_install/include
-L/panfs/pfs.acf.ku.edu/cluster/6.2/R/3.2.2_mkl/lib64'
'JAVA_HOME=/tools/cluster/6.2/java/jdk1.8.0_66'

## This fails if it contains spaces, or if it is quoted
include $(R_SHARE_DIR)/make/vars.mk

AR = ar
## Used by packages 'maps' and 'mapdata'
AWK = gawk
BLAS_LIBS = -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64
-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread
-lmkl_core  -Wl,--end-group -fopenmp  -ldl -lpthread -lm
C_VISIBILITY = -fvisibility=hidden
...
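
For anyone who wants to check this on their own installation, here is a minimal Python sketch that pulls a variable such as BLAS_LIBS out of Makeconf-style text and tests whether it mentions MKL. The function name read_make_variable is invented for this example, and the sample text assumes BLAS_LIBS sits on a single line in the real file (the line breaks in the listing above look like email wrapping). Only plain NAME = value lines are handled.

```python
def read_make_variable(makeconf_text, name):
    """Return the value of a simple `NAME = value` assignment, or None.

    Only plain one-line assignments are handled; comment lines are skipped,
    and make features such as `+=` or backslash continuations are not parsed.
    """
    for line in makeconf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        key, _, value = stripped.partition("=")
        if key.strip() == name:
            return value.strip()
    return None


# Sample text based on the Makeconf excerpt above, with BLAS_LIBS
# joined onto one line (an assumption about the unwrapped file).
sample = """\
AR = ar
## Used by packages 'maps' and 'mapdata'
AWK = gawk
BLAS_LIBS = -L/panfs/pfs.acf.ku.edu/cluster/6.2/intel/2015/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core  -Wl,--end-group -fopenmp  -ldl -lpthread -lm
C_VISIBILITY = -fvisibility=hidden
"""

blas_libs = read_make_variable(sample, "BLAS_LIBS")
print("mkl" in blas_libs)
```

On a real system the text would be read from ${R_HOME}/etc/Makeconf, the location named in the file's own header. Running R CMD config BLAS_LIBS should report the same setting without any parsing.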



pj