Pearson Correlation Speed
5 messages · Nathan S. Watson-Haigh, Charles C. Berry
Nathan S. Watson-Haigh wrote:
I'm trying to calculate Pearson correlation coefficients for a large matrix of size 18563 x 18563. The following function takes about XX minutes to complete, and I'd like to do this calculation about 15 times, so speed is somewhat of an issue.
Sorry, meant to fill in the blanks! The following takes about 15 mins to complete: corr <- abs(cor(dat, use="p"))
Does anyone have any suggestions on ways to speed this up? I wondered if using C++ code to do the calculations might speed things up, but I've never written any C/C++ code or attempted to use any within R. I've seen some C++ code here: http://www.alglib.net/statistics/correlation.php I wondered if anyone might be able to help me get this so it can run in R?

I've tried the following:

1) downloaded and unzipped http://www.alglib.net/translator/dl/statistics.correlation.cpp.zip
2) moved the contents of the libs dir into the parent dir alongside correlation.cpp (didn't know how to tell R where to look for C libraries)
3) tried "R CMD SHLIB correlation.cpp" and got the following as output:

-- start output --
icpc -I/tools/R/2.7.1/lib/R/include -I/usr/local/include -mp -fpic -g -O2 -c correlation.cpp -o correlation.o
ap.h(163): warning #858: type qualifier on return type is meaningless
  const bool operator==(const complex& lhs, const complex& rhs);
ap.h(164): warning #858: type qualifier on return type is meaningless
  const bool operator!=(const complex& lhs, const complex& rhs);
ap.h(179): warning #858: type qualifier on return type is meaningless
  const double abscomplex(const complex &z);
icpc -shared -L/usr/local/lib -o correlation.so correlation.o
-- end output --

4) Now this doesn't look brilliant! Any thoughts?

Also, I'm assuming I need to do some other work with the C++ code in order to allow me to use it from within my R scripts - any pointers on that? Thanks for any input - I hope I just need a hand over the initial hurdles and then I can get onto that uphill learning curve!

Nathan
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--------------------------------------------------------
Dr. Nathan S. Watson-Haigh
OCE Post Doctoral Fellow
CSIRO Livestock Industries
Queensland Bioscience Precinct
St Lucia, QLD 4067, Australia
Tel: +61 (0)7 3214 2922
Fax: +61 (0)7 3214 2900
Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
--------------------------------------------------------
On Mon, 15 Dec 2008, Nathan S. Watson-Haigh wrote:
Nathan S. Watson-Haigh wrote:
I'm trying to calculate Pearson correlation coefficients for a large matrix of size 18563 x 18563. The following function takes about XX minutes to complete, and I'd like to do this calculation about 15 times, so speed is somewhat of an issue.
I think you are on the wrong track, Nathan.

The matrix you are starting with is 18563 x 18563, and the result of finding the correlations amongst the columns of that matrix is also 18563 x 18563. It will require more than 5 Gigabytes of memory to store the result and the original matrix. Likely the time needed to do the calc is inflated because of caching issues and, if your machine has less than enough memory to store the result and all the intermediate pieces, by swapping as well. You can finesse these by breaking your problem into smaller pieces, say computing the correlations between each pair of 19 blocks of columns (columns 1:977, 977+1:977, ..., 18*977+1:977), then assembling the results.

BTW, R already has the necessary machinery to calculate the crossproduct matrix (etc.) needed to find the correlations. You can access the low-level linear algebra that R uses, and you can marry R to an optimized BLAS if you like. So pulling in some other code to do this will not save you anything. If you ever do decide to import C[++] code, there is excellent documentation in the Writing R Extensions manual, which you should review first.

HTH,

Chuck
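[Editor's note: Chuck's blocking scheme might be sketched in R roughly as follows. The block size and the helper name block_cor are illustrative, and for brevity this assumes complete data, whereas Nathan's call used use="p" to handle NAs.]

```r
## Sketch of the block-wise approach: compute cor() on pairs of column
## blocks and assemble the pieces into the full correlation matrix,
## using symmetry to skip redundant blocks. Assumes no missing values.
block_cor <- function(dat, block = 977) {
  p <- ncol(dat)
  starts <- seq(1, p, by = block)
  res <- matrix(NA_real_, p, p)
  for (i in starts) {
    ii <- i:min(i + block - 1, p)
    for (j in starts) {
      if (j < i) next                      # lower triangle filled by symmetry
      jj <- j:min(j + block - 1, p)
      res[ii, jj] <- cor(dat[, ii], dat[, jj])
      res[jj, ii] <- t(res[ii, jj])
    }
  }
  res
}
```

On a small example this agrees with cor(dat) while never holding more than one block pair's worth of intermediate results in memory.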
[snip]
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
Charles C. Berry wrote:
On Mon, 15 Dec 2008, Nathan S. Watson-Haigh wrote:
Nathan S. Watson-Haigh wrote:
I'm trying to calculate Pearson correlation coefficients for a large matrix of size 18563 x 18563. The following function takes about XX minutes to complete, and I'd like to do this calculation about 15 times, so speed is somewhat of an issue.
I think you are on the wrong track, Nathan. The matrix you are starting with is 18563 x 18563 and the result of finding the correlations amongst the columns of that matrix is also 18563 x 18563. It will require more than 5 Gigabytes of memory to store the result and the original matrix.
Yes, the memory usage is somewhat large - luckily I have the use of a cluster with lots of shared memory! However, I'm interested to learn how you arrived at the calculation to determine the memory requirements.
Likely the time needed to do the calc is inflated because of caching issues and, if your machine has less than enough memory to store the result and all the intermediate pieces, by swapping as well. You can finesse these by breaking your problem into smaller pieces, say computing the correlations between each pair of 19 blocks of columns (columns 1:977, 977+1:977, ..., 18*977+1:977), then assembling the results.
This is possible; however, why is something like this not implemented internally in the cor() function if it scales poorly due to the large memory requirements?
--- BTW, R already has the necessary machinery to calculate the crossproduct matrix (etc.) needed to find the correlations. You can access the low-level linear algebra that R uses, and you can marry R to an optimized BLAS if you like. So pulling in some other code to do this will not save you anything. If you ever do decide to import C[++] code, there is excellent documentation in the Writing R Extensions manual, which you should review first.
Thanks, I have seen this, but it seemed quite technical as a starting point for someone unfamiliar with both C++ and incorporating C++ code into R.

Cheers,
Nathan
On Tue, 16 Dec 2008, Nathan S. Watson-Haigh wrote:
Charles C. Berry wrote:
On Mon, 15 Dec 2008, Nathan S. Watson-Haigh wrote:
Nathan S. Watson-Haigh wrote:
I'm trying to calculate Pearson correlation coefficients for a large matrix of size 18563 x 18563. The following function takes about XX minutes to complete, and I'd like to do this calculation about 15 times, so speed is somewhat of an issue.
I think you are on the wrong track, Nathan. The matrix you are starting with is 18563 x 18563 and the result of finding the correlations amongst the columns of that matrix is also 18563 x 18563. It will require more than 5 Gigabytes of memory to store the result and the original matrix.
Yes, the memory usage is somewhat large - luckily I have the use of a cluster with lots of shared memory! However, I'm interested to learn how you arrived at the calculation to determine the memory requirements.
The original object is

> 18563^2 * 8 / 1024^3
[1] 2.567358

Gigabytes, and so is the result. I added them together.
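[Editor's note: the back-of-envelope arithmetic generalises to a one-liner; the helper name mem_gb is illustrative.]

```r
## Memory needed to hold an nrow x ncol matrix of doubles (8 bytes each), in GB
mem_gb <- function(nrow, ncol) nrow * ncol * 8 / 1024^3

mem_gb(18563, 18563)   # ~2.57 GB each for the input matrix and the result
```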
Likely the time needed to do the calc is inflated because of caching issues and, if your machine has less than enough memory to store the result and all the intermediate pieces, by swapping as well. You can finesse these by breaking your problem into smaller pieces, say computing the correlations between each pair of 19 blocks of columns (columns 1:977, 977+1:977, ..., 18*977+1:977), then assembling the results.
This is possible; however, why is something like this not implemented internally in the cor() function if it scales poorly due to the large memory requirements?
Because nobody ever really needed it? Seriously, optimizing something like this is machine dependent, and R-core probably has higher priorities. cor() provides lots of options - it handles NAs, for example - and it is probably not worth the trouble to try to optimize over those options. The calculation sans NAs is a simple one and can be done using the built-in BLAS (as crossprod() does), which can in turn be tuned to the machine used. So, if your environment has a tuned or multithreaded BLAS, you might be better off to use crossprod() and scale the result.
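[Editor's note: the crossprod() suggestion can be sketched like this; the helper name fast_cor is illustrative, and it assumes complete data, i.e. no NAs.]

```r
## Correlation via the BLAS-backed crossprod(): for complete data, cor(dat)
## equals crossprod() of the centred, scaled columns divided by (nrow - 1).
## With a tuned or multithreaded BLAS this is typically much faster than
## cor() on large matrices.
fast_cor <- function(dat) {
  z <- scale(dat)                  # centre each column and divide by its sd
  crossprod(z) / (nrow(dat) - 1)   # t(z) %*% z, computed by the BLAS
}
```

On a small test matrix this agrees with cor(dat) up to numerical precision.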
--- BTW, R already has the necessary machinery to calculate the crossproduct matrix (etc.) needed to find the correlations. You can access the low-level linear algebra that R uses, and you can marry R to an optimized BLAS if you like. So pulling in some other code to do this will not save you anything. If you ever do decide to import C[++] code, there is excellent documentation in the Writing R Extensions manual, which you should review first.
Thanks, I have seen this, but it seemed quite technical as a starting point for someone unfamiliar with both C++ and incorporating C++ code into R.
Well, in that case the path of least resistance is to start the process
when you leave for the night and pick up the results the next morning.
HTH,
Chuck