I am a newbie to C (although I did some programming in Perl and Pascal) so forgive me for the language that follows. I am writing because I have a question concerning the *implementation of *(the internal function)* cwilcox* in https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c. This function is used to determine the test statistics of the Wilcoxon-Mann-Whitney U-test. _______________________________________________________________________________________________________________________________________ The function `cwilcox` has three inputs: k, m and n. To establish the test statistics `cwilcox` determines the number of possible sums of natural numbers with the following restrictions: - the sum of the elements must be k, - the elements (summands) must be in a decreasing (or increasing) order, - every element must be less then m and - there must be at most n summands. The problem with the evaluation of this function `cwilcox(k,m,n)` is that it requires a recursion where in fact a three-dimensional array has to be evaluated (see the code around line 157). This requires a lot of memory and takes time and seems still an issue even with newer machines, see the warning in the documentation https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test . In an old paper I have written a formula where the evaluation can be done in a one-dimensional array that uses way less memory and is much faster. The paper itself is in German (you find a copy here: https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf), so I uploaded a translation into English (to be found in https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf). I also looked at the code in R and wrote a new version of the code that uses my formulae so that a faster implementation is possible (code with comments is below). I have several questions regarding the code: 1. A lot of commands in the original R code are used to handle the memory. I do not know how to do this so I skipped memory handling completely and simply allocated space to an array (see below int cwilcox_distr[(m*n+1)/2];). Since my array is 1-dimensional instead of 3-dimensional I think as a first guess that will be ok. 2. I read the documentation when and why code in R should be changed. I am not familiar enough with R to understand how this applies here. My code uses less memory - is that enough? Or should another function be defined that uses my code? Or shall it be disregarded? 3. I was not able to file my ideas in bugzilla.r-project or the like and I hope that this mailing list is a good address for my question. I also have written an example of a main function where the original code from R is compared to my code. I do not attach this example because this email is already very long but I can provide that example if needed. Maybe we can discuss this further. Best wishes, Andreas L?ffler. ``` #include <stdio.h> #include <stdlib.h> /* The traditional approch to determine the Mann-Whitney statistics uses a recursion formular for partitions of natural numbers, namely in the line w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1); (taken from https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c). This approach requires a huge number of partitions to be evaluated because the second variable (j in the left term) and the third variable (k in the left term) in this recursion are not constant but change as well. Hence, a three dimensional array is evaluated. In an old paper a recursion equations was shown that avoids this disadvantage. The recursion equation of that paper uses only an array where the second as well as the third variable remain constant. This implies faster evaluation and less memory used. The original paper is in German and can be found in https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf and the author has uploaded a translation into English in https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf. This function uses this approach. */ static int cwilcox_sigma(int k, int m, int n) { /* this relates to the sigma function below */ int s, d; s=0; for (d = 1; d <= m; d++) { if (k%d == 0) { s=s+d; } } for (d = n+1; d <= n+m; d++) { if (k%d == 0) { s=s-d; } } return s; } /* this can replace cwilcox. It runs faster and uses way less memory */ static double cwilcox2(int k, int m, int n){ int cwilcox_distr[(m*n+1)/2]; /* will store (one half of the) distribution */ int s, i, kk; if (2*k>m*n){ k=m*n-k; /* permutation function is symmetric */ } for (kk=0; 2*kk<=m*n+1; kk++){ if (kk==0){ cwilcox_distr[0]=1; /* by definition 0 has only 1 partition */ } else { s=0; for (i = 0; i<kk; i++){ s=s+cwilcox_distr[i]*cwilcox_sigma(kk-i,m,n); /* recursion formula */ } cwilcox_distr[kk]=s/kk; } } return (double) cwilcox_distr[k]; }
cwilcox - new version
7 messages · Andreas Löffler, Lakshman, Aidan H, Aidan Lakshman +1 more
Hi Andreas, These are the kinds of suggested improvements that get me really excited. I?m not a part of R-Core, but I am happy to help out as much as I can with optimizing your C code to submit this.
I was not able to file my ideas in bugzilla.r-project or the like and I hope that this mailing list is a good address for my question.
You should send an email to bug-report-request at r-project.org to get an account. See this link for details: https://www.r-project.org/bugs.html#:~:text=In%20order%20to%20get%20a,the%20report%20isn't%20extraneous.
I read the documentation when and why code in R should be changed. I am not familiar enough with R to understand how this applies here. My code uses less memory - is that enough? Or should another function be defined that uses my code? Or shall it be disregarded?
It depends a bit on the submission and R-Core, but the first step for something like this is to create a bug report on Bugzilla along with a reproducible example of a situation where your code outperforms the current version. From there, there are typically some discussions around the code itself, leading to a decision on whether or not to incorporate it. I threw together a quick patch file based on your proposed changes, which I?ve attached here if others would like to look at it and test. I have a feeling the loops in `cwilcox_sigma` can be factored out. The patch file is far from perfect; a final patch would remove some internal calls made obsolete by this method, like `w_free`, `w_init_maybe`, and `wilcox_free`. The current patch just makes `wilcox_free` a noop and comments out calls to the other functions. In my builds, `pwilcox` / `dwilcox` run much slower and `wilcox.test` runs about equally fast (as the current implementation). I haven?t done any memory benchmarking yet. I think that your version should be faster, so I?m not quite sure where the slowdown is coming from?maybe someone else can weigh in with thoughts. I?ll have more time to work on optimization and tracing down the differences in runtime tomorrow. I do have to say, I like this implementation better. The code is significantly cleaner (in my opinion), and it doesn?t require use of a static global variable with `on.exit` hooks. If you can find a reproducible example where the current R build crashes/breaks and your version does not, that would be extremely helpful. Some test cases where your version outperforms the existing one would also be helpful. I?m happy to discuss more and help out with this, including setting up a Bugzilla report if you?d like. There are a lot of people that know more about this than I; hoping to hear their thoughts on this as well. -Aidan ----------------------- Aidan Lakshman (he/him) PhD Candidate, Wright Lab University of Pittsburgh School of Medicine Department of Biomedical Informatics www.AHL27.com ahl27 at pitt.edu | (724) 612-9940 From: R-devel <r-devel-bounces at r-project.org> on behalf of Andreas L?ffler <andreas.loeffler at gmail.com> Date: Monday, January 15, 2024 at 13:52 To: r-devel at r-project.org <r-devel at r-project.org> Subject: [Rd] cwilcox - new version I am a newbie to C (although I did some programming in Perl and Pascal) so forgive me for the language that follows. I am writing because I have a question concerning the *implementation of *(the internal function)* cwilcox* in https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsvn.r-project.org%2FR%2F!svn%2Fbc%2F81274%2Fbranches%2FR-DL%2Fsrc%2Fnmath%2Fwilcox.c&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285927560%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=Yu9u36OOyZWSVjA6%2BTKlYSTLh0VjMzOZKuQN48Ml3S0%3D&reserved=0<https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c>. This function is used to determine the test statistics of the Wilcoxon-Mann-Whitney U-test. _______________________________________________________________________________________________________________________________________ The function `cwilcox` has three inputs: k, m and n. To establish the test statistics `cwilcox` determines the number of possible sums of natural numbers with the following restrictions: - the sum of the elements must be k, - the elements (summands) must be in a decreasing (or increasing) order, - every element must be less then m and - there must be at most n summands. The problem with the evaluation of this function `cwilcox(k,m,n)` is that it requires a recursion where in fact a three-dimensional array has to be evaluated (see the code around line 157). This requires a lot of memory and takes time and seems still an issue even with newer machines, see the warning in the documentation https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.rdocumentation.org%2Fpackages%2Fstats%2Fversions%2F3.6.2%2Ftopics%2Fwilcox.test&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285935895%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=GAkMVhP1tqf%2FezoLdrTKnU7KoysWzCpsDkpCdxCp5aQ%3D&reserved=0<https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test> . In an old paper I have written a formula where the evaluation can be done in a one-dimensional array that uses way less memory and is much faster. The paper itself is in German (you find a copy here: https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Ff%2Ff5%2FLoefflerWilcoxonMannWhitneyTest.pdf&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285940564%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=37lZBA9tQ49Gk8NKsR6OidF3U6nk%2Bv3LVbt21IwGSYk%3D&reserved=0)<https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf>, so I uploaded a translation into English (to be found in https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fde%2F1%2F19%2FMannWhitney_151102.pdf&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285945105%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=MZb7dgk%2Bbt38HuEqz3G9M9UA0dsmjQHLgcDl01RAXsg%3D&reserved=0)<https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf>. I also looked at the code in R and wrote a new version of the code that uses my formulae so that a faster implementation is possible (code with comments is below). I have several questions regarding the code: 1. A lot of commands in the original R code are used to handle the memory. I do not know how to do this so I skipped memory handling completely and simply allocated space to an array (see below int cwilcox_distr[(m*n+1)/2];). Since my array is 1-dimensional instead of 3-dimensional I think as a first guess that will be ok. 2. I read the documentation when and why code in R should be changed. I am not familiar enough with R to understand how this applies here. My code uses less memory - is that enough? Or should another function be defined that uses my code? Or shall it be disregarded? 3. I was not able to file my ideas in bugzilla.r-project or the like and I hope that this mailing list is a good address for my question. I also have written an example of a main function where the original code from R is compared to my code. I do not attach this example because this email is already very long but I can provide that example if needed. Maybe we can discuss this further. Best wishes, Andreas L?ffler. ``` #include <stdio.h> #include <stdlib.h> /* The traditional approch to determine the Mann-Whitney statistics uses a recursion formular for partitions of natural numbers, namely in the line w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1); (taken from https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsvn.r-project.org%2FR%2F!svn%2Fbc%2F81274%2Fbranches%2FR-DL%2Fsrc%2Fnmath%2Fwilcox.c&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285949688%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=pPnCBuSzFfmE0%2FH3nmRmtO9qLLBiHSCcG2cNoI2scw4%3D&reserved=0)<https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c>. This approach requires a huge number of partitions to be evaluated because the second variable (j in the left term) and the third variable (k in the left term) in this recursion are not constant but change as well. Hence, a three dimensional array is evaluated. In an old paper a recursion equations was shown that avoids this disadvantage. The recursion equation of that paper uses only an array where the second as well as the third variable remain constant. This implies faster evaluation and less memory used. The original paper is in German and can be found in https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Ff%2Ff5%2FLoefflerWilcoxonMannWhitneyTest.pdf&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285954898%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=u7%2FH8geUfv0nwqspn%2FQ0DYGBbmGr9bA180B8vWxKIGc%3D&reserved=0<https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf> and the author has uploaded a translation into English in https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fde%2F1%2F19%2FMannWhitney_151102.pdf&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285959440%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=2nhVTotogBjd%2Fa2qaJD0pJLPB%2FE9FLy2d1fXYyz0BI0%3D&reserved=0<https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf>. This function uses this approach. */ static int cwilcox_sigma(int k, int m, int n) { /* this relates to the sigma function below */ int s, d; s=0; for (d = 1; d <= m; d++) { if (k%d == 0) { s=s+d; } } for (d = n+1; d <= n+m; d++) { if (k%d == 0) { s=s-d; } } return s; } /* this can replace cwilcox. It runs faster and uses way less memory */ static double cwilcox2(int k, int m, int n){ int cwilcox_distr[(m*n+1)/2]; /* will store (one half of the) distribution */ int s, i, kk; if (2*k>m*n){ k=m*n-k; /* permutation function is symmetric */ } for (kk=0; 2*kk<=m*n+1; kk++){ if (kk==0){ cwilcox_distr[0]=1; /* by definition 0 has only 1 partition */ } else { s=0; for (i = 0; i<kk; i++){ s=s+cwilcox_distr[i]*cwilcox_sigma(kk-i,m,n); /* recursion formula */ } cwilcox_distr[kk]=s/kk; } } return (double) cwilcox_distr[k]; } ______________________________________________ R-devel at r-project.org mailing list https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-devel&data=05%7C02%7Cahl27%40pitt.edu%7C2d55e3294f1047e1c0e808dc15fa3cff%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638409415285963952%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C62000%7C%7C%7C&sdata=Xl2%2FW6Th506HlCklz2cmUWh5EHJ7E9%2BBcIgHwUVFpFg%3D&reserved=0<https://stat.ethz.ch/mailman/listinfo/r-devel>
New email client, hoping that Outlook won?t continue to HTMLify my emails and mess up my attachments without my consent :/ I?m re-attaching the patch file here, compressed as a .gz in case that was the issue with the previous attachment. Thanks to Ivan for pointing out that my attachment didn?t go through correctly on the first try. I?ll be working on seeing if I can optimize this to actually improve performance during the day today. -Aidan
On 15 Jan 2024, at 5:51, Andreas L?ffler wrote:
I am a newbie to C (although I did some programming in Perl and Pascal) so forgive me for the language that follows. I am writing because I have a question concerning the *implementation of *(the internal function)* cwilcox* in https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c. This function is used to determine the test statistics of the Wilcoxon-Mann-Whitney U-test.
_______________________________________________________________________________________________________________________________________ The function `cwilcox` has three inputs: k, m and n. To establish the test statistics `cwilcox` determines the number of possible sums of natural numbers with the following restrictions: - the sum of the elements must be k, - the elements (summands) must be in a decreasing (or increasing) order, - every element must be less then m and - there must be at most n summands. The problem with the evaluation of this function `cwilcox(k,m,n)` is that it requires a recursion where in fact a three-dimensional array has to be evaluated (see the code around line 157). This requires a lot of memory and takes time and seems still an issue even with newer machines, see the warning in the documentation https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test . In an old paper I have written a formula where the evaluation can be done in a one-dimensional array that uses way less memory and is much faster. The paper itself is in German (you find a copy here: https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf), so I uploaded a translation into English (to be found in https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf). I also looked at the code in R and wrote a new version of the code that uses my formulae so that a faster implementation is possible (code with comments is below). I have several questions regarding the code: 1. A lot of commands in the original R code are used to handle the memory. I do not know how to do this so I skipped memory handling completely and simply allocated space to an array (see below int cwilcox_distr[(m*n+1)/2];). Since my array is 1-dimensional instead of 3-dimensional I think as a first guess that will be ok. 2. I read the documentation when and why code in R should be changed. I am not familiar enough with R to understand how this applies here. My code uses less memory - is that enough? Or should another function be defined that uses my code? Or shall it be disregarded? 3. I was not able to file my ideas in bugzilla.r-project or the like and I hope that this mailing list is a good address for my question. I also have written an example of a main function where the original code from R is compared to my code. I do not attach this example because this email is already very long but I can provide that example if needed. Maybe we can discuss this further. Best wishes, Andreas L?ffler. ``` #include <stdio.h> #include <stdlib.h> /* The traditional approch to determine the Mann-Whitney statistics uses a recursion formular for partitions of natural numbers, namely in the line w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1); (taken from https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c). This approach requires a huge number of partitions to be evaluated because the second variable (j in the left term) and the third variable (k in the left term) in this recursion are not constant but change as well. Hence, a three dimensional array is evaluated. In an old paper a recursion equations was shown that avoids this disadvantage. The recursion equation of that paper uses only an array where the second as well as the third variable remain constant. This implies faster evaluation and less memory used. The original paper is in German and can be found in https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf and the author has uploaded a translation into English in https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf. This function uses this approach. */ static int cwilcox_sigma(int k, int m, int n) { /* this relates to the sigma function below */ int s, d; s=0; for (d = 1; d <= m; d++) { if (k%d == 0) { s=s+d; } } for (d = n+1; d <= n+m; d++) { if (k%d == 0) { s=s-d; } } return s; } /* this can replace cwilcox. It runs faster and uses way less memory */ static double cwilcox2(int k, int m, int n){ int cwilcox_distr[(m*n+1)/2]; /* will store (one half of the) distribution */ int s, i, kk; if (2*k>m*n){ k=m*n-k; /* permutation function is symmetric */ } for (kk=0; 2*kk<=m*n+1; kk++){ if (kk==0){ cwilcox_distr[0]=1; /* by definition 0 has only 1 partition */ } else { s=0; for (i = 0; i<kk; i++){ s=s+cwilcox_distr[i]*cwilcox_sigma(kk-i,m,n); /* recursion formula */ } cwilcox_distr[kk]=s/kk; } } return (double) cwilcox_distr[k]; } [[alternative HTML version deleted]] ______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-------------- next part -------------- A non-text attachment was scrubbed... Name: wilcox_patch_draft.gz Type: application/x-gzip Size: 1707 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20240116/b2822fbd/attachment.bin>
I?ve been looking at this for a couple hours--it ended up being trickier than I expected to implement well.
I?ve attached a new patch here. This version scales significantly better than the existing method for both `pwilcox` and `dwilcox`. Examples are included below.
I can?t think of any other ways to improve runtime at this point. That?s not to say there aren?t any, though?I?m hopeful inspection by someone else could identify more ways to save some time.
This version uses Andreas?s algorithm for computing the value of `cwilcox`, but evaluates it lazily and caches results. Memory usage of this should be orders of magnitude less than the current version, and recursion is eliminated. As far as I can tell, the numerical accuracy of the results is unchanged.
Performance statistics are interesting. If we assume the two populations have a total of `m` members, then this implementation runs slightly slower for m < 20, and much slower for 50 < m < 100. However, this implementation works significantly *faster* for m > 200. The breakpoint is precisely when each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8 microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5 milliseconds. The new version runs in roughly 1 millisecond for both. This is probably because of internal logic that requires many more `free/calloc` calls if either population is larger than `WILCOX_MAX`, which is set to 50.
I?m hopeful that this can be optimized to be suitable for inclusion in R. Lower performance for population sizes below 50 is not ideal, since `wilcox.test` switches to non-exact testing for population sizes above 50.
-Aidan
Benchmarking results on my machine using `microbenchmark`:
(median runtimes over 100 iterations, us=microseconds, ms=milliseconds, s=seconds)
`qwilcox(0.5, n, n)`
n: 10 25 50 100 200
old version: 1.2us 2.9us 9.0us 87.4ms 2.3s
Andreas version: 2.7us 68.6us 1.1ms 16.9ms 264.3ms
`dwilcox((n*n)%/%2, n, n)`
n: 10 25 50 100 200
old version: 1.4us 0.9us 0.9us 43.2ms 851.6ms
Andreas version: 2.3us 53.9us 1.0ms 16.4ms 261.7ms
`pwilcox(1:100, 10, 10)`:
old version: 62.9us
Andreas version: 22.9us
-----------------------
Aidan Lakshman (he/him)
PhD Candidate, Wright Lab
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
http://www.ahl27.com/
ahl27 at pitt.edu | (724) 612-9940
On 16 Jan 2024, at 9:47, Aidan Lakshman wrote:
New email client, hoping that Outlook won?t continue to HTMLify my emails and mess up my attachments without my consent :/ I?m re-attaching the patch file here, compressed as a .gz in case that was the issue with the previous attachment. Thanks to Ivan for pointing out that my attachment didn?t go through correctly on the first try. I?ll be working on seeing if I can optimize this to actually improve performance during the day today. -Aidan On 15 Jan 2024, at 5:51, Andreas L?ffler wrote:
I am a newbie to C (although I did some programming in Perl and Pascal) so forgive me for the language that follows. I am writing because I have a question concerning the *implementation of *(the internal function)* cwilcox* in https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c. This function is used to determine the test statistics of the Wilcoxon-Mann-Whitney U-test.
_______________________________________________________________________________________________________________________________________ The function `cwilcox` has three inputs: k, m and n. To establish the test statistics `cwilcox` determines the number of possible sums of natural numbers with the following restrictions: - the sum of the elements must be k, - the elements (summands) must be in a decreasing (or increasing) order, - every element must be less then m and - there must be at most n summands. The problem with the evaluation of this function `cwilcox(k,m,n)` is that it requires a recursion where in fact a three-dimensional array has to be evaluated (see the code around line 157). This requires a lot of memory and takes time and seems still an issue even with newer machines, see the warning in the documentation https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test . In an old paper I have written a formula where the evaluation can be done in a one-dimensional array that uses way less memory and is much faster. The paper itself is in German (you find a copy here: https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf), so I uploaded a translation into English (to be found in https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf). I also looked at the code in R and wrote a new version of the code that uses my formulae so that a faster implementation is possible (code with comments is below). I have several questions regarding the code: 1. A lot of commands in the original R code are used to handle the memory. I do not know how to do this so I skipped memory handling completely and simply allocated space to an array (see below int cwilcox_distr[(m*n+1)/2];). Since my array is 1-dimensional instead of 3-dimensional I think as a first guess that will be ok. 2. I read the documentation when and why code in R should be changed. I am not familiar enough with R to understand how this applies here. My code uses less memory - is that enough? Or should another function be defined that uses my code? Or shall it be disregarded? 3. I was not able to file my ideas in bugzilla.r-project or the like and I hope that this mailing list is a good address for my question. I also have written an example of a main function where the original code from R is compared to my code. I do not attach this example because this email is already very long but I can provide that example if needed. Maybe we can discuss this further. Best wishes, Andreas L?ffler. ``` #include <stdio.h> #include <stdlib.h> /* The traditional approch to determine the Mann-Whitney statistics uses a recursion formular for partitions of natural numbers, namely in the line w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1); (taken from https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c). This approach requires a huge number of partitions to be evaluated because the second variable (j in the left term) and the third variable (k in the left term) in this recursion are not constant but change as well. Hence, a three dimensional array is evaluated. In an old paper a recursion equations was shown that avoids this disadvantage. The recursion equation of that paper uses only an array where the second as well as the third variable remain constant. This implies faster evaluation and less memory used. The original paper is in German and can be found in https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf and the author has uploaded a translation into English in https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf. This function uses this approach. */ static int cwilcox_sigma(int k, int m, int n) { /* this relates to the sigma function below */ int s, d; s=0; for (d = 1; d <= m; d++) { if (k%d == 0) { s=s+d; } } for (d = n+1; d <= n+m; d++) { if (k%d == 0) { s=s-d; } } return s; } /* this can replace cwilcox. It runs faster and uses way less memory */ static double cwilcox2(int k, int m, int n){ int cwilcox_distr[(m*n+1)/2]; /* will store (one half of the) distribution */ int s, i, kk; if (2*k>m*n){ k=m*n-k; /* permutation function is symmetric */ } for (kk=0; 2*kk<=m*n+1; kk++){ if (kk==0){ cwilcox_distr[0]=1; /* by definition 0 has only 1 partition */ } else { s=0; for (i = 0; i<kk; i++){ s=s+cwilcox_distr[i]*cwilcox_sigma(kk-i,m,n); /* recursion formula */ } cwilcox_distr[kk]=s/kk; } } return (double) cwilcox_distr[k]; } [[alternative HTML version deleted]] ______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-------------- next part -------------- A non-text attachment was scrubbed... Name: wilcox_patch_draft_v2.gz Type: application/x-gzip Size: 2652 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20240116/51c0bd7c/attachment.bin>
Performance statistics are interesting. If we assume the two populations have a total of `m` members, then this implementation runs slightly slower for m < 20, and much slower for 50 < m < 100. However, this implementation works significantly *faster* for m > 200. The breakpoint is precisely when each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8 microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5 milliseconds. The new version runs in roughly 1 millisecond for both. This is probably because of internal logic that requires many more `free/calloc` calls if either population is larger than `WILCOX_MAX`, which is set to 50.
Also because cwilcox_sigma has to be evaluated, and this is slightly more demanding since it uses k%d. There is a tradeoff here between memory usage and time of execution. I am not a heavy user of the U test but I think the typical use case does not involve several hundreds of tests in a session so execution time (my 2 cents) is less important. But if R crashes one execution is already problematic. But the takeaway is probably: we should implement both approaches in the code and leave it to the user which one she prefers. If time is important and memory not an issue and if m, n are low go for the "traditional approach". Otherwise, use my formula? PS (@Aidan): I have applied for an bugzilla account two days ago and heard not back from them. Also Spam is empty. Is that ok or shall I do something?
Hi All, Figured I'd put my two cents in here as the welch-lab's LIGER package currently uses mann-whitney on datasets much larger than m = 200. Our current version uses a modified PRESTO (https://github.com/immunogenomics/presto) implementation over the inbuilt tests because of the lack of scaling. I stumbled into this thread while working on some improvements for it and would like to make it known that there is absolutely an audience for the high-member use-case. Best, -Andrew Robbins
On 1/17/2024 5:55 AM, Andreas L?ffler wrote:
Performance statistics are interesting. If we assume the two populations have a total of `m` members, then this implementation runs slightly slower for m < 20, and much slower for 50 < m < 100. However, this implementation works significantly *faster* for m > 200. The breakpoint is precisely when each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8 microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5 milliseconds. The new version runs in roughly 1 millisecond for both. This is probably because of internal logic that requires many more `free/calloc` calls if either population is larger than `WILCOX_MAX`, which is set to 50.
Also because cwilcox_sigma has to be evaluated, and this is slightly more demanding since it uses k%d. There is a tradeoff here between memory usage and time of execution. I am not a heavy user of the U test but I think the typical use case does not involve several hundreds of tests in a session so execution time (my 2 cents) is less important. But if R crashes one execution is already problematic. But the takeaway is probably: we should implement both approaches in the code and leave it to the user which one she prefers. If time is important and memory not an issue and if m, n are low go for the "traditional approach". Otherwise, use my formula? PS (@Aidan): I have applied for an bugzilla account two days ago and heard not back from them. Also Spam is empty. Is that ok or shall I do something? [[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Andrew Robbins Systems Analyst, Welch Lab<https://welch-lab.github.io> University of Michigan Department of Computational Medicine and Bioinformatics -------------- next part -------------- A non-text attachment was scrubbed... Name: OpenPGP_signature.asc Type: application/pgp-signature Size: 840 bytes Desc: OpenPGP digital signature URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20240117/c18b12c1/attachment.sig>
Hi everyone, I?ve opened a Bugzilla report for Andreas with the most recent implementation here: https://bugs.r-project.org/show_bug.cgi?id=18655. Feedback would be greatly appreciated. The most straight forward approach is likely to implement both methods and determine which to use based on population sizes. The cutoff at n=50 is very sharp; it would be a large improvement to just call Andreas?s algorithm when either population is larger than 50, and use the current method otherwise. For the Bugzilla report I?ve only submitted the new version for benchmarking purposes. I think that if there is a way to improve this algorithm such that it matches the performance of the current version for population sizes under 50, then it would be significantly cleaner than to have two algorithms with an internal switch. As for remaining performance improvements: 1. cwilcox_sigma is definitely a performance loss. It would improve performance to instead just loop from 1 to min(m, sqrt(k)) and from n+1 to min(m+n, sqrt(k)), since the formula just finds potential factors of k. Maybe there are other ways to improve this, but I think factorization is a notoriously intensive problem, so that further optimzation may be intractable. 2. Calculation of the distribution values has quadratic scaling. Maybe there?s a way to optimize that further? See lines 91-103 in the most recent version. Regardless of runtime, memory is certainly improved. For calculation on population sizes m,n, the current version has memory complexity O((mn)^2), whereas Andreas?s version has complexity O(mn). Running `qwilcox(0.5,500,500)` crashes my R session with the old version, but runs successfully in about 10s with the new version. I?ve written up all the information so far on the Bugzilla report, and I?m sure Andreas will add more information if necessary when his account is approved. Thanks again to Andreas for introducing this algorithm?I?m hopeful that this is able to improve performance of the wilcox functions. -Aidan ----------------------- Aidan Lakshman (he/him) PhD Candidate, Wright Lab University of Pittsburgh School of Medicine Department of Biomedical Informatics www.AHL27.com ahl27 at pitt.edu | (724) 612-9940
On 17 Jan 2024, at 5:55, Andreas L?ffler wrote:
Performance statistics are interesting. If we assume the two populations have a total of `m` members, then this implementation runs slightly slower for m < 20, and much slower for 50 < m < 100. However, this implementation works significantly *faster* for m > 200. The breakpoint is precisely when each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8 microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5 milliseconds. The new version runs in roughly 1 millisecond for both. This is probably because of internal logic that requires many more `free/calloc` calls if either population is larger than `WILCOX_MAX`, which is set to 50.
Also because cwilcox_sigma has to be evaluated, and this is slightly more demanding since it uses k%d. There is a tradeoff here between memory usage and time of execution. I am not a heavy user of the U test but I think the typical use case does not involve several hundreds of tests in a session so execution time (my 2 cents) is less important. But if R crashes one execution is already problematic. But the takeaway is probably: we should implement both approaches in the code and leave it to the user which one she prefers. If time is important and memory not an issue and if m, n are low go for the "traditional approach". Otherwise, use my formula? PS (@Aidan): I have applied for an bugzilla account two days ago and heard not back from them. Also Spam is empty. Is that ok or shall I do something?