Hi,
My package is called clustord (GitHub latest version at github.com/vuw-clustering/clustord; version 2.0.0 was pushed to CRAN yesterday, 2nd March 2026).
I have an odd problem: I added a function to my package, along with an extensive set of unit tests for it. The unit tests pass on Windows and Linux, but half of one of the three test files runs differently on macOS and fails the tests.
The package is a clustering package, and the new function is designed to be able to reorder the output clusters in order of their cluster effect sizes. The unit tests run the clustering algorithm and then the reorder function on a simulated dataset and then check the output orderings against what I've manually worked out the ordering should be.
The dataset simulation process uses randomness, and the clustering algorithm uses randomness, but the reordering does not. Each section of the test script starts with set.seed(), to ensure the dataset is always the same, and that seed should also fix the output of the clustering algorithm that runs just after the dataset simulation. The results should therefore be the same on all operating systems. This is why I'm so puzzled: almost all versions of this test behave the same on macOS as on Windows and Linux, but this particular version runs differently on macOS, even though I set the seed at the start of simulating the dataset for this specific test run.
Since I do not have a Mac, it is difficult for me to debug, though I can see the error when a push to GitHub triggers the GitHub Actions check, which runs on multiple OSs.
The start of the section of the test script that's failing is:
------------------------------
library(clustord)
## Dataset simulation
set.seed(30)
n <- 30
p <- 5
long_df_sim <- data.frame(Y=factor(sample(1:3, n*p, replace=TRUE)),
                          ROW=rep(1:n, times=p), COL=rep(1:p, each=n))
xr1 <- runif(n, min=0, max=2)
xr2 <- sample(c("A","B"),size=n, replace=TRUE, prob=c(0.3,0.7))
xr3 <- factor(sample(1:4, size=n, replace=TRUE))
xc1 <- runif(p, min=-1, max=1)
long_df_sim$xr1 <- rep(xr1, times=5)
long_df_sim$xr2 <- rep(xr2, times=5)
long_df_sim$xr3 <- rep(xr3, times=5)
long_df_sim$xc1 <- rep(xc1, each=30)
## Clustering algorithm
# OSM results --------------------------------------------------------------
## Model 1 ----
orig <- clustord(Y~ROWCLUST*xr1+xr2*xr3+COL, model="OSM", RG=4,
                 long_df=long_df_sim, nstarts=1, constraint_sum_zero=FALSE,
                 control_EM=list(maxiter=3, maxiter_start=2, keep_all_params=TRUE))
------------------------------
This section is just the dataset simulation and the clustering algorithm. The reordering checks afterwards are failing, but I think it's more likely that the clustering algorithm is somehow producing a different result on the Mac than that the reordering (which is deterministic) is.
If you display orig$out_parlist after running the above code, I expect the $rowc values to be
$rowc
rowc_1 rowc_2 rowc_3 rowc_4
0.00000000 0.08713465 -0.26123294 0.05820879
I will keep investigating this myself, but if anyone has any suggestions as to why the randomness might behave slightly differently on the Mac, or any other possible causes for occasional mismatches between macOS and other OSs, I would really appreciate reading them.
Thanks very much
Louise
[R-pkg-devel] Different unit test results on MacOS
5 messages · Louise McMillan, Hugh Parsonage, Simon Urbanek
Hi,
Quick follow-up: after editing the tests on a branch of the repository so that they print various values for debugging, I have observed that the simulated datasets generated in the tests are the same on all OSs, but the values that the algorithm has at the point where it starts the EM stage are not. As I am short on time for now, I am going to convert the unit tests of the reordering step so that they apply reordering to a pre-run, saved set of clustering output objects. In the long run, I need to diagnose where the differences between the starting points generated on the three OSs come from. So I have no need for further assistance right now, but thanks for reading.
Louise
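[A sketch of that fixture-based approach, using testthat conventions. The fixture path, the reordering function name, and the expected ordering below are all hypothetical stand-ins for illustration, not clustord's actual API:]

```r
## Generate-once script (run manually, NOT during R CMD check), so the
## OS-sensitive clustering step never runs inside the tests themselves:
# orig <- clustord(Y ~ ROWCLUST*xr1 + xr2*xr3 + COL, model = "OSM", RG = 4,
#                  long_df = long_df_sim, nstarts = 1)
# saveRDS(orig, "tests/testthat/fixtures/osm_model1.rds")

## In the test file, only the deterministic reordering step is exercised:
library(testthat)

test_that("reordering a saved OSM fit is stable across OSs", {
  orig <- readRDS(test_path("fixtures", "osm_model1.rds"))
  reordered <- reorder_clusters(orig)   # hypothetical function name
  ## expected order worked out by hand from the saved fixture (illustrative):
  expect_equal(names(reordered$out_parlist$rowc),
               c("rowc_1", "rowc_3", "rowc_4", "rowc_2"))
})
```

[Because the fixture is read from disk, the floating-point behaviour of the clustering step can no longer vary between OSs at test time.]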
From: Louise McMillan <louise.mcmillan at vuw.ac.nz>
Sent: Tuesday, 3 March 2026 11:36 am
To: r-package-devel at r-project.org <r-package-devel at r-project.org>
Subject: Different unit test results on MacOS
Smells like an arm64 or `parallel_starts` precision issue. Try

Sys.setenv(
    OMP_NUM_THREADS = "1",
    OPENBLAS_NUM_THREADS = "1",
    MKL_NUM_THREADS = "1",
    VECLIB_MAXIMUM_THREADS = "1"  # important on macOS
)

# and (for diagnostics, on a Mac machine)
options(digits = 21)
# or
sprintf("%.<whatever produces the most precision>f", <your values when tests fail>)

around the numbers you expect to be ordered identically. Though I can't see `parallel_starts` set to TRUE in your test suite; I only looked cursorily.
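[A concrete form of the full-precision printing suggested above; the helper name is made up, and `%.17g` prints enough significant digits to round-trip a double exactly:]

```r
## Print values at full double precision so that cross-OS diffs in the
## GitHub Actions logs show exactly where the numbers diverge.
print_full <- function(label, x) {
  message(label, ": ", paste(sprintf("%.17g", x), collapse = "  "))
}

## e.g. just before the expectation that fails on macOS:
# print_full("rowc", orig$out_parlist$rowc)
```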
On Tue, 3 Mar 2026 at 20:26, Louise McMillan <louise.mcmillan at vuw.ac.nz> wrote:
Louise,

TL;DR: this is not macOS-specific. Your test example is chaotic and thus will be influenced even by small changes in precision beyond what is guaranteed, i.e. your assumptions are not generally valid and thus the tests don't work.

The full story: floating point operations are deterministic, but may not yield the same results if varying precision is used. In most cases the default precision is double precision (53-bit significand), which is guaranteed to work on all CPUs, but R has the option to use extended precision (long double) for some operations (like accumulators in sums/means) if available, which can be anywhere from 53 to 113 bits. Common CPUs derived from the Intel FPU use a 64-bit significand, which is only slightly more than double precision, but enough to make a difference in some cases. The arm CPUs used by Macs only support double precision, so any operations that are otherwise performed with extended precision will be different. The differences will be very small, but your algorithm seems to be extremely sensitive to those. I think you're just testing the wrong thing (the output even says that it doesn't converge), since the result is chaotic (i.e. small changes have a huge impact), not something you want to test.

To check what precision your R has, have a look at .Machine$longdouble.digits, which will give you the precision of long doubles, or NULL if there is no long double support in that build of R (given your results I bet you are using an Intel-based CPU with 64-bit precision). You can check whether your code is overly sensitive on your machine by compiling R with --disable-long-double, which makes R use only double precision, and then running your code; it does produce very different results on your example.

Cheers,
Simon
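[Simon's long-double point can be checked directly from R. A small sketch: the values printed depend on the CPU and on how R was built, so none of the outputs here are guaranteed:]

```r
## Bits of significand in this build's long double type: typically 64 on
## x86_64 (Intel extended precision); NULL if R was built without long
## double support, as Simon describes for --disable-long-double builds.
.Machine$longdouble.digits

## sum() and mean() accumulate in long double where one is available, so
## the last bit or two of results like this can legitimately differ
## between an Intel machine and an arm64 Mac:
x <- rep(1/3, 1e6)
sprintf("%.17g", sum(x))

## A chaotic algorithm (e.g. an EM run cut off after a few iterations)
## amplifies those last-bit differences into visibly different parameter
## estimates, which is why exact cross-OS comparisons are unreliable.
```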
Hi Simon and Hugh,

That's extremely helpful, thank you for all the info. I have found a better solution for those tests in the meantime: they are meant to test the reordering of the clustering output, so the output doesn't need to have converged (hence all the warnings, because the clustering is only run briefly to produce some output). Your info will be very helpful for my further investigation into the variation between runs and between OSs. I agree that the algorithm is likely to be sensitive to small differences at the start, but when it runs in its proper mode to get good output, rather than just running briefly for test output, it chooses the best of many different starting points before working towards a good solution, so the variation at the start then has less of an impact.

Thanks,
Louise
From: Simon Urbanek <simon.urbanek at R-project.org>
Sent: Wednesday, 4 March 2026 11:01 am
To: Louise McMillan <louise.mcmillan at vuw.ac.nz>
Cc: r-package-devel at r-project.org <r-package-devel at r-project.org>
Subject: Re: [R-pkg-devel] Different unit test results on MacOS
______________________________________________
R-package-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel