[R-pkg-devel] Windows R 4.2.0 package will not load with UTF-8 encoding

8 messages · Ivan Krylov, Hiroaki Yutani, Duncan Murdoch +1 more

#
Dear R package developers,

Starting with R 4.2.0 package rEDM (https://cran.r-project.org/package=rEDM)
will not load [library( rEDM )] on Windows with the default UTF-8 encoding.

When the locale is changed from UTF-8 to non-UTF-8, the package loads and
runs. One can also change the locale to non-UTF-8, load the package, detach
and unload the package, change the locale back to UTF-8, then load and run
without issue.

Note that installation from source reports:
   ** testing if installed package can be loaded from temporary location
and completes (record below).

This package uses Rcpp to wrap a C++ API.

Having searched here and in general, I don't find others experiencing
this issue.

I have tried:
  Ensuring all source files are UTF-8 encoded
  Removing non-ASCII characters from all source files
  Specifying non-ASCII characters with \uXXXX
  Checking vignette encoding
  Adding "Encoding: UTF-8" to DESCRIPTION

Please excuse my encoding and Windows naivety.

Here is a demonstration changing the encoding to load the package, along
with unloading & reloading under UTF-8:
--
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United
States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C

[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.2.0
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Warning message:
In Sys.setlocale("LC_ALL", "English") :
  using locale code page other than 65001 ("UTF-8") may cause problems
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C

[5] LC_TIME=English_United States.1252
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.2.0
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C

[5] LC_TIME=English_United States.1252
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rEDM_1.12.2.1.0

loaded via a namespace (and not attached):
[1] compiler_4.2.0 Rcpp_1.0.8.3
### All package tests pass....
### Now detach and unload, change to UTF-8, and load
500", pred = "501 505", E = 5 )
Error in Simplex(dataFrame = Lorenz5D, columns = "V1", target = "V2",  :
  could not find function "Simplex"
[1] "LC_COLLATE=English_United States.utf8;LC_CTYPE=English_United
States.utf8;LC_MONETARY=English_United
States.utf8;LC_NUMERIC=C;LC_TIME=English_United States.utf8"
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United
States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C

[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] devtools_2.4.3 usethis_2.1.6

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3      magrittr_2.0.3    pkgload_1.2.4     R6_2.5.1
 rlang_1.0.2       fastmap_1.1.0
 [7] fansi_1.0.3       tools_4.2.0       pkgbuild_1.3.1
 sessioninfo_1.2.2 utf8_1.2.2        cli_3.3.0
[13] withr_2.5.0       ellipsis_0.3.2    remotes_2.4.2     rprojroot_2.0.3
  tibble_3.1.7      lifecycle_1.0.1
[19] crayon_1.5.1      brio_1.1.3        processx_3.6.0    purrr_0.3.4
  callr_3.7.0       vctrs_0.4.1
[25] fs_1.5.2          ps_1.7.0          testthat_3.1.4    memoise_2.0.1
  glue_1.6.2        cachem_1.0.6
[31] pillar_1.7.0      compiler_4.2.0    desc_1.4.1
 prettyunits_1.1.1 pkgconfig_2.0.3
### All tests pass
#
On 11/06/2022 5:02 a.m., Joseph Park wrote:
I don't see any attempt to load the package.  You attempted to use the 
function Simplex and it was not found.  That indicates the package is 
not loaded, but not why.

What you should show are the messages you get when you start a clean 
copy of R and immediately attempt to load the package using library(). 
It's helpful that you posted sessionInfo(); I'd include that again with 
the new information, in case anything is different.

Duncan Murdoch
#
On Sat, 11 Jun 2022 05:02:23 -0400
Joseph Park <josephpark at ieee.org> wrote:

            
Could you please explain what happens instead of the expected behaviour?

Your examples seem to work without errors on CRAN machines running
Windows with R ≥ 4.2.0 and UTF-8 encoding:

https://www.r-project.org/nosvn/R.check/r-release-windows-x86_64/rEDM-00check.html
https://www.r-project.org/nosvn/R.check/r-devel-windows-x86_64/rEDM-00check.html
#
Apologies for the pages of minutiae.  I endeavored to post a reproducible
example. I'm unable to show the failure since R simply hangs at the prompt
with CPU spinning and memory cyclically ramping and declining.  One has to
kill R. The posted commands show the workaround, not the failure.

I have since found that changing just LC_COLLATE is enough to allow the
library to load:
[1] "English_United States.1252"
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.utf8;LC_MONETARY=English_United
States.utf8;LC_NUMERIC=C;LC_TIME=English_United States.utf8"

Again, apologies for my naivety.

On Sat, Jun 11, 2022 at 6:16 AM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:

  
  
#
Thank you for checking the CRAN builds; that was my first step as well.
Perhaps the CRAN setups differ from mine in some way, as I have
reproduced this on 3 Windows 10 machines with clean installs of R 4.2.0,
and it has been reported by other users.  I also noted in the post that
building and installing via devtools reports success (  ** testing if
installed package can be loaded from temporary location ). However, a
subsequent attempt to load hangs.
On Sat, Jun 11, 2022 at 6:33 AM Joseph Park <josephpark at ieee.org> wrote:

            

  
  
#
Hi,

As your package seems to use std::regex [1], you might hit this bug in GCC.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

This thread might also help:

https://github.com/tesseract-ocr/tesseract/issues/3830
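
To check whether a given machine is affected, a minimal C++ sketch along
these lines times the construction and first use of a std::regex under a
chosen LC_COLLATE setting. The function name, the pattern, and the test
string are illustrative assumptions, not taken from rEDM or from the GCC
report. The working assumption is that std::regex consults the C locale's
collation on this toolchain, which would be consistent with the LC_COLLATE
workaround reported earlier in the thread.

```cpp
#include <chrono>
#include <clocale>
#include <regex>
#include <string>

// Hypothetical helper: build a std::regex and run one search while
// LC_COLLATE is set to `collate_locale`, returning the elapsed time in
// milliseconds. Returns -1.0 if the locale name is not accepted.
double time_regex_ms(const char* collate_locale, const std::string& pattern) {
    // Copy the current setting: the pointer returned by setlocale may be
    // invalidated by later calls.
    const char* current = std::setlocale(LC_COLLATE, nullptr);
    const std::string saved = current ? current : "C";

    if (std::setlocale(LC_COLLATE, collate_locale) == nullptr) {
        return -1.0;  // locale unchanged, nothing to restore
    }

    const std::string sample = "2022-06-11";  // arbitrary test input
    const auto start = std::chrono::steady_clock::now();
    const std::regex re(pattern);  // bracket expressions are the suspect
    const bool found = std::regex_search(sample, re);
    const auto stop = std::chrono::steady_clock::now();
    (void)found;  // only the timing matters here

    // Restore the previous collation setting.
    std::setlocale(LC_COLLATE, saved.c_str());

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Comparing time_regex_ms("C", "[0-9]{4}-[0-9]{2}-[0-9]{2}") with the same
call using "English_United States.utf8" (the locale name from the
sessionInfo() output above) should show whether the locale makes a
difference. On an affected system the second call may not return promptly.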

Best,
Yutani

[1]: https://github.com/SugiharaLab/rEDM/blob/be6d81fb586ceac3dab59b061b5ed867e276dd83/src/cppEDM/src/DateTime.cc#L16

On Sat, Jun 11, 2022 at 19:48, Joseph Park <josephpark at ieee.org> wrote:
#
On 11/06/2022 6:43 a.m., Joseph Park wrote:
One possible difference is the version of Windows 10.  The UTF8 handling 
was described in the NEWS file this way:

"R uses UTF-8 as the native encoding on recent Windows systems (at least 
Windows 10 version 1903, Windows Server 2022 or Windows Server 1903). As 
a part of this change, R uses UCRT as the C runtime. UCRT should be 
installed manually on systems older than Windows 10 or Windows Server 
2016 before installing R."

Conceivably the systems where this fails don't have the new UCRT 
runtime.  I believe running Windows Update should get it.

If it doesn't, or for users on an older Windows version, this page lets 
you download it: 
https://www.microsoft.com/en-us/download/details.aspx?id=48234 .


Duncan Murdoch
#
It looks like Hiroaki identified the issue.

When the C++ std::regex code is removed from the underlying API, the
problem seems solved. Thank you!

The symptoms observed match those described in the tesseract issue thread.
The solution outlined in the gcc bug report seems the most prudent course:
Don't use std::regex.  I'll work on that and see if it resolves the issue.
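
As a rough sketch of what "don't use std::regex" can look like for
date-like strings, the fragment below validates and parses a fixed
"YYYY-MM-DD" layout with plain character checks. The struct, the function
name, and the format itself are hypothetical; the formats actually handled
in DateTime.cc may differ, and this is not the code that went into cppEDM.

```cpp
#include <cstddef>
#include <string>

// Hypothetical container for a parsed date.
struct SimpleDate {
    int year;
    int month;
    int day;
};

// Parse "YYYY-MM-DD" without std::regex or any locale-dependent call.
// Returns true and fills `out` on success; leaves `out` untouched on failure.
// Only range-checks month (1-12) and day (1-31); it does not validate
// days per month or leap years.
bool parse_iso_date(const std::string& text, SimpleDate& out) {
    if (text.size() != 10 || text[4] != '-' || text[7] != '-') {
        return false;
    }
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i == 4 || i == 7) {
            continue;  // separators already checked
        }
        // Direct range comparison avoids the locale-aware <cctype> functions.
        if (text[i] < '0' || text[i] > '9') {
            return false;
        }
    }
    const int year = std::stoi(text.substr(0, 4));
    const int month = std::stoi(text.substr(5, 2));
    const int day = std::stoi(text.substr(8, 2));
    if (month < 1 || month > 12 || day < 1 || day > 31) {
        return false;
    }
    out.year = year;
    out.month = month;
    out.day = day;
    return true;
}
```

The trade-off is that each supported format needs its own explicit check,
but nothing in the load path depends on the locale's collation rules.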

Comment #6 seems relevant:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723#c6

Again: Thank you!

On Sat, Jun 11, 2022 at 8:49 AM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote: