
Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3?

Hi Yihui,

list.files() returns file names that Windows has converted to the native 
encoding, so file names should use only characters representable in the 
current native encoding. To be safe, it makes sense to be much stricter 
than that: only ASCII, and only a subset of it (a number of 
recommendations can be found online). Using more than that is asking for 
trouble.
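
As an illustration of the "subset of ASCII" advice above, here is a minimal sketch of a check for conservative file names. The helper name is_portable_name() and the exact set of allowed characters (ASCII letters, digits, dot, underscore, hyphen) are assumptions made for this example, not an official R recommendation.

```r
# Hypothetical helper: TRUE when the base name of each path uses only
# ASCII letters, digits, dot, underscore and hyphen.
# perl = TRUE keeps the character ranges independent of the locale's
# collation order.
is_portable_name <- function(path) {
  grepl("^[A-Za-z0-9._-]+$", basename(path), perl = TRUE)
}

# One conservative name and one name containing "\u00e4"
is_portable_name(c("report_2020.txt", "f\u00e4.txt"))
```

A check like this could be run over list.files() output before relying on those names in code that has to work across locales.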

Unicode "\u00e4" is a Latin-1 character, so it is representable in CP1252. 
On my Windows running with CP1252 as both the C locale and the system code 
page, your example works fine: file.exists() returns TRUE, and this is the 
expected behavior (tested in R-devel and R 4.0).

Your example was run in CP1252 as C locale but CP936 as the system code 
page (see the sessionInfo() output). On Windows, unfortunately, there 
are two different "current locales" at a time. With your settings 
(CP1252 as C locale and CP936 as system code page), I get the same 
results as you, file.exists() returns FALSE. enc2native(z) works fine 
and returns a valid Latin-1 string, but that is because here "native" is 
CP1252. Windows API functions, and consequently some C library functions 
that return strings from the OS, convert to the encoding of the system 
code page instead, which is CP936 here, and CP936 cannot represent "\u00e4". 
So the behavior you are reporting is currently expected for R 4.0 and 
earlier. I don't think this is a regression: it couldn't have worked 
before either, and I've tested in 3.6.3 and 3.4.3 on my system.
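
The representability argument above can be probed with iconv(). This is only a sketch: the helper name is_representable() is made up for this example, the encoding names accepted (for instance "CP1252" and "CP936") depend on the platform's iconv implementation, and some implementations may substitute characters rather than fail. It approximates, but does not reproduce, what the Windows API does when it converts file names to the system code page.

```r
# Hypothetical helper: for each element of x, TRUE when the string can be
# converted from UTF-8 to the encoding 'to'. With the default sub = NA,
# iconv() returns NA for a string it cannot convert.
# If the platform's iconv does not know the encoding name, iconv() raises
# an error, which is turned into NA here.
is_representable <- function(x, to) {
  tryCatch(
    !is.na(iconv(enc2utf8(x), from = "UTF-8", to = to)),
    error = function(e) rep(NA, length(x))
  )
}

z <- "\u00e4"
is_representable(z, "CP1252")  # C locale encoding in the report
is_representable(z, "CP936")   # system code page in the report
```

If the two calls disagree, a file name containing z can be valid in R's native encoding and still be unrepresentable in strings returned by the OS, which is the situation described above.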

These problems will go away when UTF-8 is both the current native 
encoding for the C locale and the system code page. This is possible in 
recent Windows 10, but requires UCRT and hence a new toolchain to build 
R, and requires all packages and libraries to be rebuilt from source. 
More details are on my blog; an experimental build of R (installer) and 
an experimental toolchain are also available:
https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/index.html
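
To see which situation a given session is in, one can inspect l10n_info(). The sketch below assumes that on Windows l10n_info() reports "codepage" and "system.codepage" elements (the latter may be absent in older R versions, in which case it shows up as NULL here); the function names locale_summary() and utf8_everywhere() are made up for this example. 65001 is the Windows code page identifier for UTF-8.

```r
# Hypothetical helper: collect the locale facts discussed in this thread.
locale_summary <- function() {
  info <- l10n_info()
  list(
    ctype           = Sys.getlocale("LC_CTYPE"),
    native_is_utf8  = isTRUE(info[["UTF-8"]]),
    codepage        = info[["codepage"]],        # Windows only, else NULL
    system_codepage = info[["system.codepage"]]  # Windows only, else NULL
  )
}

# Hypothetical helper: TRUE when the native encoding is UTF-8 and, on
# Windows, the system code page is UTF-8 (65001) as well.
utf8_everywhere <- function(s = locale_summary()) {
  if (.Platform$OS.type != "windows") return(s$native_is_utf8)
  s$native_is_utf8 && identical(as.integer(s$system_codepage), 65001L)
}

str(locale_summary())
utf8_everywhere()
```

On a setup like the one in the report (CP1252 as C locale, CP936 as system code page), one would expect utf8_everywhere() to return FALSE, with differing codepage and system_codepage values.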

Best
Tomas
On 6/22/20 6:11 AM, Yihui Xie wrote:
