`basename` and `dirname` change the encoding to "UTF-8"

On 6/29/20 4:39 PM, Johannes Rauh wrote:
Please try to always submit a minimal reproducible example with your 
reports and test with at least the latest released version of R, ideally 
also with R-devel.

As you have not sent a reproducible example, it is hard to tell for
sure, but most likely, as Kevin wrote, you have run into a real bug,
which was, however, already fixed in 4.0.2 and in R-devel (17833). The
lazy-loading cache did not work with file names in non-native encoding.

That real bug was uncovered by legitimate and correct changes like the
ones you report, where file operations started returning non-ASCII
strings in UTF-8. Historically, such functions in R would instead return
native strings with misrepresented characters, and we were reluctant to
change that because we expected it to wake up bugs in code that silently
assumes native encoding. Still, as people were increasingly running into
problems with non-representable characters, we made that change in
several functions anyway, and yes, it started waking up bugs.

With some performance overhead and added complexity, we could
preferentially return results in native encoding, and in UTF-8 only
when they include non-representable characters. That would wake up
existing bugs with smaller probability. Note that some code that
previously relied on best-fit conversions done by Windows will have been
broken anyway. We would have to bypass win_iconv/iconv for that (adding
more complexity). Bugs in code not handling encodings properly would
still be triggered via non-representable characters. I've recently
changed file.path() in R-devel to be slightly more conservative again,
along these lines.
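
To make the "native when possible, UTF-8 otherwise" idea concrete, here is a minimal sketch in Python. It is not R's implementation: the function name `prefer_native` and the choice of cp1252 as the native encoding are assumptions made only for illustration. Python's cp1252 codec is strict, so it does not perform Windows-style best-fit substitutions.

```python
def prefer_native(text, native_encoding="cp1252"):
    """Encode text in the native encoding if every character is
    representable there; otherwise fall back to UTF-8.

    Hypothetical helper for illustration only. Returns a tuple of
    (encoded bytes, label of the encoding that was used).
    """
    try:
        # Strict encoding: raises instead of substituting characters.
        return text.encode(native_encoding), "native"
    except UnicodeEncodeError:
        # At least one character is not representable natively.
        return text.encode("utf-8"), "UTF-8"


# "é" exists in cp1252, so the native encoding is kept.
print(prefer_native("café"))
# These characters are not in cp1252, so the result is UTF-8.
print(prefer_native("日本.txt"))
```

Code that silently assumes native encoding would keep working for the first input and would still break on the second, which is the point made above about non-representable characters.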

We can still do it more widely, but it is not high on the priority list.
The way to fix all of these problems is switching to UTF-8 as native
encoding on Windows, and every day spent on tuning the existing behavior
postpones that real solution.

Best
Tomas

Thread (13 messages)

Juan Telleria Ruiz de Aguirre: Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3? (Jun 10)
Tomas Kalibera: Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3? (Jun 10)
Dirk Eddelbuettel: Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3? (Jun 10)
Kevin Ushey: Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3? (Jun 10)
Juan Telleria Ruiz de Aguirre: Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3? (Jun 10)
Yihui Xie: Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3? (Jun 21)
Tomas Kalibera: Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3? (Jun 22)
Yihui Xie: Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3? (Jun 23)
Johannes Rauh: `basename` and `dirname` change the encoding to "UTF-8" (Jun 29)
Duncan Murdoch: `basename` and `dirname` change the encoding to "UTF-8" (Jun 29)
Kevin Ushey: `basename` and `dirname` change the encoding to "UTF-8" (Jun 29)
Tomas Kalibera: `basename` and `dirname` change the encoding to "UTF-8" (Jun 30)
Johannes Rauh: `basename` and `dirname` change the encoding to "UTF-8" (Jun 30)