R-4.3 version list.files function could not work correctly in chinese

20 messages · 叶月光, Ivan Krylov, Yihui Xie +3 more

叶月光

Thu, Aug 10, 2023 8:41 PM #

????
          ????R-4.3???????????R??????list.files??????????????????????BUG????????????????????????????
r4.3????dir????????????????? - COS??? | ?????? | ?????????????? (cosx.org)<https://d.cosx.org/d/424356-r43ban-ben-zhong-dirhan-shu-huo-qu-bu-liao-quan-bu-wen-jian/11>
          ???????????????????????

Ivan Krylov

Fri, Aug 11, 2023 4:24 AM #

Dear 叶月光,

Thank you for your message, but please follow the posting guide in your
future messages: https://www.r-project.org/posting-guide.html
https://www.r-project.org/bugs.html

I understand from your link that list.files() ends up skipping some
Chinese filenames in R-4.3.1 (but not R-4.2.2) on Windows, but would you
(or perhaps Yihui Xie who I see is also participating in the discussion)
mind translating the rest of your findings into English? Have you been
able to narrow down the problem to certain character ranges, for
example?

Best regards,
Ivan

Yihui Xie

Fri, Aug 11, 2023 9:40 PM #

Yes, I participated in the discussion. Basically dir() failed to list all
files since R 4.3.0 when filenames start with Chinese characters. I don't
have a Windows machine to test it, but this might be a minimal reproducible
example:

file.create("????.R")
dir()

The OP said dir() would return "????.R" in R 4.2.2 but not in R 4.3.0. In
the same discussion another person mentioned that the problem could also be
related to the file encoding, i.e., if the file content is encoded in
UTF-8, it could be recognized by dir(), but not in ANSI.

Regards,
Yihui
--
https://yihui.org

On Fri, Aug 11, 2023 at 6:25 AM Ivan Krylov <krylov.r00t at gmail.com> wrote:

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Ivan Krylov

Sat, Aug 12, 2023 8:33 AM #

Dear Yihui,

Thanks a lot for your help!

Unfortunately, I was not able to reproduce this. I've tried creating
files with Chinese characters in their names and populating them
with valid UTF-8 and valid non-UTF-8 text, but R seems to be able to
list them all in my case.

I'm running a US English evaluation ISO image of a slightly newer build
of Windows 10, and I also compiled R-4.3.1 from source, anticipating
having to single-step through the list.files() implementation:

sessionInfo()
# R version 4.3.1 (2023-06-16 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
# 
# Matrix products: default
# 
# 
# locale:
# [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
# [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
# [5] LC_TIME=English_United States.utf8
# 
# time zone: America/Los_Angeles
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base
#
# loaded via a namespace (and not attached):
# [1] compiler_4.3.1
dir("????")
# [1] "????-non-utf8-?????.txt" "????-utf-8.txt"         
system('cmd /c dir /s *.txt')
#  Volume in drive C has no label.
#  Volume Serial Number is A85A-AA74
# 
#  Directory of C:\R\R-4.3.1\bin\x64\????
# 
# 08/12/2023  07:57 AM                22 ????-non-utf8-?????.txt
# 08/12/2023  07:56 AM                18 ????-utf-8.txt
#                2 File(s)             40 bytes
# 
#      Total Files Listed:
#                2 File(s)             40 bytes
#                0 Dir(s)  29,538,418,688 bytes free
# [1] 0

(The OEM codepage cannot represent the characters I used in the file
names, but all the files are present in both lists.)

To find out what's wrong, someone will need to download the R source
code and compile it [*], install gdb using pacman (part of Rtools),
then set a breakpoint on the list_files function from
src/main/platform.c and step through it [**], paying attention to the
R_readdir calls. Do the missing file names not even come out from
FindNextFile()? Are they somehow skipped around the time of the regex
match?

(I could help with the details of this, maybe off-list, if there's
interest.)

Unless Tomas Kalibera is able to deduce the root cause from the
observed symptoms, someone who can reproduce the problem will have to
investigate further.

Best regards,
Ivan

[*] https://cran.r-project.org/bin/windows/base/howto-R-devel.html

[**] https://beej.us/guide/bggdb/

叶月光

Sat, Aug 12, 2023 10:39 PM #

The list.files function does not work correctly:


 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: r-sessioninfo.png
Type: image/png
Size: 127929 bytes
Desc: r-sessioninfo.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment.png>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: list.files_test.png
Type: image/png
Size: 38952 bytes
Desc: list.files_test.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment-0001.png>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: path-files.png
Type: image/png
Size: 29532 bytes
Desc: path-files.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment-0002.png>

Ivan Krylov

Sun, Aug 13, 2023 12:58 AM #

Dear 叶月光,

I believe you that there's a problem with list.files() and file names
in Chinese. There is no need for additional proof. Unfortunately, it's
impossible to fix the problem unless its source is found:
https://www.chiark.greenend.org.uk/~sgtatham/bugs-cn.html

Can you give me more examples of file names, _as text_, that I could
_copy and paste_ into my computer in order to (hopefully) reproduce the
problem here?

Alternatively, can you use a debugger for programs written in C? Do you
know someone who does?

Best regards,
Ivan

yu gong

Sun, Aug 13, 2023 2:28 AM #

I am afraid this issue is a bit more complicated.
I tested Rgui and Rterm 4.3.1 and SVN trunk on Windows 10 x64 (build 19044); Chinese file names show correctly (whether the file content is ANSI or UTF-8).
I saw the OP's picture (using RStudio); maybe this is an RStudio issue?

yu gong

Sun, Aug 13, 2023 2:36 AM #

Could you test it in RGui and Rterm first to see whether it works, and then try RStudio?

Ivan Krylov

Sun, Aug 13, 2023 4:16 AM #

Found it! Looks like a buffer length problem. This isn't limited to
Chinese, just more likely to happen when a character takes three bytes
to represent in UTF-8. (Any filename containing characters which take
more than one byte to represent in UTF-8 may fail.)

If a directory contains a file with a sufficiently long name,
FindNextFile() fails with ERROR_MORE_DATA (0xEA, 234), making
R_readdir() return NULL, stopping list_files() prematurely:

# everything seems to work fine...

list.files("????")
# [1] "????-non-utf8-?????
????????????????????????????????????????????????????.txt"
# [2] "????-non-utf8-?????.txt"
# [3] "????-utf-8.txt"

# now create a file with an even longer name

list.files("????")
# [1] "????-non-utf8-?????
????????????????????????????????????????????????????.txt"

# the files are still there, but not visible to list.files():

system("cmd /c dir /s *.txt")
#  Volume in drive C has no label.
#  Volume Serial Number is A85A-AA74
#
#  Directory of C:\R\R-4.3.1\bin\x64\????
#
# 08/12/2023  07:57 AM                22 ????-non-utf8-?????
????????????????????????????????????????????????????.txt
# 08/12/2023  07:57 AM                22 ????-non-utf8-?????
????????????????????????????????????????????????????????????????????????????????????????????????????????.txt
# 08/12/2023  07:57 AM                22 ????-non-utf8-?????.txt
# 08/12/2023  07:56 AM                18 ????-utf-8.txt
#                4 File(s)             84 bytes
# 
#       Total Files Listed:
#                4 File(s)             84 bytes
#                0 Dir(s)  29,281,538,048 bytes free
# [1] 0

Increasing the path length limits [*] doesn't help, since it's the
filename length limit that we're bumping against. While both
WIN32_FIND_DATAA and WIN32_FIND_DATAW contain fixed-size buffers, a
valid filename may take more than MAX_PATH bytes to represent in UTF-8
while still being under the limit of MAX_PATH wide characters. This may
mean having to rewrite list_files in terms of R_wopendir()/R_wreaddir()
for Windows. As a workaround, we may use the short filename (which
sometimes may not exist, alas) when FindNextFile() fails with
ERROR_MORE_DATA.
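
The mismatch between the two limits can be checked from R on any platform. The following is only a minimal sketch, not taken from the R sources: the sample character U+4E2D and the repeat count of 100 are arbitrary choices for illustration, and the UTF-16 size is computed with iconv() rather than with any Windows API.

```r
# A file name made of 100 copies of a CJK character (U+4E2D); the character
# and the count are arbitrary choices for illustration.
name <- strrep("\u4e2d", 100)

# Size in UTF-8 bytes: what a byte-oriented API has to store.
utf8_bytes <- nchar(enc2utf8(name), type = "bytes")

# Size in UTF-16 code units: what a wide-character API has to store.
utf16_raw <- iconv(name, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
utf16_units <- length(utf16_raw) / 2

utf8_bytes   # 300
utf16_units  # 100
```

The same name is far below 260 wide characters yet above 260 bytes, which is the situation described above.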

Best regards,
Ivan

[*]
https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation

叶月光

Sun, Aug 13, 2023 6:45 PM #

Rterm.exe test result:

D:\Project_Delivery\

[1] "2022(1).xlsx"
[2] "conf_custom_wf_wt_map_202308091545.csv"
[3] ".R"
[4] ".xlsx"
[5] "_.xlsx"
[6] ".xlsx"
[7] " (3).xlsx"
[8] "20230222113605379(1).xlsx"
[9] "_2022_20230811.docx"

None of the file names that contain Chinese characters are printed correctly.
The results in RGui and RStudio are the same:

D:\Project_Delivery\??

[1] "2022????????????????(1).xlsx" "conf_custom_wf_wt_map_202308091545.csv"


Kevin Ushey

Mon, Aug 14, 2023 7:26 AM #

Just to rule it out... is it possible that R is listing these files
successfully, but is not printing the Chinese characters in those
names for some reason?

Using your example, what is the output of:

    f <- list.files(a, recursive = TRUE)
    nchar(f)

Does the reported number of characters match what you see?
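
A file name can also be inspected without relying on how the console prints it. The sketch below uses a made-up file name (the string and its single CJK character are assumptions for illustration, not one of the OP's files) and only base R functions:

```r
# Hypothetical file name containing one CJK character (U+5E74), written with
# an escape so the example does not depend on the console encoding.
f <- "2022\u5e74(1).xlsx"

nchar(f, type = "chars")  # number of characters: 13
nchar(f, type = "bytes")  # number of bytes in UTF-8: 15
Encoding(f)               # declared encoding of the string

# Code points in U+XXXX form; this is pure ASCII and prints on any console.
code_points <- sprintf("U+%04X", utf8ToInt(f))
code_points[5]            # "U+5E74"
```

If the character count is larger than what is visible on screen, the names are being listed and only their display is affected.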

Best,
Kevin

On Mon, Aug 14, 2023 at 12:32 AM 叶月光 <yeyueguang at goldwind.com> wrote:

Tomas Kalibera

Mon, Aug 14, 2023 11:38 PM #

On 8/13/23 13:16, Ivan Krylov wrote:

Thanks, Ivan, could you please turn this into a complete minimal 
reproducible example, ideally with only ASCII characters (if enough to 
trigger)? Or any reproducible example would do. I would have a look 
later today.

I admit I didn't get your analysis. However, I rewrote this code 
for R 4.3 to support long paths (when enabled in the system); more on that in 
https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html. 
As this was reported to be a regression in 4.3, it is entirely possible 
that this change introduced it (though it is a bit surprising that we didn't 
catch it earlier in testing), so it would be a great help if I could 
have the example and debug it.

Thanks,
Tomas

Ivan Krylov

Tue, Aug 15, 2023 12:04 AM #

On Tue, 15 Aug 2023 08:38:11 +0200
Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

Sorry, let me try to be more clear.

The Windows filename length limit is 255(?) wide characters. The
WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
takes more than one byte to be represented in UTF-8, it may overflow
the 260 byte limit in the WIN32_FIND_DATAA structure despite being
below the 260 wide character limit. When such an overflow happens,
FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
which results in R_readdir() returning NULL and makes list_files() stop
before listing the rest of the directory.

This is easier to make happen by accident with Chinese characters,
because they take three UTF-8 bytes per character.

Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8.
Create a file with a name consisting of this symbol repeated 140 times.
When you run list.files() on the resulting directory on Windows with a
UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
260-byte buffer, which doesn't work. I'm afraid the only way to avoid
such a failure is to rewrite R_readdir using the wide character API and
convert the file names on the fly. (Just like mingw readdir() did in
the past?)

stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
# any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
# any number >260/2 should do
file.create(strrep('\uf8', 140))
list.files()

Does this work? I don't have access to a UTF-8 Windows machine right
now.
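
The byte arithmetic behind the example above can be checked on any platform, without a UTF-8 Windows machine. This is a rough sketch: exceeds_byte_buffer() is a hypothetical helper, not part of R, and counting the terminating NUL byte against the 260-byte buffer is an assumption.

```r
# Hypothetical helper: flag file names whose UTF-8 encoding, plus a
# terminating NUL byte, would not fit into a buffer of `buffer_size` bytes.
# The default of 260 is taken from the discussion above; treating the NUL
# as part of the buffer is an assumption.
exceeds_byte_buffer <- function(names, buffer_size = 260L) {
  nchar(enc2utf8(names), type = "bytes") + 1L > buffer_size
}

candidates <- c(
  strrep("\uf8", 140),  # 280 bytes: too long, as in the example above
  strrep("\uf8", 100),  # 200 bytes: fits
  strrep("a", 200)      # 200 bytes: fits
)
flags <- exceeds_byte_buffer(candidates)
flags  # TRUE FALSE FALSE
```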

Best regards,
Ivan

Tomas Kalibera

Tue, Aug 15, 2023 7:00 AM #

On 8/15/23 09:04, Ivan Krylov wrote:

Thanks, yes, I can reproduce the problem. Some Windows functions impose 
a limit of 260 wide characters, but others a limit of 260 bytes, so one 
can create a file with a name too long to be found by FindNextFileA.

In R 4.2, we used readdir() from mingw-w64, which itself used findnext 
and had the same problem: it used a 260-byte buffer, and judging from 
the mingw-w64 code and the Windows documentation, it should have behaved 
the same way and stopped the search at such a long file name. However, 
in my test, R 4.2.3 crashed inside findnext due to a stack overrun. 
R 4.1.3 worked, but since it didn't use UTF-8, it would clearly take a 
different test case to overrun the buffer there. This suggests that 
findnext had no check for this and hence caused memory corruption, which 
can lead to a crash or appear to work by coincidence. That could have 
been the case for the user who reported this as a regression compared 
to R 4.2. But it is not a regression; the problem has existed for a 
long time.

So, yes, we'd probably have to use wide variants of FindNext/FindFirst. 
I'll fix.

Thanks for debugging this,
Tomas

Tomas Kalibera

Tue, Aug 15, 2023 7:14 AM #

Dear 叶月光,

as discussed on this thread, Ivan Krylov found a bug in R, which could 
be causing the problem you have run into. To confirm this is the cause, 
could you please check outside R (say in explorer) if you have any file 
with a very long name in the directory? And if so, does moving that file 
away make the problem disappear? Files with up to 80 characters couldn't 
trigger this bug.

A workaround for this bug is not to use file names with more than 80 
(possibly all Chinese) characters.

The content of a file (or whether the content is in UTF-8 or not) cannot 
be influencing this problem directly, neither list.files() nor Windows 
looks into the files when listing them.

The bug Ivan found is not a regression: older versions of R may crash 
when you have such long file names. So there would be no point in staying 
with an older version to overcome this problem: the only reliable 
workaround I can think of is to use reasonably short file names.

Best
Tomas

Tomas Kalibera

Wed, Aug 16, 2023 12:42 AM #

On 8/15/23 16:00, Tomas Kalibera wrote:

Fixed in R-devel (84960). Please let me know if you see any problem with 
the fix.

Thanks,
Tomas

yu gong

Wed, Aug 16, 2023 4:11 AM #

A little more information on this issue.
Searching the Microsoft website today, I found the documentation on "Maximum Path Length Limitation": Maximum Path Length Limitation - Win32 apps | Microsoft Learn <https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry>.
According to the documentation, two things are needed to avoid this issue on Windows 10 and later:
1. Edit the registry or group policy to set [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem] "LongPathsEnabled"=dword:00000001

2. Opt in through the application manifest (R has already done this)

Regards,
yu

Ivan Krylov

Wed, Aug 16, 2023 4:22 AM #

On Wed, 16 Aug 2023 09:42:09 +0200
Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

Thank you for implementing the fix! I gave 叶月光 the link to the
GitHub Action build of the r84960 installer.

I'm worried that 叶月光 was seeing FindNextFileA fail for a different
reason (all the examples given at the Capital of Statistics forum
seemed to use fewer than 256/4 = 64 characters per file name...), but
maybe this won't reappear with the switch to FindNextFileW. If this
keeps happening, it might be worth producing a warning when
FindNextFileW() fails with an unexpected GetLastError() value.

fs::dir_ls() uses NtQueryDirectoryFile() and WideCharToMultiByte()
instead of FindNextFileW() and wcstombs(), but maybe this shouldn't
matter. In particular, both list.files() and fs::dir_ls() would fail
given a file name that cannot be represented in UTF-8 (invalid UTF-16
surrogate pairs?).

Best regards,
Ivan

Tomas Kalibera

Wed, Aug 16, 2023 5:59 AM #

On 8/16/23 13:11, yu gong wrote:

These settings are for long paths (meaning a full path consisting of 
multiple elements separated by backslashes); more about that is also in 
[1].


But the problem that Ivan reported (which is not clear whether it is the 
same problem as the one reported originally on this thread), is about 
the limit for a single file/directory name - that is, for a single 
element of a path. Having the long paths enabled in the registry 
wouldn't help with this.


These two limits are not directly related, except the obvious: by 
choosing rather long names for individual files, one usually soon runs 
out of the limit for the full path.
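
The difference between the two limits can be made concrete in R. This is only an illustrative sketch: path_lengths() is a hypothetical helper, the path is made up, and measuring the element in UTF-8 bytes reflects the byte-oriented buffer discussed earlier in the thread.

```r
# Hypothetical helper: report the length of a whole path in characters and
# the length of its longest single element in characters and UTF-8 bytes.
path_lengths <- function(path) {
  path <- enc2utf8(path)
  parts <- strsplit(path, "[/\\\\]")[[1]]
  longest <- parts[which.max(nchar(parts, type = "bytes"))]
  c(total_chars = nchar(path, type = "chars"),
    longest_element_chars = nchar(longest, type = "chars"),
    longest_element_bytes = nchar(longest, type = "bytes"))
}

# A made-up path that is short overall but has one long element.
p <- paste0("D:\\data\\", strrep("\u4e2d", 100), ".txt")
lens <- path_lengths(p)
lens
```

Here the full path has 112 characters, well within the classic path limit, while its last element alone takes 304 bytes in UTF-8.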


Best

Tomas


[1] - 
https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html

Tomas Kalibera

Wed, Aug 16, 2023 7:00 AM #

On 8/16/23 13:22, Ivan Krylov wrote:

Thanks and thanks for looking at the change.

I've added a warning to R-devel when list.files() on Windows stops 
listing a directory due to an error.
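
A script that depends on a complete listing could capture such a warning rather than let it scroll past. The following is a sketch under assumptions: list_files_checked() is a hypothetical wrapper, and since the wording of the new warning is not quoted in this thread, it records any warning raised by list.files().

```r
# Hypothetical wrapper: collect any warnings raised while listing a
# directory instead of letting them scroll past unnoticed.
list_files_checked <- function(path = ".") {
  warnings_seen <- character()
  files <- withCallingHandlers(
    list.files(path),
    warning = function(w) {
      warnings_seen <<- c(warnings_seen, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(files = files, warnings = warnings_seen)
}

# Usage on a scratch directory with one file.
d <- tempfile("listing-")
dir.create(d)
file.create(file.path(d, "example.txt"))
res <- list_files_checked(d)
res$files
length(res$warnings) == 0  # TRUE means the listing raised no warnings
```

Collecting the messages with a calling handler keeps the partial result available, so the caller can decide whether an incomplete listing is acceptable.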

There is probably not much more we can do unless there is a revised bug 
report of the original problem.

Right, R only supports file names that are valid strings. This assumption 
is present in many places in the code, so it is fine/consistent for it to 
hold here as well. The choice of opendir/readdir in R was probably motivated 
by minimization of platform-specific code.

Best
Tomas