????
????R-4.3???????????R??????list.files??????????????????????BUG????????????????????????????
r4.3????dir????????????????? - COS??? | ?????? | ?????????????? (cosx.org)<https://d.cosx.org/d/424356-r43ban-ben-zhong-dirhan-shu-huo-qu-bu-liao-quan-bu-wen-jian/11>
???????????????????????
R-4.3 version list.files function could not work correctly in chinese
20 messages · 叶月光, Ivan Krylov, Yihui Xie +3 more
Dear ???,
Thank you for your message, but please follow the posting guide in your
future messages:
https://www.r-project.org/posting-guide.html
https://www.r-project.org/bugs.html
I understand from your link that list.files() ends up skipping some
Chinese filenames in R-4.3.1 (but not R-4.2.2) on Windows, but would you
(or perhaps Yihui Xie, who I see is also participating in the discussion)
mind translating the rest of your findings into English? Have you been
able to narrow down the problem to certain character ranges, for example?
Best regards, Ivan
Yes, I participated in the discussion. Basically dir() failed to list all
files since R 4.3.0 when filenames start with Chinese characters. I don't
have a Windows machine to test it, but this might be a minimal reproducible
example:
file.create("????.R")
dir()
The OP said dir() would return "????.R" in R 4.2.2 but not in R 4.3.0. In
the same discussion another person mentioned that the problem could also be
related to the file encoding, i.e., if the file content is encoded in
UTF-8, it could be recognized by dir(), but not in ANSI.
Regards,
Yihui
--
https://yihui.org
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Dear Yihui,
Thanks a lot for your help!
Unfortunately, I was not able to reproduce this. I've tried creating
files with Chinese characters in their names and populating them
with valid UTF-8 and valid non-UTF-8 text, but R seems to be able to
list them all in my case.
I'm running a US English evaluation ISO image of a slightly newer build
of Windows 10, and I also compiled R-4.3.1 from source, anticipating
having to single-step through the list.files() implementation:
sessionInfo()
# R version 4.3.1 (2023-06-16 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
#
# Matrix products: default
#
#
# locale:
# [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United
# States.utf8
# [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
# [5] LC_TIME=English_United States.utf8
#
# time zone: America/Los_Angeles
# tzcode source: internal
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# loaded via a namespace (and not attached):
# [1] compiler_4.3.1
dir("????")
# [1] "????-non-utf8-?????.txt" "????-utf-8.txt"
system('cmd /c dir /s *.txt')
# Volume in drive C has no label.
# Volume Serial Number is A85A-AA74
#
# Directory of C:\R\R-4.3.1\bin\x64\????
#
# 08/12/2023 07:57 AM 22 ????-non-utf8-?????.txt
# 08/12/2023 07:56 AM 18 ????-utf-8.txt
# 2 File(s) 40 bytes
#
# Total Files Listed:
# 2 File(s) 40 bytes
# 0 Dir(s) 29,538,418,688 bytes free
# [1] 0
(The OEM codepage cannot represent the characters I used in the file
names, but all the files are present in both lists.)
To find out what's wrong, someone will need to download the R
source code and compile it [*], install gdb using pacman (part of
Rtools), then set a breakpoint on the list_files function from
src/main/platform.c and step through it [**], paying attention to the
R_readdir calls. Do the missing file names not even come out from
FindNextFile()? Are they somehow skipped around the time of regex match?
(I could help with the details of this, maybe off-list, if there's
interest.)
Unless Tomas Kalibera is able to deduce the root cause from the
observed symptoms, someone who can reproduce the problem will have to
investigate further.
Best regards,
Ivan
[*] https://cran.r-project.org/bin/windows/base/howto-R-devel.html
[**] https://beej.us/guide/bggdb/
Is the list.files function not working correctly?
Email system security tips:
The use of emails to collect personal information, account passwords, bank card information, help, subsidies, money transfers, etc. is "phishing email" or "virus email", no response is required, and please delete it immediately.
If you encounter email security issues, please contact ITSecurity at goldwind.com.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: r-sessioninfo.png
Type: image/png
Size: 127929 bytes
Desc: r-sessioninfo.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: list.files_test.png
Type: image/png
Size: 38952 bytes
Desc: list.files_test.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: path-files.png
Type: image/png
Size: 29532 bytes
Desc: path-files.png
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20230813/ab7820ad/attachment-0002.png>
Dear ???,
I believe you that there's a problem with list.files() and file names in
Chinese. There is no need for additional proof. Unfortunately, it's
impossible to fix the problem unless its source is found:
https://www.chiark.greenend.org.uk/~sgtatham/bugs-cn.html
Can you give me more examples of file names, _as text_, that I could
_copy and paste_ into my computer in order to (hopefully) reproduce the
problem here?
Alternatively, can you use a debugger for programs written in C? Do you
know someone who does?
Best regards, Ivan
I am afraid this issue is a bit more complicated. I tested Rgui and Rterm 4.3.1 and the svn trunk on Windows 10 x64 (build 19044), and Chinese file names are shown correctly (whether the file content is ANSI or UTF-8). I saw the OP's picture (using RStudio); maybe this is an RStudio issue?
Could you test it in RGui and Rterm first to see whether it works there, and then try RStudio?
Found it! Looks like a buffer length problem. This isn't limited to
Chinese, just more likely to happen when a character takes three bytes
to represent in UTF-8. (Any filename containing characters which take
more than one byte to represent in UTF-8 may fail.)
If a directory contains a file with a sufficiently long name,
FindNextFile() fails with ERROR_MORE_DATA (0xEA, 234), making
R_readdir() return NULL, stopping list_files() prematurely:
# everything seems to work fine...
list.files("????")
# [1] "????-non-utf8-?????
????????????????????????????????????????????????????.txt"
# [2] "????-non-utf8-?????.txt"
# [3] "????-utf-8.txt"
# now create a file with an even longer name
list.files("????")
# [1] "????-non-utf8-?????
????????????????????????????????????????????????????.txt"
# the files are still there, but not visible to list.files():
system("cmd /c dir /s *.txt")
# Volume in drive C has no label.
# Volume Serial Number is A85A-AA74
#
# Directory of C:\R\R-4.3.1\bin\x64\????
#
# 08/12/2023 07:57 AM 22 ????-non-utf8-?????
????????????????????????????????????????????????????.txt
# 08/12/2023 07:57 AM 22 ????-non-utf8-?????
????????????????????????????????????????????????????????????????????????????????????????????????????????.txt
# 08/12/2023 07:57 AM 22 ????-non-utf8-?????.txt
# 08/12/2023 07:56 AM 18 ????-utf-8.txt
# 4 File(s) 84 bytes
#
# Total Files Listed:
# 4 File(s) 84 bytes
# 0 Dir(s) 29,281,538,048 bytes free
# [1] 0
Increasing the path length limits [*] doesn't help, since it's the
filename length limit that we're bumping against. While both
WIN32_FIND_DATAA and WIN32_FIND_DATAW contain fixed-size buffers, a
valid filename may take more than MAX_PATH bytes to represent in UTF-8
while still being under the limit of MAX_PATH wide characters. This may
mean having to rewrite list_files in terms of R_wopendir()/R_wreaddir()
for Windows. As a workaround, we may use the short filename (which
sometimes may not exist, alas) when FindNextFile() fails with
ERROR_MORE_DATA.
Best regards,
Ivan
[*] https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation
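The gap between wide characters and UTF-8 bytes described in this message can be checked with a few lines of R. This is only a minimal sketch and not part of the original post: the sample characters are arbitrary, and reserving one byte of the 260-byte buffer for a terminating NUL is an assumption.

```r
# Sample characters of increasing UTF-8 width (arbitrary choices for illustration)
chars <- c("a", "\u00f8", "\u4e2d")

# Bytes needed per character once encoded as UTF-8
bytes <- nchar(enc2utf8(chars), type = "bytes")

# Assumption: one byte of the 260-byte buffer is reserved for the terminating
# NUL, so at most this many repetitions of each character would still fit
max_fit <- (260L - 1L) %/% bytes

data.frame(
  codepoint = sprintf("U+%04X", vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)),
  bytes = bytes,
  max_fit = max_fit
)
```

Under that assumption, a name made only of three-byte characters stops fitting long before it reaches MAX_PATH wide characters.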
Rterm.exe test result:
a = readline()
D:\Project_Delivery\
list.files(a,recursive = T)
[1] "2022(1).xlsx"
[2] "conf_custom_wf_wt_map_202308091545.csv"
[3] ".R"
[4] ".xlsx"
[5] "_.xlsx"
[6] ".xlsx"
[7] " (3).xlsx"
[8] "20230222113605379(1).xlsx"
[9] "_2022_20230811.docx"
None of the Chinese characters in the file names are printed. The results in RGui and RStudio are the same:
a = readline()
D:\Project_Delivery\??
list.files(a,recursive = T)
[1] "2022????????????????(1).xlsx" "conf_custom_wf_wt_map_202308091545.csv"
Just to rule it out... is it possible that R is listing these files
successfully, but is not printing the Chinese characters in those
names for some reason?
Using your example, what is the output of:
f <- list.files(a, recursive = T)
nchar(f)
Does the reported number of characters match what you see?
Best,
Kevin
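One way to separate "not listed" from "listed but not printed" is to inspect the returned names without relying on how the console renders them. The following is a rough sketch along the lines of the nchar() suggestion; describe_names() is a hypothetical helper, not an existing R function, and the sample names are made up.

```r
# Hypothetical helper: summarise file names independently of console rendering
describe_names <- function(f) {
  data.frame(
    chars    = nchar(f, type = "chars", allowNA = TRUE),  # characters, NA if invalid
    bytes    = nchar(f, type = "bytes"),                  # bytes as stored
    encoding = Encoding(f),                               # declared encoding mark
    hex      = vapply(
      f,
      function(x) paste(sprintf("%02x", as.integer(charToRaw(x))), collapse = " "),
      character(1),
      USE.NAMES = FALSE
    )
  )
}

# Made-up example names: one ASCII, one with two CJK characters
f <- c("report.xlsx", "\u62a5\u544a.xlsx")
info <- describe_names(f)
info
```

If the byte counts are larger than what the printed names suggest, the characters are present in the result and only the display is at fault.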
Thanks, Ivan, could you please turn this into a complete minimal reproducible example, ideally with only ASCII characters (if enough to trigger)? Or any reproducible example would do. I would have a look later today.
I admit I didn't get your analysis. However, I've rewritten this code for R 4.3 to support long paths (when enabled in the system); more in https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html.
As this was reported to be a regression in 4.3, it is entirely possible this change came with a regression (though it is a bit surprising we didn't catch it earlier by testing), so it would be a great help if I could have the example and debug it.
Thanks,
Tomas
On Tue, 15 Aug 2023 08:38:11 +0200, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> As this was reported to be a regression in 4.3, it is entirely possible this change came with a regression (though it is a bit surprising we didn't catch it earlier by testing), so it would be a great help if I could have the example and debug it.
Sorry, let me try to be more clear.
The Windows filename length limit is 255(?) wide characters. The
WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
takes more than one byte to be represented in UTF-8, it may overflow
the 260 byte limit in the WIN32_FIND_DATAA structure despite being
below the 260 wide character limit. When such an overflow happens,
FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
which results in R_readdir() returning NULL and makes list_files() stop
before listing the rest of the directory.
This is easier to make happen by accident with Chinese characters,
because they take three UTF-8 bytes per character.
Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8.
Create a file with a name consisting of this symbol repeated 140 times.
When you run list.files() on the resulting directory on Windows with a
UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
260-byte buffer, which doesn't work. I'm afraid the only way to avoid
such a failure is to rewrite R_readdir using the wide character API and
convert the file names on the fly. (Just like mingw readdir() did in
the past?)
stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
# any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
# any number >260/2 should do
file.create(strrep('\uf8', 140))
list.files()
Does this work? I don't have access to a UTF-8 Windows machine right
now.
Best regards, Ivan
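The reproduction above writes into the current working directory. A variant that confines itself to a scratch directory and cleans up afterwards might look like the sketch below. check_long_name_listing() is a hypothetical helper, not part of R, and it returns NA on systems other than Windows with a UTF-8 native encoding, where the scenario does not apply.

```r
# Hypothetical helper. Returns TRUE if every created file is listed, FALSE if
# some are missing, and NA when not on Windows with a UTF-8 native encoding.
check_long_name_listing <- function(ch = "\u00f8", reps = 140L) {
  if (.Platform$OS.type != "windows" || !isTRUE(l10n_info()[["UTF-8"]])) {
    return(NA)
  }
  d <- tempfile("longname-")  # scratch directory, removed on exit
  dir.create(d)
  on.exit(unlink(d, recursive = TRUE), add = TRUE)

  # Two short names plus one long name made of multi-byte characters
  wanted <- c("a.txt", strrep(ch, reps), "zz.txt")
  created <- file.create(file.path(d, wanted))

  # Compare what was actually created with what list.files() reports
  all(wanted[created] %in% list.files(d))
}

res <- check_long_name_listing()
res
```

A FALSE result on an affected build would mean that list.files() stopped before reporting every file that was created.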
Thanks, yes, I can reproduce the problem. Some Windows functions impose a limit of 260 wide characters, but others a limit of 260 bytes, so one can create a file with a name too long to be found by FindNextFileA.
In R 4.2, we used readdir() from mingw-w64, which itself used findnext. That had the same problem: it used a buffer of 260 bytes, and judging from the mingw-w64 code and the Windows documentation, it should have behaved the same way and stopped the search on such a long file name. However, in my test, R 4.2.3 crashed inside findnext due to a stack overrun. R 4.1.3 worked, but it would clearly take a different test case to overrun this buffer there, as it didn't use UTF-8.
This suggests that findnext didn't have a check for this and hence caused memory corruption, which can lead to a crash or appear to work by coincidence. That could have been the case for the user reporting this as a regression compared to R 4.2. But it is not a regression; the problem has existed for a long time.
So, yes, we'd probably have to use the wide variants of FindNext/FindFirst. I'll fix it.
Thanks for debugging this,
Tomas
Dear ???,
as discussed on this thread, Ivan Krylov found a bug in R which could be causing the problem you have run into. To confirm this is the cause, could you please check outside R (say, in Explorer) whether you have any file with a very long name in the directory? And if so, does moving that file away make the problem disappear? File names of up to 80 characters cannot trigger this bug.
A workaround for this bug is not to use file names with more than 80 (possibly all Chinese) characters.
The content of a file (or whether the content is in UTF-8 or not) cannot influence this problem directly: neither list.files() nor Windows looks into the files when listing them.
The bug Ivan found is not a regression: older versions of R may crash when you have such long file names. So there would be no point in staying with an older version to overcome this problem; the only reliable workaround I can think of is to use reasonably short file names.
Best,
Tomas
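For a list of names gathered outside list.files() (for example, copied from Explorer as text), the check suggested above could be roughly automated as follows. flag_long_names() is a hypothetical helper; the 80-character threshold comes from the message above, while treating 260 or more UTF-8 bytes as an overflow is an assumption about how the buffer is used.

```r
# Hypothetical helper: flag names that could hit the limits discussed in the thread
flag_long_names <- function(names, max_chars = 80L, buf_bytes = 260L) {
  names <- enc2utf8(names)
  chars <- nchar(names, type = "chars")
  bytes <- nchar(names, type = "bytes")
  data.frame(
    chars         = chars,
    bytes         = bytes,
    over_80_chars = chars > max_chars,   # beyond the "safe" length from the thread
    may_overflow  = bytes >= buf_bytes   # assumption: NUL terminator needs one byte
  )
}

# Made-up names: a short one, 81 CJK characters, and 90 CJK characters
candidates <- c("short.txt", strrep("\u4e2d", 81), strrep("\u4e2d", 90))
flags <- flag_long_names(candidates)
flags
```

Names flagged in the last column are the ones worth moving out of the directory first when testing whether the listing recovers.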
On 8/15/23 16:00, Tomas Kalibera wrote:
On 8/15/23 09:04, Ivan Krylov wrote:
? Tue, 15 Aug 2023 08:38:11 +0200 Tomas Kalibera <tomas.kalibera at gmail.com> ?????:
As this was reported to be regression in 4.3, it is entirely possible this change came with a regression (though a bit surprising we didn't catch it earlier by testing), so it would be a great help if I could have the example and debug it.
Sorry, let me try to be more clear.
The Windows filename length limit is 255(?) wide characters. The
WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
takes more than one byte to be represented in UTF-8, it may overflow
the 260 byte limit in the WIN32_FIND_DATAA structure despite being
below the 260 wide character limit. When such an overflow happens,
FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
which results in R_readdir() returning NULL and makes list_files() stop
before listing the rest of the directory.
This is easier to make happen by accident with Chinese characters,
because they take three UTF-8 bytes per character.
Take the ? (\uf8) letter. It takes two bytes to represent in UTF-8.
Create a file with a name consisting of this symbol repeated 140 times.
When you run list.files() on the resulting directory on Windows with a
UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
260-byte buffer, which doesn't work. I'm afraid the only way to avoid
such a failure is to rewrite R_readdir using the wide character API and
convert the file names on the fly. (Just like mingw readdir() did in
the past?)
stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
# any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
# any number >260/2 should do
file.create(strrep('\uf8', 140))
list.files()
Does this work? I don't have access to a UTF-8 Windows machine right
now.
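The byte arithmetic behind this reproducer can be checked on any platform without creating files. The following is only a sketch: the helper names `utf8_name_bytes` and `overflows` and the constant `ansi_buffer_bytes` are made up for illustration, and the 260-byte figure is the buffer size quoted above, with one byte assumed to be taken by the terminating NUL.

```r
# Hypothetical helper (not part of R): size of a file name in UTF-8 bytes.
utf8_name_bytes <- function(name) nchar(enc2utf8(name), type = "bytes")

# Buffer size quoted in the thread for WIN32_FIND_DATAA; an assumption here.
ansi_buffer_bytes <- 260L

name_2byte <- strrep("\u00f8", 140)  # o with stroke, 2 UTF-8 bytes each
name_3byte <- strrep("\u4e2d", 90)   # a CJK character, 3 UTF-8 bytes each

nchar(name_2byte, type = "chars")    # 140 characters
utf8_name_bytes(name_2byte)          # 280 bytes
utf8_name_bytes(name_3byte)          # 270 bytes

# TRUE when the UTF-8 form plus the terminating NUL no longer fits.
overflows <- function(name) utf8_name_bytes(name) + 1L > ansi_buffer_bytes
overflows(name_2byte)
overflows(name_3byte)
overflows("short.R")
```

Both long names stay well under 260 characters yet exceed 260 bytes, which is the situation described above; a name made of three-byte characters gets there with fewer characters.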
Thanks, yes, I can reproduce the problem. Some Windows functions impose a limit of 260 wide characters, but others a limit of 260 bytes, so one can create a file with a name too long to be found by FindNextFileA.
In R 4.2, we used readdir() from mingw-w64, which itself used findnext. That had the same problem: it used a buffer of 260 bytes, and judging from the mingw-w64 code and the Windows documentation it should have behaved the same way and stopped the search on such a long file name. However, in my use case R 4.2.3 crashed inside findnext due to a stack overrun. R 4.1.3 worked, but clearly it would require a different use case to overrun this buffer there, as it didn't use UTF-8. This suggests that findnext didn't have a check for this and hence caused memory corruption, which can lead to a crash or to working by coincidence. That could have been the case for the user reporting this as a regression compared to R 4.2. But it is not a regression; the problem has existed for a long time.
So, yes, we'd probably have to use the wide variants of FindNext/FindFirst. I'll fix.
Fixed in R-devel (84960). Please let me know if you see any problem with the fix. Thanks, Tomas
Thanks for debugging this, Tomas
A little more information on this issue: searching the Microsoft website today, I found the documentation on "Maximum Path Length Limitation": Maximum Path Length Limitation - Win32 apps | Microsoft Learn <https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry>. According to the doc, two things are needed to avoid this issue on Windows 10 and later:
1. Edit the registry or group policy to set [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem] "LongPathsEnabled"=dword:00000001
2. Opt in through the application manifest (R has already done this)
Regards, yu
On Wed, 16 Aug 2023 09:42:09 +0200
Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
Fixed in R-devel (84960). Please let me know if you see any problem with the fix.
Thank you for implementing the fix! I gave 叶月光 the link to the GitHub Action build of the r84960 installer.
I'm worried that 叶月光 was seeing FindNextFileA fail for a different reason (all the examples given at the Capital of Statistics forum seemed to use fewer than 256/4 = 64 characters per file name...), but maybe this won't reappear with the switch to FindNextFileW. If this keeps happening, it might be worth producing a warning when FindNextFileW() fails with an unexpected GetLastError() value.
fs::dir_ls() uses NtQueryDirectoryFile() and WideCharToMultiByte() instead of FindNextFileW() and wcstombs(), but maybe this shouldn't matter. In particular, both list.files() and fs::dir_ls() would fail given a file name that cannot be represented in UTF-8 (invalid UTF-16 surrogate pairs?)
Best regards, Ivan
On 8/16/23 13:11, yu gong wrote:
A little more information on this issue: searching the Microsoft website today, I found the documentation on "Maximum Path Length Limitation": Maximum Path Length Limitation - Win32 apps | Microsoft Learn <https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry>. According to the doc, two things are needed to avoid this issue on Windows 10 and later: 1. Edit the registry or group policy to set [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem] "LongPathsEnabled"=dword:00000001 2. Opt in through the application manifest (R has already done this)
These settings are for long paths (meaning a full path consisting of multiple elements separated by backslashes); more about that is in [1]. But the problem that Ivan reported (and it is not clear whether it is the same problem as the one originally reported on this thread) is about the limit for a single file/directory name, that is, for a single element of a path. Having long paths enabled in the registry wouldn't help with this. The two limits are not directly related, except for the obvious: by choosing rather long names for individual files, one usually soon runs into the limit for the full path.
Best
Tomas
[1] https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html
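To make the distinction between the two limits concrete, here is a rough sketch that measures a path both ways: per element and as a whole. The helper name `path_lengths` and the example path are hypothetical, and no particular limit value is assumed in the code; it only reports the lengths that the two kinds of limit apply to.

```r
# Hypothetical helper: length of each path element (in characters and in
# UTF-8 bytes) and of the whole path (in characters).
path_lengths <- function(path) {
  parts <- strsplit(path, "[/\\\\]+")[[1]]  # split on / or \
  parts <- parts[nzchar(parts)]
  list(
    element_chars = nchar(parts, type = "chars"),
    element_bytes = nchar(enc2utf8(parts), type = "bytes"),
    total_chars   = nchar(path, type = "chars")
  )
}

# Made-up path with one very long element built from a two-byte character.
p <- file.path("C:", "data", strrep("\u00f8", 140), "report.csv")
str(path_lengths(p))
```

In this example the whole path is 159 characters long, while its third element alone is 280 bytes in UTF-8. A single element can therefore be the problem even when the full path is nowhere near a long-path limit.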
On 8/16/23 13:22, Ivan Krylov wrote:
On Wed, 16 Aug 2023 09:42:09 +0200 Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
Fixed in R-devel (84960). Please let me know if you see any problem with the fix.
Thank you for implementing the fix! I gave 叶月光 the link to the GitHub Action build of the r84960 installer.
Thanks and thanks for looking at the change.
I'm worried that 叶月光 was seeing FindNextFileA fail for a different reason (all the examples given at the Capital of Statistics forum seemed to use fewer than 256/4 = 64 characters per file name...), but maybe this won't reappear with the switch to FindNextFileW. If this keeps happening, it might be worth producing a warning when FindNextFileW() fails with an unexpected GetLastError() value.
I've added a warning to R-devel when list.files() on Windows stops listing a directory due to an error. There is probably not much more we can do unless there is a revised bug report of the original problem.
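With such a warning in place, a script can refuse to use a listing that may be incomplete. The sketch below is an assumption-laden illustration: the wrapper name `list_files_checked` is made up, the wording of the new warning is not given in this thread, and so any warning raised by list.files() is treated as a sign of trouble.

```r
# Hypothetical wrapper: list a directory and record every warning raised
# while doing so, so that a possibly incomplete listing is not used silently.
list_files_checked <- function(path = ".", ...) {
  warnings_seen <- character()
  files <- withCallingHandlers(
    list.files(path, ...),
    warning = function(w) {
      warnings_seen <<- c(warnings_seen, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(files = files,
       complete = length(warnings_seen) == 0L,  # assumption: no warning, no truncation
       warnings = warnings_seen)
}

# Usage on a small temporary directory.
d <- tempfile("listing-")
dir.create(d)
file.create(file.path(d, c("a.R", "b.R")))
res <- list_files_checked(d)
res$files
res$complete
```

On R versions without the warning, `complete` says nothing about truncation, so this is only useful with builds that include the change described above.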
fs::dir_ls() uses NtQueryDirectoryFile() and WideCharToMultiByte() instead of FindNextFileW() and wcstombs(), but maybe this shouldn't matter. In particular, both list.files() and fs::dir_ls() would fail given a file name that cannot be represented in UTF-8 (invalid UTF-16 surrogate pairs?)
Right, R only supports file names that are valid strings; this assumption is present in many places in the code, so it is fine/consistent for it to hold here as well. The choice of opendir/readdir in R was probably motivated by minimizing platform-specific code.
Best
Tomas
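On the R side, whether a string is valid UTF-8 can be checked with base R's validUTF8(). This small sketch screens candidate names that are already in R before they are passed to file functions; the example names are made up, and it does not detect names on disk that cannot be represented at all.

```r
# The second element contains a byte (0xff) that never occurs in valid UTF-8.
candidates <- c("report.R", "bad\xff.R")

ok <- validUTF8(candidates)
ok
candidates[ok]  # keep only the names that are valid strings
```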