"vQ" == Wacek Kusnierczyk <waku at idi.ntnu.no>
on Tue, 21 Apr 2009 13:05:11 +0200 (CEST) writes:
vQ> Full_Name: Wacek Kusnierczyk
vQ> Version: 2.10.0 r48365
vQ> OS: Ubuntu 8.04 Linux 32bit
vQ> Submission from: (NULL) (129.241.110.141)
vQ> sprintf has a documented limit on strings included in the output using the
vQ> format '%s'. It appears that there is a limit on the length of strings included
vQ> with, e.g., the format '%d' beyond which surprising things happen (output
vQ> modified for conciseness):
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602 11403
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602 11403 12204
vQ> ...
vQ> Note that not only more than one '1' is included in the output, but also that
vQ> the same functional expression (no side effects used beyond the interface) gives
vQ> different results on each execution. Analogous behaviour can be observed with
vQ> '%nd' where n > 8200.
vQ> The actual output above is consistent across separate sessions.
vQ> With sufficiently large field width values, R segfaults:
vQ> sprintf('%*d', 10^5, 1)
vQ> # *** caught segfault ***
vQ> # address 0xbfcfc000, cause 'memory not mapped'
vQ> # Segmentation fault
Thank you, Wacek.
That's all ``interesting'' ... unfortunately,
my version of 'man 3 sprintf' contains
BUGS
Because sprintf() and vsprintf() assume an arbitrarily
long string, callers must be careful not to overflow the
actual space; this is often impossible to assure. Note
that the length of the strings produced is
locale-dependent and difficult to predict. Use
snprintf() and vsnprintf() instead (or asprintf() and vasprintf).
(note the "impossible" part above)
and we haven't used snprintf() yet, probably because it
requires the C99 C standard, and AFAIK, we have only relatively
recently started to more or less rely on C99 in the R sources.
More precisely, I see that some windows-only code relies on
snprintf() being available whereas in at least one non-Windows
section, I read /* we cannot assume snprintf here */
Now such platform dependency issues and corresponding configure
settings I do typically leave to other R-corers with a much
wider overview about platforms and their compilers and C libraries.
BTW,
1) sprintf("%n %g", 1,1) also seg.faults
2) Did you have a true use case where the 8192 limit was an
undesirable limit?
Martin
vQ> sessionInfo()
vQ> # R version 2.10.0 Under development (unstable) (2009-04-20 r48365)
vQ> # i686-pc-linux-gnu
vQ> sprintf has a documented limit on strings included in the output using the
vQ> format '%s'. It appears that there is a limit on the length of strings included
vQ> with, e.g., the format '%d' beyond which surprising things happen (output
vQ> modified for conciseness):
... and this limit is *not* documented.
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602 11403
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602 11403 12204
vQ> ...
vQ> Note that not only more than one '1' is included in the output, but also that
vQ> the same functional expression (no side effects used beyond the interface) gives
vQ> different results on each execution. Analogous behaviour can be observed with
vQ> '%nd' where n > 8200.
vQ> The actual output above is consistent across separate sessions.
vQ> With sufficiently large field width values, R segfaults:
vQ> sprintf('%*d', 10^5, 1)
vQ> # *** caught segfault ***
vQ> # address 0xbfcfc000, cause 'memory not mapped'
vQ> # Segmentation fault
Thank you, Wacek.
That's all ``interesting'' ... unfortunately,
my version of 'man 3 sprintf' contains
BUGS
Because sprintf() and vsprintf() assume an arbitrarily
long string, callers must be careful not to overflow the
actual space; this is often impossible to assure. Note
that the length of the strings produced is
locale-dependent and difficult to predict. Use
snprintf() and vsnprintf() instead (or asprintf() and vasprintf).
yes, but this is c documentation, not r documentation. it's applicable
to a degree, since ?sprintf does say that sprintf is "a wrapper for the
C function 'sprintf'". however, in c you use a buffer and you usually
have control over its capacity, while in r this is a hidden
implementation detail, which should not be visible to the user, or
should cause an attempt to overflow the buffer to fail more gracefully
than with a segfault.
in r, sprintf('%9000d', 1) will produce confused output, with a count
of 1's that varies (!) across runs (while sprintf('%*d', 9000, 1) seems to
do fine):
gregexpr('1', sprintf('%*d', 9000, 1))
# [1] 9000
gregexpr('1', sprintf('%9000d', 1))
# [1] 9000 9801 ..., variable across executions
on one execution in a series i actually got this:
Warning message:
In gregexpr("1", sprintf("%9000d", 1)) :
input string 1 is invalid in this locale
while the very next execution, still in the same session, gave
# [1] 9000 9801 10602
with sprintf('%*d', 10000, 1) i got segfaults on some executions but
correct output on others, while sprintf('%10000d', 1) is confused again.
(note the "impossible" part above)
yes, but it does also say "must be careful", and it seems that someone
has not been careful enough.
and we haven't used snprintf() yet, probably because it
requires the C99 C standard, and AFAIK, we have only relatively
recently started to more or less rely on C99 in the R sources.
while snprintf would help avoid buffer overflow, it may not be a
solution to the issue of confused output.
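
For illustration only, here is a minimal C sketch of how snprintf() can
size a buffer before writing into it. It is not taken from
src/main/sprintf.c; the helper name format_int_width is hypothetical. It
relies on the C99 behaviour that snprintf() returns the number of
characters the complete output would need, so truncation can be detected
instead of silently overflowing a fixed buffer. Whether this would also
cure the confused output reported above is not established in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from the R sources): format one int with a
   runtime field width into a freshly allocated buffer.
   Returns NULL on failure; the caller must free() the result. */
char *format_int_width(int width, int value)
{
    /* First pass: with a size of 0, C99 snprintf() writes nothing and
       returns the number of characters the full output would need. */
    int needed = snprintf(NULL, 0, "%*d", width, value);
    if (needed < 0)
        return NULL;

    char *buf = malloc((size_t)needed + 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: the buffer is now known to be large enough. */
    int written = snprintf(buf, (size_t)needed + 1, "%*d", width, value);
    assert(written == needed);
    return buf;
}
```

A caller could use format_int_width(9000, 1) and would get a 9000-character
string with a single '1' at the end, independent of any fixed internal
buffer size.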
More precisely, I see that some windows-only code relies on
snprintf() being available whereas in at least one non-Windows
section, I read /* we cannot assume snprintf here */
Now such platform dependency issues and corresponding configure
settings I do typically leave to other R-corers with a much
wider overview about platforms and their compilers and C libraries.
it looks like src/main/sprintf.c is just buggy, and it's plausible that
the bug could be repaired in a platform-independent manner.
BTW,
1) sprintf("%n %g", 1,1) also seg.faults
as do
sprintf('%n%g', 1, 1)
sprintf('%n%')
etc., while
sprintf('%q%g', 1, 1)
sprintf('%q%')
work just fine. strange, because per ?sprintf 'n' is not recognized as
a format specifier, so the output from the first two above should be as
from the last two above, respectively. (and likewise in the %S case,
discussed and bug-reported earlier.)
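
A possible explanation for the difference, offered as an assumption and
not confirmed in this thread: in C, %n is a real conversion that stores
the number of characters written so far through an int * argument, so
handing it to C's sprintf() with a non-pointer argument is undefined
behaviour, whereas %q has no such side effect. The sketch below shows one
way a wrapper could reject unsupported conversions up front. The function
name and the whitelist are hypothetical; the real set of accepted
conversions is whatever ?sprintf documents.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed whitelist for illustration only. */
static const char ALLOWED_CONVERSIONS[] = "aAdifeEgGosxX%";

/* Hypothetical validator: returns true if every '%' in fmt is followed
   by optional flags/width/precision and then an allowed conversion. */
bool format_is_supported(const char *fmt)
{
    assert(fmt != NULL);
    for (const char *p = fmt; *p != '\0'; p++) {
        if (*p != '%')
            continue;
        p++;
        /* Skip flags, field width, precision, '*' and '$' markers. */
        while (*p != '\0' && strchr("-+ #0123456789.*$", *p) != NULL)
            p++;
        /* A '%' with nothing after it, or an unlisted conversion
           such as 'n' or 'q', is rejected. */
        if (*p == '\0' || strchr(ALLOWED_CONVERSIONS, *p) == NULL)
            return false;
    }
    return true;
}
```

With such a check, '%n %g', '%n%', '%q%g' and '%q%' would all be rejected
in the same way, before any of them reaches the C library.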
2) Did you have a true use case where the 8192 limit was an
undesirable limit?
how does it matter? if you set a limit, be sure to consistently enforce
it and warn the user on attempts to exceed it. or write clearly in the
docs that such attempts will cause the output to be silently truncated.
examples such as
sprintf('%9000d', 1)
do not contribute to the reliability of r, nor to the user's
confidence in it.
vQ