as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler wrote:
[...]
vQ> the first question is, why does ER return the string as const? it
vQ> appears that the returned pointer provides the address of a buffer used
vQ> internally in ER, which is allocated *statically*. that is, each call
vQ> to ER operates on the same memory location, and each call to ER returns
vQ> the address of that same location. i suspect this is intended to be a
vQ> smart optimization, to avoid heap- or stack-allocating a new buffer in
vQ> each call to ER, and deallocating it after use. however, this appraoch
vQ> is problematic, in that any two calls to ER return the address of the
vQ> same piece of memory, and this may easily lead to data corruption.
Well, that would be ok if R could be used "threaded" / parallel / ...
this can cause severe problems even without concurrency, as one of my examples hinted.
and we all know that there are many other pieces of code {not
just R's "own", but also in Fortran/C algorithms ..} that are
not thread-safe.
absolutely. again, ER is unsafe even in a sequential execution environment.
"Yes, of course", R looks like a horrible piece of software
telepathy?
to
some, because of that
vQ> under the assumption that the content of this piece of memory is copied
vQ> before any destructive use, and that after the string is copied the
vQ> address is not further distributed, the hack is relatively harmless.
vQ> this is what mkChar (via mkCharLenCE) does; in SFR it copies the
vQ> content of s with memcpy, and wraps it into a SEXP that becomes the
vQ> return value from SFR.
exactly.
but it should be made clear, by means of a comment, that ER is supposed to be used in this way. there is no hint at the interface level.
vQ> the original author of this hack seems to have had some concern about
vQ> exporting (from ER) the address of a static buffer, hence the returned
vQ> buffer is const. in principle, this should prevent corruption of the
vQ> buffer's content in situations such as
vQ> // hypothetical
vQ> char *p1 = ER(...);
vQ> // p1 is some string returned from ER
vQ> char p2 = ER(...);
vQ> // p2 is some other string returned from ER
vQ> // some modifications performed on the string referred to by p1
vQ> p1[0] = 'x';
vQ> // p2[0] is 'x' -- possible data corruption
vQ> still worse in a scenario with concurrent calls to ER.
(which will not happen in the near future)
unless you know a powerful and willing magician.
vQ> however, since the output from ER is const, this is no longer possible
vQ> -- at least, not without a deconstifying cast the petr style. the
vQ> problem with petr's solution is not only that it modifies shared memory
vQ> purposefully qualified as const (by virtue of ER's return type), but
vQ> also that it effectively distributes the address for further use.
vQ> unfortunately, like most of the r source code, ER is not appropriately
vQ> commented at the declaration and the definition, and without looking at
vQ> the code, one can hardly have any clue that ER always return the same
vQ> address of a static location. while the original developer might be
vQ> careful enough not to misuse ER, in a large multideveloper project it's
vQ> hard expect that from others. petr's function is precisely an example
vQ> of such misuse, and as it adds (again, without an appropriate comment) a
vQ> step of indirection; any use of petr's function other than what you have
vQ> in SFR (and can you guarantee no one will ever use DT for other
vQ> purposes?) is even more likely to end up in data corruption.
you have a point here, and as a consequence, I'm proposing to
put the following version of DT into the source :
------------------------------------------------------------------------
/* Note that we modify a 'const char*' which is unsafe in general,
* but ok in the context of filtering an Encode*() value into mkChar(): */
static const char* dropTrailing0(char *s, char cdec)
{
char *p = s;
for (p = s; *p; p++) {
if(*p == cdec) {
char *replace = p++;
while ('0' <= *p && *p <= '9')
if(*(p++) != '0')
replace = p;
while((*(replace++) = *(p++)))
;
break;
}
}
return s;
}
the first line appears inessential; to an informed programmer, taking a string as char* (as opposed to const char*) means that it *may* be modified within the call, irrespectively of whether it actually is, and on what occasions, and one should not assume the string is not destructively modified. i think it is much more appropriate to comment (a) ER, with a warning to the effect that it always returns the same address, hence the output should be used immediately and never written to, (b) the use of ER in SFR where it's output is cast to char* precisely for the purpose of destructive modification, to the contrary of what (a) says. i've attached a patch with an alternative comment.
------------------------------------------------------------------------ so it has a comment along the lines you suggest,
almost
*and* as static is not callable from outside coerce.c
indeed. unfortunately, ER is callable from throughout the place.
vQ> one simple way to improve the code is as follows; instead of (simplified)
vQ> const char* dropTrailing(const char* s, ...) {
vQ> const char *p = s;
vQ> char *replace;
vQ> ...
vQ> replace = (char*) p;
vQ> ...
vQ> return s; }
vQ> ...mkChar(dropTrailing(EncodeReal(...), ...) ...
vQ> you can have something like
vQ> const char* dropTrailing(char* s, ...) {
vQ> char *p = s, *replace;
vQ> ...
vQ> replace = p;
vQ> ...
vQ> return s; }
vQ> ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ...
vQ> where it is clear, from DT's signature, that it may (as it purposefully
vQ> does, in fact) modify the content of s. that is, you drop the
vQ> promise-not-to-modify contract in DT, and move the need for
vQ> deconstifying ER's return out of DT, making it more explicit.
vQ> however, this is still an ad hoc hack; it still breaks the original
vQ> developer's assumption (if i'm correct) that the return from ER
vQ> (pointing to its internal buffer) should not be destructively modified
vQ> outside of ER.
I think that's a misinterpretation of the history of ER.
"if i'm correct"
IIRC, the main issue when changing from 'char*' to 'const char*' in *many* places "simultaneously" was the introduction of hashes / cashes of strings (in mkChar(.)) in order to make R-level character handling (and storage of STRSXP/CHARSXP) much more efficient.
what does it have to do with ER returning a pointer to a static location?
vQ> (3) modify petr's solution along the lines above, i.e., have the input
vQ> in the signature non-const and deconst-cast the output from ER outside
vQ> of the call to DT.
that's what I have adopted, as I'm sure you've noticed when you
saw the code above.
surprisingly, sometimes i'm able to notice ;) best, vQ -------------- next part -------------- A non-text attachment was scrubbed... Name: coerce.diff Type: text/x-diff Size: 1106 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20090531/f692e569/attachment.bin>