RFC: hexadecimal constants and decimal points

9 messages · Gabor Grothendieck, Jan T. Kim, Brian Ripley +3 more

#
These are some points stimulated by reading about C history (and 
related in their implementation).


1) On some platforms

> as.integer("0xA")
[1] 10

but not all (not on Solaris nor Windows).  We do not define what is 
allowed, and rely on the OS's implementation of strtod (yes, not strtol). 
It seems that glibc does allow hex: C99 mandates it but C89 seems not to 
allow it.

I think that was a mistake, and strtol should have been used.  Then C89
does mandate the handling of hex constants and also octal ones.  So 
changing to strtol would change the meaning of as.integer("011").

Proposal: we handle this ourselves and define what values are acceptable,
namely for as.integer:

[+|-][0-9]+
NA
0[x|X][0-9A-Fa-f]+

in all cases such that the converted value is in-range.  (This does mean 
as.integer("1e+05") would be invalid, but is it clear that is allowed 
now?)
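
To make the proposal concrete, here is a minimal R-level sketch of what 
these rules would amount to (the helper name is hypothetical, and strtoi() 
merely stands in for the C-level conversion that would actually change):

  ## accept [+|-][0-9]+, NA, or 0[x|X][0-9A-Fa-f]+ -- nothing else
  as_integer_strict <- function(s) {
      if (is.na(s) || s == "NA") return(NA_integer_)
      if (grepl("^[+-]?[0-9]+$", s)) return(strtoi(s, base = 10L))
      if (grepl("^0[xX][0-9A-Fa-f]+$", s)) return(strtoi(s, base = 16L))
      warning("NAs introduced by coercion")
      NA_integer_
  }
  as_integer_strict("0xA")     # 10
  as_integer_strict("011")     # 11, not 9
  as_integer_strict("1e+05")   # NA with a warning under these rules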

For as.numeric(), probably the C99 rules (which include NaN, Inf, 
Infinity, and we need to add NA).

Alternatively, make and document the semantics to be
as.integer(as.numeric(char_string)) (which is effectively what we have 
now, although it causes surprises).
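
For example, the surprise with the current double-based route is that it 
quietly accepts scientific notation and fractional strings (plain base R):

  as.integer(as.numeric("1e+05"))   # 100000
  as.integer(as.numeric("3.9"))     # 3, truncated towards zero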

[As a side point, some locales may accept non-Roman digits.  I think we 
need to exclude those everywhere, not just some places like parsing.]


2) R does not have integer constants.  It would be convenient if it did, 
and I can see no difficulty in allowing the same conversions when parsing 
as when coercing.  This would have the side effect that 100 would be 
integer (but the coercion rules would come into play) but 
200000000000000000 would be double.  And x <- 0xce80 would be valid.
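
To illustrate the intended semantics (hypothetical behaviour under the 
proposal, not what current R does):

  typeof(100)                  # would become "integer"
  typeof(200000000000000000)   # stays "double" (out of integer range)
  x <- 0xce80                  # would parse, giving the integer 52864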


3) We do allow setting LC_NUMERIC, but it partially breaks R if the 
decimal point is not ".".  (I know of no locale in which it is not "." or 
",", and we cannot allow "," as part of numeric constants when parsing.) 
E.g.:

> Sys.setlocale("LC_NUMERIC", "fr_FR")
[1] "fr_FR"
Warning message:
setting 'LC_NUMERIC' may cause R to function strangely in:
setlocale(category, locale)
> x <- 3.12
> x
[1] 3
> as.numeric("3,12")
[1] 3,12
> as.numeric("3.12")
[1] NA
Warning message:
NAs introduced by coercion

We could do better by insisting that "." was the decimal point in all 
internal conversions _to_ numeric.  Then the effect of setting LC_NUMERIC 
would primarily be on conversions _from_ numeric, especially printing and 
graphical output.  (One issue would be what to do with scan(), which has a 
`dec' argument but is implemented assuming LC_NUMERIC=C.  I would hope to 
continue to have `dec' but perhaps with a locale-dependent default.)  The 
resulting asymmetry (R would not be able to parse its own output) would be 
unhappy, but seems inevitable. (This could be implemented easily by having 
a `dec' arg to EncodeReal and EncodeComplex, and using LC_NUMERIC to 
control that rather than actually setting the local category.  For 
example, deparsing needs to be done in LC_NUMERIC=C.)
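
Roughly, the asymmetry would look like this (sketched with formatting 
arguments that exist today, standing in for the proposed `dec' handling in 
EncodeReal and scan(); the formatC() decimal.mark argument and the 
scan(text =) form assume a reasonably recent R):

  as.numeric("3.12")              # conversion _to_ numeric: "." always accepted
  formatC(3.12, format = "f", digits = 2,
          decimal.mark = ",")     # conversion _from_ numeric: "3,12"
  scan(text = "3,12", dec = ",")  # scan() keeps its `dec' argument: 3.12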


All of these could be implemented by customized versions of 
strtod/strtol.
#
On 4/17/05, Prof Brian Ripley <ripley@stats.ox.ac.uk> wrote:
In the Windows batch language the following (translated to R):
       month <- substr("20050817", 5, 6)
must be further processed to remove any leading zero.  Mostly
people don't even realize this and just wind up writing erroneous
programs.  It's actually a big nuisance IMHO.
#
On Sun, Apr 17, 2005 at 12:38:10PM +0100, Prof Brian Ripley wrote:
I think interpretation of a leading "0" as a prefix indicating an octal
representation should indeed be avoided. People not familiar with C will
have a hard time understanding and getting used to this concept, and
in addition, it happens way too often that numeric data are provided left-
padded with zeros.
It can be a somewhat mixed blessing if the string representation of numeric
values contain information about their base, in the form of the 0x prefix
in this case.

The base argument (#3) of C's strtol function can be set to a base
explicitly or to 0, which gives the prefix-based "auto-selection"
behaviour. On the R level, such a base argument (to as.integer) could be
included and a default could be set.
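
At the R level, that could look something like this sketch (a hypothetical 
wrapper, only to make the base = 0 "auto-selection" idea concrete; strtoi() 
is used purely for illustration):

  int_base <- function(s, base = 10L) {
      if (base == 0L) {                            # strtol-style auto-selection
          if (grepl("^[+-]?0[xX]", s)) {
              base <- 16L
          } else if (grepl("^[+-]?0[0-7]+$", s)) {
              base <- 8L
          } else {
              base <- 10L
          }
      }
      strtoi(s, base = base)
  }
  int_base("0xA", base = 0L)   # 10  (prefix selects hexadecimal)
  int_base("011", base = 0L)   # 9   (leading zero selects octal)
  int_base("011")              # 11  (the "stubborn" decimal default)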

Personally, I would be equally happy with the default being 0 (auto-select)
or 10. Considering the perhaps limited spread of familiarity with C's
"0x" idiom, I somewhat favour a consistent and "stubborn" decimal behaviour
(base defaults to 10), though.

Best regards, Jan
#
On Sun, 17 Apr 2005, Jan T. Kim wrote:

A lot of this is internal, not at R level.
Some people already rely on it, and those who don't know about it are 
unlikely to ever enter what they think is an illegal value, surely?
#
I agree with this:  011 should be 11, it should not be 9.
As long as we document it, I think the 0x prefix is fine.

We should provide a way to use other bases on input and output.  This
could be through format specifiers, but it would be enough to have a pair
of dedicated functions to do the conversions.
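
For what it is worth, such a pair can be sketched with facilities available 
in later versions of R (shown only as an illustration of the idea, not as a 
committed interface):

  strtoi("ff", base = 16L)   # 255  -- string in a given base -> integer
  sprintf("%x", 255L)        # "ff" -- integer -> hexadecimal string
  sprintf("%o", 9L)          # "11" -- integer -> octal string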

Duncan Murdoch
#
>> On Sun, 17 Apr 2005, Jan T. Kim wrote:
>>
>>> On Sun, Apr 17, 2005 at 12:38:10PM +0100, Prof Brian Ripley wrote:
>>>> These are some points stimulated by reading about C history (and
>>>> related in their implementation).
>>>>
>>>>
>>>> 1) On some platforms
>>>>
>>>>> as.integer("0xA")
>>>> [1] 10
>>>>
>>>> but not all (not on Solaris nor Windows).  We do not define what is
>>>> allowed, and rely on the OS's implementation of strtod (yes, not
>>>> strtol).  It seems that glibc does allow hex: C99 mandates it but C89
>>>> seems not to allow it.
>>>>
>>>> I think that was a mistake, and strtol should have been used.  Then C89
>>>> does mandate the handling of hex constants and also octal ones.  So
>>>> changing to strtol would change the meaning of as.integer("011").
>>>
>>> I think interpretation of a leading "0" as a prefix indicating an octal
>>> representation should indeed be avoided. People not familiar with C will
>>> have a hard time understanding and getting used to this concept, and
>>> in addition, it happens way too often that numeric data are provided
>>> left-padded with zeros.

    Duncan> I agree with this:  011 should be 11, it should not be 9.

I agree (with Duncan and Jan).

I'm sure the current (decimal) behavior is implicitly used in
many places in people's code that reads text files and
manipulates them.

Martin
#
    BDR> These are some points stimulated by reading about C history (and
    BDR> related in their implementation).

    <.....>


    BDR> 2) R does not have integer constants.  It would be
    BDR> convenient if it did, and I can see no difficulty in
    BDR> allowing the same conversions when parsing as when
    BDR> coercing.  This would have the side effect that 100
    BDR> would be integer (but the coercion rules would come
    BDR> into play) but 200000000000000000 would be double.  And
    BDR> x <- 0xce80 would be valid.

Hmm, I'm not sure if this (parser change, mainly) is worth the
potential problems.  Of course you (Brian) know better than
anyone here, but when that change was implemented for S-plus, I think
Mathsoft (the predecessor of 'Insightful') also changed all
their legacy S code, translating every '<n>' to '<n>.', just in
order to make sure that things stayed backward compatible.
And, IIRC, they recommended that users do the same with their
own S source files. I had found this extremely ugly at the time,
but it was mandated by the fact that they didn't want to break
existing code which in some places did assume that e.g. '0' was
a double but became an integer in the new version of S-plus
{and e.g., as.double(.) became absolutely mandated before passing
 things to C  --- of course, using as.double(.) ``everywhere''
 before passing to C has been recommended for a long time, which
 didn't prevent people from relying on the current behavior (in R) that
 almost all numbers are double}.

We (or rather the less sophisticated members of the R community)
may get into similar problems when, e.g.,
matrix(0, 3, 4) suddenly produces an integer matrix instead of a
double precision one.
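
A concrete instance of the worry (hypothetical, assuming a plain 0 became 
an integer constant):

  m <- matrix(0, 3, 4)
  typeof(m)    # "double" today; "integer" under the proposal
  ## .C() interfaces expecting double would then need an explicit
  ## as.double(m) before the call, as noted above.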


    BDR> 3) We do allow setting LC_NUMERIC, but it partially breaks R if the 
    BDR> decimal point is not ".".  (I know of no locale in which it is not "." or 
    BDR> ",", and we cannot allow "," as part of numeric constants when parsing.) 
    BDR> E.g.:

    >> Sys.setlocale("LC_NUMERIC", "fr_FR")
    BDR> [1] "fr_FR"
    BDR> Warning message:
    BDR> setting 'LC_NUMERIC' may cause R to function strangely in: 
    BDR> setlocale(category, locale)
    >> x <- 3.12
    >> x
    BDR> [1] 3
    >> as.numeric("3,12")
    BDR> [1] 3,12
    >> as.numeric("3.12")
    BDR> [1] NA
    BDR> Warning message:
    BDR> NAs introduced by coercion

    BDR> We could do better by insisting that "." was the decimal point in all 
    BDR> internal conversions _to_ numeric.  Then the effect of setting LC_NUMERIC 
    BDR> would primarily be on conversions _from_ numeric, especially printing and 
    BDR> graphical output.  (One issue would be what to do with scan(), which has a 
    BDR> `dec' argument but is implemented assuming LC_NUMERIC=C.  I would hope to 
    BDR> continue to have `dec' but perhaps with a locale-dependent default.)  The 
    BDR> resulting asymmetry (R would not be able to parse its own output) would be 
    BDR> unhappy, but seems inevitable. (This could be implemented easily by having 
    BDR> a `dec' arg to EncodeReal and EncodeComplex, and using LC_NUMERIC to 
    BDR> control that rather than actually setting the local category.  For 
    BDR> example, deparsing needs to be done in LC_NUMERIC=C.)

Yes, I like this quite a bit:

 -  Only allow "." as decimal point in conversions to numeric.

 -  Allowing "," (or other locale settings, if there are any) for
    conversions _from_ numeric will be very attractive to some
    (not to me) and will make the use of R's ``reporting
    facility'' much more natural to them.

  The asymmetry is a bit unhappy -- but that will be a good reason
  to advocate (to the user community) that using "," for the decimal
  point may be a bad idea in general.

Martin Maechler
ETH Zurich

    BDR> All of these could be implemented by customized versions of 
    BDR> strtod/strtol.

    BDR> -- 
    BDR> Brian D. Ripley,                  ripley@stats.ox.ac.uk
    BDR> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
    BDR> University of Oxford,             Tel:  +44 1865 272861 (self)
    BDR> 1 South Parks Road,                     +44 1865 272866 (PA)
    BDR> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
#
Martin Maechler <maechler@stat.math.ethz.ch> writes:
Could I suggest that we tread very carefully here? This issue has
caused several trip-ups historically:

- The locale-dependent "comma-separated values" format, in some
  cases not separated by commas. And it seems that you can still get
  Excel files that use comma both for separation and as decimal point
  (I thought that problem disappeared with early versions of Paradox,
  but apparently not, according to a recent query on r-help).

- Exports from SAS as a text file cannot be read by SPSS and vice
  versa.

etc. Quite possibly, the "computer world" missed the opportunity to
agree on an international standard (what's the big deal with using
commas anyway?). As it is we probably have to adjust to it, but we
have to distinguish very carefully between reports, code, and data,
and choose appropriate conventions for each case.
#
On Mon, 18 Apr 2005, Peter Dalgaard wrote:

I was treading _very_ carefully.  Nowhere did I suggest altering any of
write.table and friends.  I did not even suggest altering read.table.
I tentatively suggested the default in scan() might be locale-specific,
but was otherwise leaving import/export completely alone.

The aim is to allow people to have commas in printed output and graph 
labels if they want.  Note, nothing would be done unless people explicitly 
did something like Sys.setlocale("LC_NUMERIC", "fr_FR") so this would not 
affect naive users in any way.

Brian