type.convert and doubles
30 messages · Gregory R. Warnes, Paul Gilbert, Gabor Grothendieck +10 more
Greg,
On Apr 11, 2014, at 11:50 AM, Gregory R. Warnes <greg at warnes.net> wrote:
Hi All, I see this in the NEWS for R 3.1.0: type.convert() (and hence by default read.table()) returns a character vector or factor when representing a numeric input as a double would lose accuracy. Similarly for complex inputs. This behavior seems likely to surprise users.
Can you elaborate why that would be surprising? It is consistent with the intention of type.convert() to determine the correct type to represent the value - it has always used character/factor as a fallback where native type doesn't match. It has never issued any warning in that case historically, so IMHO it would be rather surprising if it did now? Cheers, Simon
Would it be possible to issue a warning when this occurs? Aside: I'm very happy to see the new 's' and 'f' browser (debugger) commands! -Greg
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
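Simon's description of the fallback can be illustrated with a minimal sketch (not from the thread; `as.is` is passed explicitly because newer versions of R warn when it is missing):

```r
# type.convert() picks the narrowest type that can represent every
# non-missing value, falling back to character (or factor) at the end.
class(type.convert(c("TRUE", "FALSE"), as.is = TRUE))  # "logical"
class(type.convert(c("1", "2"), as.is = TRUE))         # "integer"
class(type.convert(c("1.5", "2"), as.is = TRUE))       # "numeric"
class(type.convert(c("1.5", "apple"), as.is = TRUE))   # "character"
```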
On 04/11/2014 01:43 PM, Simon Urbanek wrote:
Can you elaborate why that would be surprising? It is consistent with the intention of type.convert() to determine the correct type to represent the value - it has always used character/factor as a fallback where native type doesn't match.
Strictly speaking, I don't think this is true. If it were, it would not have been necessary to make the change so that it now falls back to using character/factor. It may, however, have always been the intent. I don't really think a warning is necessary, but there are some surprises:

> str(type.convert(format(1/3, digits=17))) # R-3.0.3
 num 0.333
> str(type.convert(format(1/3, digits=17))) # R-3.1.0
 Factor w/ 1 level "0.33333333333333331": 1

Now you could say that one should never do that, and the change is just flushing out a bug that was always there. But the point is that in serialization situations there can be some surprises. So, for example, RODBC talking to PostgreSQL databases is now returning factors rather than numerics for double precision fields, whereas with RPostgreSQL the behaviour has not changed.

Paul
5 days later
Hi, As Greg suggested, this new feature in type.convert certainly did surprise one user (me), enough so that I had to downgrade back to 3.0.3 until our code was modified to handle the new behavior.

Here's my use case: I have a function that pulls arbitrary financial data from a web service call, such as a stock's industry, price, volume, etc., by reading the web output as a text table. The data may be either character (industry, stock name, etc.) or numeric (price, volume, etc.), and the function generally doesn't know the class in advance. The problem is that we frequently get numeric values represented with more precision than actually exists, for instance a price of "2.6999999999999999" rather than "2.70". The numeric representation is exactly one digit too much for type.convert, which (in R 3.1.0) converts it to character instead of numeric (not what I want). This caused a bunch of "non-numeric argument to binary operator" errors to appear today as numeric data was now being represented as characters. I have no doubt that this probably will cause some unwanted RODBC side effects for us as well.

IMO, getting the class right is more important than infinite precision. What use is a character representation of a number anyway if you can't perform arithmetic on it? I would favor at least making the new behavior optional, but I think many packages (like RODBC) potentially need to be patched to code around the new feature if it's left in. (This aside, thanks for all the nice features and bug fixes in the new version!)

Cheers, Robert
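A sketch (not Robert's code) of why "2.6999999999999999" trips the new check: the string does not survive a round trip through a double, so R 3.1.0 concludes that representing it as numeric would lose accuracy.

```r
# The closest double to 2.6999999999999999 prints back differently,
# so the conversion is deemed lossy and the value is left as
# character/factor under the R 3.1.0 behavior.
x <- "2.6999999999999999"
format(as.numeric(x), digits = 17)  # prints "2.6999999999999997", not x
as.numeric(x) == 2.7                # FALSE: it parses to the double below 2.7
```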
> str(type.convert(format(1/3, digits=17))) # R-3.0.3
 num 0.333
> str(type.convert(format(1/3, digits=17))) # R-3.1.0
 Factor w/ 1 level "0.33333333333333331": 1
It is bizarre that it makes a factor rather than a string. 0.333333333333 is pretty obviously not a categorical value. Hadley
On 17/04/2014 9:42 AM, McGehee, Robert wrote:
Hi, As Greg suggested, this new feature in type.convert certainly did surprise one user (me), enough so that I had to downgrade back to 3.0.3 until our code was modified to handle the new behavior.
I don't have an opinion on this particular change, but one way to avoid surprises like this is to test releases when they become available. For 3.1.0, the alpha became available on March 13. If you are especially eager, you can follow the news feed at http://developer.r-project.org/blosxom.cgi/R-devel/NEWS. This particular change was announced there more than a year ago (http://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2013/03/19), and has shown up several times since as minor edits have been made to the announcement. Duncan Murdoch
On Thu, Apr 17, 2014 at 6:42 AM, McGehee, Robert
<Robert.McGehee at geodecapital.com> wrote:
> [...] IMO, getting the class right is more important than infinite precision. What use is a character representation of a number anyway if you can't perform arithmetic on it? I would favor at least making the new behavior optional [...]
The uses of character representation of a number are many: unique
identifiers/user ids, hash codes, timestamps, or other values where
rounding results to the nearest value that can be represented as a
numeric type would completely change the results of any data analysis
performed on that data.
Database join operations are certainly an area where R's previous
behavior of silently dropping precision of numbers with type.convert
can get you into trouble. For example, things like join operations or
group by operations performed in R code would produce erroneous
results if you are joining/grouping by a key without the full
precision of your underlying data. Records can get joined up
incorrectly or aggregated with the wrong groups.
If you later want to do arithmetic on them, you can choose to lose
precision by using as.numeric() or use one of the large number
packages on CRAN (GMP, int64, bit64, etc.). But once you've dropped
the precision with as.numeric you can never get it back, which is why
the previous behavior was clearly dangerous.
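Murray's point about irreversibility can be seen with a small sketch (the numbers here are chosen for illustration): above 2^53 not every integer is exactly representable as a double, so distinct identifiers can collapse to the same numeric value.

```r
# Two distinct string ids that map to the same IEEE-754 double:
a <- "9007199254740992"  # 2^53
b <- "9007199254740993"  # 2^53 + 1 (not representable; rounds to 2^53)
identical(a, b)                 # FALSE as strings
as.numeric(a) == as.numeric(b)  # TRUE as doubles -- the distinction is gone
```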
I think I had some additional examples in the original bug/patch I
filed about this issue a few years ago, but I'm unable to find it on
bugs.r-project.org and it's not referenced in the cl descriptions or
news file.
- Murray
On Thu, Apr 17, 2014 at 2:21 PM, Murray Stokely <murray at stokely.org> wrote:
If you later want to do arithmetic on them, you can choose to lose precision by using as.numeric() or use one of the large number packages on CRAN (GMP, int64, bit64, etc.). But once you've dropped the precision with as.numeric you can never get it back, which is why the previous behavior was clearly dangerous.
Only if you knew that that column was supposed to be numeric. There is nothing in type.convert or read.table to allow you to override how it works (colClasses only works if you knew which columns are which in the first place) nor is there anything to allow you to know which columns were affected so that you know which columns to look at to fix it yourself afterwards.
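Gabor's caveat about colClasses can be sketched as follows (the file contents are hypothetical): it works, but only when you already know which column is which.

```r
# Forcing the class up front with colClasses sidesteps the type
# heuristic entirely, at the cost of knowing the schema in advance.
tf <- tempfile(fileext = ".csv")
writeLines(c("ticker,price", "XYZ,2.6999999999999999"), tf)
df <- read.csv(tf, colClasses = c(ticker = "character", price = "numeric"))
class(df$price)  # "numeric", regardless of the precision check
```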
On Thu, Apr 17, 2014 at 2:35 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Only if you knew that that column was supposed to be numeric. There is
The columns that are "supposed" to be numeric are those that can fit into a numeric data type. Previously that was not always the case: columns that could not be represented as a numeric were erroneously coerced into a truncated/rounded numeric.
nothing in type.convert or read.table to allow you to override how it works (colClasses only works if you knew which columns are which in the first place) nor is there anything to allow you to know which columns were affected so that you know which columns to look at to fix it yourself afterwards.
You want a casting operation in your SQL query or similar if you want
a rounded type that will always fit in a double. Cast or Convert
operators in SQL, or similar for however you are getting the data you
want to use with type.convert(). This is all application specific and
sort of beyond the scope of type.convert(), which now behaves as it
has been documented to behave.
In my code for this kind of thing, however, I have typically introduced
an option() to let the user control casting behavior for e.g. 64-bit
ints in C++. Should they be returned as truncated precision numeric
types or the full precision data in a character string representation?
In the RProtoBuf package we let the user specify an option() to choose which behavior they need for their application, rather than always returning the safer character representation and making them coerce to numeric themselves.
- Murray
On 04/17/2014 02:21 PM, Murray Stokely wrote:
On Thu, Apr 17, 2014 at 6:42 AM, McGehee, Robert <Robert.McGehee at geodecapital.com> wrote:
The uses of character representation of a number are many: unique identifiers/user ids, hash codes, timestamps, or other values where rounding results to the nearest value that can be represented as a numeric type would completely change the results of any data analysis performed on that data. Database join operations are certainly an area where R's previous behavior of silently dropping precision of numbers with type.convert can get you into trouble. For example, things like join operations or group by operations performed in R code would produce erroneous results if you are joining/grouping by a key without the full precision of your underlying data. Records can get joined up incorrectly or aggregated with the wrong groups.
I don't understand this. Assuming you are sending the SQL statement to the database engine, none of this erroneous matching is happening in R. The calculations all happen on the database. But, for the case where the database does know that numbers are double precision, it would be nice if they got transmitted by ODBC to R as numerics (the usual translation), just as they are by the native interfaces like RPostgreSQL. Do you get the erroneous results when you use a native interface? (From the second response:)
You want a casting operation in your SQL query or similar if you want a rounded type that will always fit in a double. Cast or Convert operators in SQL, or similar for however you are getting the data you want to use with type.convert(). This is all application specific and sort of beyond the scope of type.convert(), which now behaves as it has been documented to behave.
This seems to suggest I need to use different SQL statements depending on which interface I use to talk to the database. If you do 1/3 in a database calculation and that ends up being represented as something more accurate than double precision on the database, then it needs to be transmitted as something with higher precision (character/factor?). If the result is double precision it should be sent as double precision, not as something pretending to be more accurate. I suspect the difficulty with ODBC may be that type.convert() really should not be called when both ends of the communication know that a double precision number is being exchanged. Paul
This is all application specific and sort of beyond the scope of type.convert(), which now behaves as it has been documented to behave.
That's only a true statement because the documentation was changed to reflect the new behavior! The new feature in type.convert certainly does not behave according to the documentation as of R 3.0.3. Here's a snippet:
The first type that can accept all the
non-missing values is chosen (numeric and complex return values
will represented approximately, of course).
The key phrase is in parentheses, which reminds the user to expect a possible loss of precision. That important parenthetical was removed from the documentation in R 3.1.0 (among other changes).
Putting aside the fact that this introduces a large amount of unnecessary work rewriting SQL / data import code, SQL packages, my biggest conceptual problem is that I can no longer rely on a particular function call returning a particular class. In my example querying stock prices, about 5% of prices came back as factors and the remaining 95% as numeric, so we had random errors popping in throughout the morning.
Here's a short example showing us how the new behavior can be unreliable. I pass a character representation of a uniformly distributed random variable to type.convert. 90% of the time it is converted to "numeric" and 10% it is a "factor" (in R 3.1.0). In the 10% of cases in which type.convert converts to a factor the leading non-zero digit is always a 9. So if you were expecting a numeric value, then 1 in 10 times you may have a bug in your code that didn't exist before.
> options(digits=16)
> cl <- NULL; for (i in 1:10000) cl[i] <- class(type.convert(format(runif(1))))
> table(cl)
cl
 factor numeric
    990    9010

Cheers, Robert
1 day later
McGehee, Robert <Robert.McGehee at geodecapital.com>
on Thu, 17 Apr 2014 19:15:47 -0400 writes:
>> This is all application specific and
>> sort of beyond the scope of type.convert(), which now behaves as it
>> has been documented to behave.
> That's only a true statement because the documentation was changed to reflect the new behavior! The new feature in type.convert certainly does not behave according to the documentation as of R 3.0.3. Here's a snippit:
> The first type that can accept all the
> non-missing values is chosen (numeric and complex return values
> will represented approximately, of course).
> The key phrase is in parentheses, which reminds the user to expect a possible loss of precision. That important parenthetical was removed from the documentation in R 3.1.0 (among other changes).
> Putting aside the fact that this introduces a large amount of unnecessary work rewriting SQL / data import code, SQL packages, my biggest conceptual problem is that I can no longer rely on a particular function call returning a particular class. In my example querying stock prices, about 5% of prices came back as factors and the remaining 95% as numeric, so we had random errors popping in throughout the morning.
> Here's a short example showing us how the new behavior can be unreliable. I pass a character representation of a uniformly distributed random variable to type.convert. 90% of the time it is converted to "numeric" and 10% it is a "factor" (in R 3.1.0). In the 10% of cases in which type.convert converts to a factor the leading non-zero digit is always a 9. So if you were expecting a numeric value, then 1 in 10 times you may have a bug in your code that didn't exist before.
>> options(digits=16)
>> cl <- NULL; for (i in 1:10000) cl[i] <- class(type.convert(format(runif(1))))
>> table(cl)
> cl
> factor numeric
> 990 9010
Yes.
Murray's point is valid, too.
But in my view, with the reasoning we have seen here,
*and* with the well known software design principle of
"least surprise" in mind,
I also do think that the default for type.convert() should be what
it has been for > 10 years now.
Martin
On Apr 19, 2014, at 9:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
Yes. Murray's point is valid, too. But in my view, with the reasoning we have seen here, *and* with the well known software design principle of "least surprise" in mind, I also do think that the default for type.convert() should be what it has been for > 10 years now.
I think there should be two separate discussions:

a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this.

b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions.

But most importantly I think a) is better than the status quo - even if the discussion about b) drags out. Cheers, Simon
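A sketch of what option (a) might look like as an argument (this mirrors the `numerals` argument that R 3.1.1 eventually added to type.convert(); at the time of this thread the exact names are hypothetical):

```r
# Hypothetical control over lossy conversion, in the shape R 3.1.1
# later adopted: numerals = "allow.loss", "warn.loss", or "no.loss".
x <- "2.6999999999999999"
class(type.convert(x, as.is = TRUE, numerals = "allow.loss"))  # numeric, rounds
class(type.convert(x, as.is = TRUE, numerals = "no.loss"))     # kept as character
```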
On Sat, Apr 19, 2014 at 1:06 PM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
On Apr 19, 2014, at 9:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote: [...] I think there should be two separate discussions: a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. b) decide what the default for a) will be. [...]
Another possibility is:

(c) Return the column as factor/character but with a distinguishing class so that the user can reset its class later, e.g.

DF <- read.table(...)
DF[] <- lapply(DF, function(x) if (inherits(x, "special.class")) as.numeric(x) else x)

Personally I would go with (a) in both type.convert and read.table with a default that reflects the historical behavior rather than the current 3.1 behavior.
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
Yes, I'm also strongly in favor of having an option for this. If
there was an option in base R for controlling this we would just use
that and get rid of the separate RProtoBuf.int64AsString option we use
in the RProtoBuf package on CRAN to control whether 64-bit int types
from C++ are returned to R as numerics or character vectors.
I agree that reasonable people can disagree about the default, but I
found my original bug report about this, so I will counter Robert's
example with my favorite example of what was wrong with the previous
behavior:
tmp <- data.frame(n = c("72057594037927936", "72057594037927937"),
                  name = c("foo", "bar"))
length(unique(tmp$n))
# 2
write.csv(tmp, "/tmp/foo.csv", quote=FALSE, row.names=FALSE)
data <- read.csv("/tmp/foo.csv")
length(unique(data$n))
# 1
- Murray
On Sat, Apr 19, 2014 at 10:06 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
On Apr 19, 2014, at 9:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
McGehee, Robert <Robert.McGehee at geodecapital.com> on Thu, 17 Apr 2014 19:15:47 -0400 writes:
This is all application specific and sort of beyond the scope of type.convert(), which now behaves as it has been documented to behave.
That's only a true statement because the documentation was changed to reflect the new behavior! The new feature in type.convert certainly does not behave according to the documentation as of R 3.0.3. Here's a snippit:
The first type that can accept all the non-missing values is chosen (numeric and complex return values will represented approximately, of course).
The key phrase is in parentheses, which reminds the user to expect a possible loss of precision. That important parenthetical was removed from the documentation in R 3.1.0 (among other changes).
Putting aside the fact that this introduces a large amount of unnecessary work rewriting SQL / data import code, SQL packages, my biggest conceptual problem is that I can no longer rely on a particular function call returning a particular class. In my example querying stock prices, about 5% of prices came back as factors and the remaining 95% as numeric, so we had random errors popping in throughout the morning.
Here's a short example showing us how the new behavior can be unreliable. I pass a character representation of a uniformly distributed random variable to type.convert. 90% of the time it is converted to "numeric" and 10% it is a "factor" (in R 3.1.0). In the 10% of cases in which type.convert converts to a factor the leading non-zero digit is always a 9. So if you were expecting a numeric value, then 1 in 10 times you may have a bug in your code that didn't exist before.
options(digits=16)
cl <- NULL; for (i in 1:10000) cl[i] <- class(type.convert(format(runif(1))))
table(cl)
cl
 factor numeric
    990    9010
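The leading-9 pattern has a tidy arithmetic reading, though this is an inference from the reported numbers rather than anything quoted from R's sources: a 16-digit significand sits exactly in a double's 53-bit mantissa only if, read as an integer, it is at most 2^53 = 9007199254740992, so 16-digit numerals above 0.9007199254740992 are precisely the ones at risk, and P(runif > 0.90072) is about 0.099, matching the 990/10000 split. A sketch of that test (illustrative Python, not R's implementation):

```python
def would_lose_accuracy(numeral):
    """Flag a decimal numeral whose significant digits, read as an
    integer, exceed 2**53 and so cannot sit exactly in a double's
    mantissa."""
    digits = numeral.strip().lstrip("+-").split("e")[0].replace(".", "")
    digits = digits.lstrip("0").rstrip("0") or "0"
    return int(digits) > 2**53

# 16-digit numerals are flagged exactly when they exceed
# 0.9007199254740992 (= 2**53 / 10**16), i.e. the leading digit is 9:
print(would_lose_accuracy("0.9007199254740993"))  # True
print(would_lose_accuracy("0.8999999999999999"))  # False
```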
Yes. Murray's point is valid, too. But in my view, with the reasoning we have seen here, *and* with the well known software design principle of "least surprise" in mind, I also do think that the default for type.convert() should be what it has been for > 10 years now.
I think there should be two separate discussions: a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this. b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions But most importantly I think a) is better than the status quo - even if the discussion about b) drags out. Cheers, Simon
On 20/04/2014, 2:22 PM, Gábor Csárdi wrote:
How about using the quoting to decide what should be character, and what not? You do not need to quote numbers, logical values, only characters, so this would make sense imo.
That explicitly violates some of the CSV "standards". The quotes must have no effect on the interpretation. Duncan Murdoch
How about something like this:
- if it is quoted (and not specified otherwise in colClasses), then it is a character/factor
- if it is not quoted (and not specified otherwise in colClasses), then the type is automatically detected, according to the pre-3.1.x method, and a (suppressible) warning or error is given if information is lost when coercing to numbers.
Just an idea. Gabor
On Sun, Apr 20, 2014 at 3:24 AM, Murray Stokely <murray at stokely.org> wrote:
[...snip...]
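Gábor's quote-driven heuristic above is concrete enough to sketch. The function name and the crude "more than 15 significant digits" loss test below are inventions for illustration, and Python stands in for R so the sketch stays self-contained (its float is the same 64-bit double):

```python
import warnings

def convert_field(field, was_quoted):
    """Quoted fields stay character; unquoted fields are auto-typed,
    with a suppressible warning when digits may be lost."""
    if was_quoted:
        return field                      # quoted: always character
    try:
        value = float(field)
    except ValueError:
        return field                      # not numeric: keep as-is
    # crude loss-of-precision test: more significant digits than a
    # double can reliably carry
    digits = field.lstrip("+-").split("e")[0].replace(".", "")
    if len(digits.strip("0")) > 15:
        warnings.warn(f"converting {field!r} may lose accuracy")
    return value
```

Under this heuristic an unquoted "1.5" converts silently, an unquoted 17-digit ID converts with a warning, and anything quoted (or non-numeric) passes through as character.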
Agreed. Perhaps even a global option would make sense. We already have an option with a similar spirit: 'options("stringsAsFactors"=T/F)'. Perhaps 'options("exactNumericAsString"=T/F)' [or something else] would be desirable, with the option being the default value to the type.convert argument.
I also like Gabor's idea of a "distinguishing class". R doesn't natively support arbitrary precision numbers (AFAIK), but I think that's what Murray wants. I could imagine some kind of new class emerging here that initially looks just like a character/factor, but may evolve over time to accept arithmetic methods and act more like a number (e.g. knowing that "0.1", ".10" and "1e-1" are the same number, or that "-9" < "-0.2"). A class "bignum" perhaps?
Cheers, Robert
On 4/20/14, 3:24 AM, "Murray Stokely" <murray at stokely.org> wrote:
[...snip...]
McGehee, Robert <Robert.McGehee at geodecapital.com>
on Mon, 21 Apr 2014 09:24:13 -0400 writes:
> Agreed. Perhaps even a global option would make sense. We
> already have an option with a similar spirit:
> 'options("stringsAsFactors"=T/F)'. Perhaps
> 'options("exactNumericAsString"=T/F)' [or something else]
> would be desirable, with the option being the default
> value to the type.convert argument.
No, please, no, not a global option here!
Global options that influence the default behavior of basic
functions are too much against the principle of functional
programming, and my personal opinion has always been that
'stringsAsFactors' has been a mistake (as a global option, not
as an argument).
Note that with such global options, the output of sessionInfo()
would in principle have to contain all (such) global options in
addition to R and package versions in order to diagnose the
behavior of R functions.
I think we have more or less agreed that we'd like to have
a new function *argument* to type.convert();
passed "upstream" to read.table() and via ... the other
read.<foo>() that call read.table.
> I also like Gabor's idea of a "distinguishing class". R
> doesn't natively support arbitrary precision numbers
> (AFAIK), but I think that's what Murray wants. I could
> imagine some kind of new class emerging here that
> initially looks just like a character/factor, but may
> evolve over time to accept arithmetic methods and act more
> like a number (e.g. knowing that "0.1", ".10" and "1e-1"
> are the same number, or that "-9" < "-0.2"). A class
> "bignum" perhaps?
That's another interesting idea. As maintainer of CRAN package
'Rmpfr' and co-maintainer of 'gmp', I'm even biased about this
issue.
Martin
> Cheers, Robert
> [...snip...]
Simon Urbanek <simon.urbanek at r-project.org>
on Sat, 19 Apr 2014 13:06:15 -0400 writes:
> [...snip...]
> I think there should be two separate discussions:
> a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this.
In my (not committed) version of R-devel, I now have
> str(type.convert(format(1/3, digits=17), exact=TRUE))
Factor w/ 1 level "0.33333333333333331": 1
> str(type.convert(format(1/3, digits=17), exact=FALSE))
num 0.333
where the 'exact' argument name has been ``imported'' from the
underlying C code.
[ As we CRAN package writers know by now, arguments nowadays can
hardly be abbreviated anymore, so I am not open to longer
alternative argument names. As someone who likes touch typing,
I'm also not fond of camel case or other keyboard gymnastics (;-)
but if someone has a great idea for a better argument name.... ]
Instead of only TRUE/FALSE, we could consider NA with
semantics "FALSE + warning" or also "TRUE + warning".
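A toy model of the three proposed settings (an illustrative Python stand-in, with None playing NA; the digit-count loss test is a deliberate simplification, not R's C-level check):

```python
import warnings

def type_convert(s, exact=None):
    """exact=True keeps a lossy numeral as a string (the R 3.1.0
    behaviour), exact=False converts regardless (pre-3.1.0), and
    exact=None models NA: convert as with False, but warn."""
    try:
        value = float(s)
    except ValueError:
        return s                                  # non-numeric input
    digits = s.lstrip("+-").split("e")[0].replace(".", "").strip("0")
    lossy = len(digits) > 15                      # simplified accuracy test
    if not lossy:
        return value
    if exact is True:
        return s
    if exact is None:
        warnings.warn(f"accuracy lost converting {s!r}")
    return value
```

So `type_convert("0.33333333333333331", exact=True)` keeps the string, `exact=False` silently returns the nearest double, and the default warns while converting.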
> b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions
I think many have seen the good arguments in both directions.
I'm still strongly advocating that we value long term stability
higher here, and revert to more compatibility with the many
years of previous versions.
If we'd use a default of 'exact=NA', I'd like it to mean
FALSE + warning, but would not oppose much to TRUE + warning.
I agree that for the TRUE case, it may make more sense to return
string-like object of a new (simple) class such as "bignum"
that was mentioned in this thread.
OTOH, this functionality should make it into an R 3.1.1 in the
not so distant future, and thinking through consequences and
implementing the new class approach may just take a tad too much
time...
Martin
> But most importantly I think a) is better than the status quo - even if the discussion about b) drags out.
> Cheers,
> Simon
On Apr 26, 2014, at 4:59 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
I think there should be two separate discussions:
a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this.
In my (not committed) version of R-devel, I now have
str(type.convert(format(1/3, digits=17), exact=TRUE))
Factor w/ 1 level "0.33333333333333331": 1
str(type.convert(format(1/3, digits=17), exact=FALSE))
num 0.333 where the 'exact' argument name has been ``imported'' from the underlying C code.
Looks good to me!
<snip> Instead of only TRUE/FALSE, we could consider NA with semantics "FALSE + warning" or also "TRUE + warning".
b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions
I think many have seen the good arguments in both directions. I'm still strongly advocating that we value long term stability higher here, and revert to more compatibility with the many years of previous versions. If we'd use a default of 'exact=NA', I'd like it to mean FALSE + warning, but would not oppose much to TRUE + warning.
I vote for the default to be 'exact=NA' meaning "FALSE + warning". -Greg
On 27/04/2014, 10:16 AM, Hadley Wickham wrote:
Is there a reason it's a factor and not a string? A string would seem to be more appropriate to me (given that we know it's a number that can't be represented exactly by R)
The user asked that anything which can't be converted to a number should be converted to a factor. Yes, that's a bad default, but some people rely on it. Duncan Murdoch
Hadley
On Saturday, April 26, 2014, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
[...snip...]
Martin Maechler <maechler at stat.math.ethz.ch>
on Sat, 26 Apr 2014 22:59:17 +0200 writes:
[...snip...]
> If we'd use a default of 'exact=NA', I'd like it to mean
> FALSE + warning, but would not oppose much to TRUE +
> warning.
I have now committed svn rev 65507 --- to R-devel only for now ---
the above: exact = NA is the default
and it means "warning + FALSE".
Interestingly, I currently get 5 identical warnings for one
simple call, so there is clearly room for optimization, and
that is one main reason for this change not yet being migrated
to 'R 3.1.0 patched'.
Martin
On 28 Apr 2014, at 19:17 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
[...snip...]
I think there should be two separate discussions:
a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this.
In my (not committed) version of R-devel, I now have
str(type.convert(format(1/3, digits=17), exact=TRUE))
Factor w/ 1 level "0.33333333333333331": 1
str(type.convert(format(1/3, digits=17), exact=FALSE))
num 0.333
where the 'exact' argument name has been ``imported'' from the underlying C code.
[ As we CRAN package writers know by now, arguments nowadays can hardly be abbreviated anymore, and so I am not open to longer alternative argument names, as someone liking blind typing, I'm not fond of camel case or other keyboard gymnastics (;-) but if someone has a great idea for a better argument name.... ]
Instead of only TRUE/FALSE, we could consider NA with semantics "FALSE + warning" or also "TRUE + warning".
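The tri-state behaviour can be tried out in later releases: the `exact` argument shown above existed only in that R-devel revision, but released R (3.1.1 and later) exposes the same three choices through `numerals = c("allow.loss", "warn.loss", "no.loss")`. A small sketch of the semantics (argument names follow the released API; the mapping to TRUE/FALSE/NA is my reading of the thread):

```r
## Same precision-losing input as in the example above.
x <- format(1/3, digits = 17)        # "0.33333333333333331"

## exact = FALSE  ~ numerals = "allow.loss": silently parse as double
str(type.convert(x, numerals = "allow.loss", as.is = TRUE))

## exact = NA     ~ numerals = "warn.loss": parse as double, but warn
str(type.convert(x, numerals = "warn.loss", as.is = TRUE))

## exact = TRUE   ~ numerals = "no.loss": refuse to lose digits, keep the string
str(type.convert(x, numerals = "no.loss", as.is = TRUE))
```

With `as.is = TRUE` the fallback stays a character vector rather than a factor.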
b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions
I think many have seen the good arguments in both directions. I'm still strongly advocating that we value long-term stability more highly here, and revert to more compatibility with the many years of previous versions.
If we'd use a default of 'exact=NA', I'd like it to mean FALSE + warning, but would not oppose much to TRUE + warning.
I have now committed svn rev 65507 --- to R-devel only for now --- the above: exact = NA is the default and it means "warning + FALSE". Interestingly, I currently get 5 identical warnings for one simple call, so there is clearly room for optimization, and that is one main reason for this change to not yet be migrated to 'R 3.1.0 patched'.
I actually think that the default should be the old behaviour. No warning, just potentially lose digits. If this gets a user in trouble, _then_ turn on the check for lost digits.
After all, I think we had about one single use case, where lost digits caused trouble (I cannot even dig up what the case was - someone had, like, 20-digit ID labels, I reckon). In contrast, we have seen umpteen cases where people have exported floating point data to slightly beyond machine precision, "just in case", and relied on read.table() to do the sensible thing.
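The "just in case" export described here is in fact numerically safe: IEEE 754 doubles survive a round trip through 17 significant decimal digits, even though the printed decimal string is not exactly representable in binary. A minimal check:

```r
## 17 significant digits are enough to round-trip any double exactly.
set.seed(42)
v <- runif(5)
s <- format(v, digits = 17)             # export "just in case" full precision
stopifnot(identical(as.numeric(s), v))  # re-import recovers the doubles bit-for-bit
```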
It's also an open question whether we really want to apply the same logic to doubles and integer inputs. The whole change went in as (r62327)
"force type.convert to read e.g. 64-bit integers as strings/factors"
I, for one, did not expect that "e.g." would include 0.12345678901234567. My eyes were on the upcoming 3.0.0 release at that point, so I might not have noticed it anyway, but apparently no one lifted an eyebrow. It seems that this was deliberately postponed for 3.1.0, but for more than a year, no one actually exercised the code.
-pd
BTW, "exact" is a horrible name for an option; how about digitloss=c("allow", "warn", "forbid")?
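The digitloss = c("allow", "warn", "forbid") spelling never made it into R, but it maps directly onto the `numerals` argument that was eventually released. A hypothetical wrapper (the names `type.convert2` and `digitloss` are illustrative only, not part of any R API):

```r
## Hypothetical wrapper: the proposed digitloss interface on top of the
## numerals= argument that released R (3.1.1+) actually provides.
type.convert2 <- function(x, digitloss = c("allow", "warn", "forbid"), ...) {
  digitloss <- match.arg(digitloss)
  numerals <- switch(digitloss,
                     allow  = "allow.loss",
                     warn   = "warn.loss",
                     forbid = "no.loss")
  type.convert(x, numerals = numerals, ...)
}

str(type.convert2(format(1/3, digits = 17), "allow", as.is = TRUE))   # a double
str(type.convert2(format(1/3, digits = 17), "forbid", as.is = TRUE))  # the string
```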
Martin
I agree that for the TRUE case, it may make more sense to return string-like object of a new (simple) class such as "bignum" that was mentioned in this thread.
OTOH, this functionality should make it into an R 3.1.1 in the not so distant future, and thinking through consequences and implementing the new class approach may just take a tad too much time...
Martin
But most importantly I think a) is better than the status quo - even if the discussion about b) drags out.
Cheers, Simon
Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com