type.convert and doubles
30 messages · Gregory R. Warnes, Paul Gilbert, Gabor Grothendieck +10 more
Greg,
On Apr 11, 2014, at 11:50 AM, Gregory R. Warnes <greg at warnes.net> wrote:
Hi All, I see this in the NEWS for R 3.1.0: type.convert() (and hence by default read.table()) returns a character vector or factor when representing a numeric input as a double would lose accuracy. Similarly for complex inputs. This behavior seems likely to surprise users.
Can you elaborate why that would be surprising? It is consistent with the intention of type.convert() to determine the correct type to represent the value - it has always used character/factor as a fallback where native type doesn't match. It has never issued any warning in that case historically, so IMHO it would be rather surprising if it did now? Cheers, Simon
Would it be possible to issue a warning when this occurs? Aside: I'm very happy to see the new 's' and 'f' browser (debugger) commands! -Greg
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
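Simon's description of the fallback can be illustrated with a minimal sketch (not from the thread; `as.is` is passed explicitly because newer versions of R warn when it is missing):

```r
# type.convert() picks the narrowest type that can represent every
# non-missing value, falling back to character (or factor) at the end.
class(type.convert(c("TRUE", "FALSE"), as.is = TRUE))  # "logical"
class(type.convert(c("1", "2"), as.is = TRUE))         # "integer"
class(type.convert(c("1.5", "2"), as.is = TRUE))       # "numeric"
class(type.convert(c("1.5", "apple"), as.is = TRUE))   # "character"
```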
On 04/11/2014 01:43 PM, Simon Urbanek wrote:
Can you elaborate why that would be surprising? It is consistent with the intention of type.convert() to determine the correct type to represent the value - it has always used character/factor as a fallback where native type doesn't match.
Strictly speaking, I don't think this is true. If it were, it would not have been necessary to make the change so that it now falls back to using character/factor. It may, however, have always been the intent. I don't really think a warning is necessary, but there are some surprises:

> str(type.convert(format(1/3, digits=17))) # R-3.0.3
 num 0.333
> str(type.convert(format(1/3, digits=17))) # R-3.1.0
 Factor w/ 1 level "0.33333333333333331": 1

Now you could say that one should never do that, and the change is just flushing out a bug that was always there. But the point is that in serialization situations there can be some surprises. So, for example, RODBC talking to PostgreSQL databases is now returning factors rather than numerics for double precision fields, whereas with RPostgreSQL the behaviour has not changed.

Paul
5 days later
Hi, As Greg suggested, this new feature in type.convert certainly did surprise one user (me), enough so that I had to downgrade back to 3.0.3 until our code was modified to handle the new behavior.

Here's my use case: I have a function that pulls arbitrary financial data from a web service call, such as a stock's industry, price, volume, etc., by reading the web output as a text table. The data may be either character (industry, stock name, etc.) or numeric (price, volume, etc.), and the function generally doesn't know the class in advance. The problem is that we frequently get numeric values represented with more precision than actually exists, for instance a price of "2.6999999999999999" rather than "2.70". The numeric representation is exactly one digit too much for type.convert, which (in R 3.1.0) converts it to character instead of numeric (not what I want). This caused a bunch of "non-numeric argument to binary operator" errors to appear today as numeric data was now being represented as characters. I have no doubt that this probably will cause some unwanted RODBC side effects for us as well.

IMO, getting the class right is more important than infinite precision. What use is a character representation of a number anyway if you can't perform arithmetic on it? I would favor at least making the new behavior optional, but I think many packages (like RODBC) potentially need to be patched to code around the new feature if it's left in. (This aside, thanks for all the nice features and bug fixes in the new version!)

Cheers, Robert
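A sketch (not Robert's code) of why "2.6999999999999999" trips the new check: the string does not survive a round trip through a double, so R 3.1.0 concludes that representing it as numeric would lose accuracy.

```r
# The closest double to 2.6999999999999999 prints back differently,
# so the conversion is deemed lossy and the value is left as
# character/factor under the R 3.1.0 behavior.
x <- "2.6999999999999999"
format(as.numeric(x), digits = 17)  # prints "2.6999999999999997", not x
as.numeric(x) == 2.7                # FALSE: it parses to the double below 2.7
```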
> str(type.convert(format(1/3, digits=17))) # R-3.0.3
 num 0.333
> str(type.convert(format(1/3, digits=17))) # R-3.1.0
 Factor w/ 1 level "0.33333333333333331": 1
It is bizarre that it makes a factor rather than a string. 0.333333333333 is pretty obviously not a categorical value. Hadley
On 17/04/2014 9:42 AM, McGehee, Robert wrote:
Hi, As Greg suggested, this new feature in type.convert certainly did surprise one user (me), enough so that I had to downgrade back to 3.0.3 until our code was modified to handle the new behavior.
I don't have an opinion on this particular change, but one way to avoid surprises like this is to test releases when they become available. For 3.1.0, the alpha became available on March 13. If you are especially eager, you can follow the news feed at http://developer.r-project.org/blosxom.cgi/R-devel/NEWS. This particular change was announced there more than a year ago (http://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2013/03/19), and has shown up several times since as minor edits have been made to the announcement. Duncan Murdoch
On Thu, Apr 17, 2014 at 6:42 AM, McGehee, Robert
<Robert.McGehee at geodecapital.com> wrote:
> [...] IMO, getting the class right is more important than infinite precision. What use is a character representation of a number anyway if you can't perform arithmetic on it? I would favor at least making the new behavior optional [...]
The uses of character representation of a number are many: unique
identifiers/user ids, hash codes, timestamps, or other values where
rounding results to the nearest value that can be represented as a
numeric type would completely change the results of any data analysis
performed on that data.
Database join operations are certainly an area where R's previous
behavior of silently dropping precision of numbers with type.convert
can get you into trouble. For example, things like join operations or
group by operations performed in R code would produce erroneous
results if you are joining/grouping by a key without the full
precision of your underlying data. Records can get joined up
incorrectly or aggregated with the wrong groups.
If you later want to do arithmetic on them, you can choose to lose
precision by using as.numeric() or use one of the large number
packages on CRAN (GMP, int64, bit64, etc.). But once you've dropped
the precision with as.numeric you can never get it back, which is why
the previous behavior was clearly dangerous.
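Murray's point about irreversibility can be seen with a small sketch (the numbers here are chosen for illustration): above 2^53 not every integer is exactly representable as a double, so distinct identifiers can collapse to the same numeric value.

```r
# Two distinct string ids that map to the same IEEE-754 double:
a <- "9007199254740992"  # 2^53
b <- "9007199254740993"  # 2^53 + 1 (not representable; rounds to 2^53)
identical(a, b)                 # FALSE as strings
as.numeric(a) == as.numeric(b)  # TRUE as doubles -- the distinction is gone
```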
I think I had some additional examples in the original bug/patch I
filed about this issue a few years ago, but I'm unable to find it on
bugs.r-project.org and it's not referenced in the cl descriptions or
news file.
- Murray
On Thu, Apr 17, 2014 at 2:21 PM, Murray Stokely <murray at stokely.org> wrote:
If you later want to do arithmetic on them, you can choose to lose precision by using as.numeric() or use one of the large number packages on CRAN (GMP, int64, bit64, etc.). But once you've dropped the precision with as.numeric you can never get it back, which is why the previous behavior was clearly dangerous.
Only if you knew that that column was supposed to be numeric. There is nothing in type.convert or read.table to allow you to override how it works (colClasses only works if you knew which columns are which in the first place) nor is there anything to allow you to know which columns were affected so that you know which columns to look at to fix it yourself afterwards.
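Gabor's caveat about colClasses can be sketched as follows (the file contents are hypothetical): it works, but only when you already know which column is which.

```r
# Forcing the class up front with colClasses sidesteps the type
# heuristic entirely, at the cost of knowing the schema in advance.
tf <- tempfile(fileext = ".csv")
writeLines(c("ticker,price", "XYZ,2.6999999999999999"), tf)
df <- read.csv(tf, colClasses = c(ticker = "character", price = "numeric"))
class(df$price)  # "numeric", regardless of the precision check
```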
On Thu, Apr 17, 2014 at 2:35 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Only if you knew that that column was supposed to be numeric. There is
The columns that are "supposed" to be numeric are those that can fit into a numeric data type. Previously that was not always the case: columns that could not be represented as a numeric were erroneously coerced into a truncated/rounded numeric.
nothing in type.convert or read.table to allow you to override how it works (colClasses only works if you knew which columns are which in the first place) nor is there anything to allow you to know which columns were affected so that you know which columns to look at to fix it yourself afterwards.
You want a casting operation in your SQL query or similar if you want
a rounded type that will always fit in a double. Cast or Convert
operators in SQL, or similar for however you are getting the data you
want to use with type.convert(). This is all application specific and
sort of beyond the scope of type.convert(), which now behaves as it
has been documented to behave.
In my code for this kind of thing, however, I have typically introduced
an option() to let the user control casting behavior for e.g. 64-bit
ints in C++. Should they be returned as truncated precision numeric
types or the full precision data in a character string representation?
In the RProtoBuf package we let the user specify an option() to choose which behavior they need for their application, rather than always returning the safer character representation and making them coerce to numeric themselves.
- Murray
On 04/17/2014 02:21 PM, Murray Stokely wrote:
On Thu, Apr 17, 2014 at 6:42 AM, McGehee, Robert <Robert.McGehee at geodecapital.com> wrote:
The uses of character representation of a number are many: unique identifiers/user ids, hash codes, timestamps, or other values where rounding results to the nearest value that can be represented as a numeric type would completely change the results of any data analysis performed on that data. Database join operations are certainly an area where R's previous behavior of silently dropping precision of numbers with type.convert can get you into trouble. For example, things like join operations or group by operations performed in R code would produce erroneous results if you are joining/grouping by a key without the full precision of your underlying data. Records can get joined up incorrectly or aggregated with the wrong groups.
I don't understand this. Assuming you are sending the SQL statement to the database engine, none of this erroneous matching is happening in R. The calculations all happen on the database. But, for the case where the database does know that numbers are double precision, it would be nice if they got transmitted by ODBC to R as numerics (the usual translation), just as they are by the native interfaces like RPostgreSQL. Do you get the erroneous results when you use a native interface? (From the second response:)
You want a casting operation in your SQL query or similar if you want a rounded type that will always fit in a double. Cast or Convert operators in SQL, or similar for however you are getting the data you want to use with type.convert(). This is all application specific and sort of beyond the scope of type.convert(), which now behaves as it has been documented to behave.
This seems to suggest I need to use different SQL statements depending on which interface I use to talk to the database. If you do 1/3 in a database calculation and that ends up being represented as something more accurate than double precision on the database, then it needs to be transmitted as something with higher precision (character/factor?). If the result is double precision it should be sent as double precision, not as something pretending to be more accurate. I suspect the difficulty with ODBC may be that type.convert() really should not be called when both ends of the communication know that a double precision number is being exchanged. Paul
This is all application specific and sort of beyond the scope of type.convert(), which now behaves as it has been documented to behave.
That's only a true statement because the documentation was changed to reflect the new behavior! The new feature in type.convert certainly does not behave according to the documentation as of R 3.0.3. Here's a snippet:
The first type that can accept all the
non-missing values is chosen (numeric and complex return values
will represented approximately, of course).
The key phrase is in parentheses, which reminds the user to expect a possible loss of precision. That important parenthetical was removed from the documentation in R 3.1.0 (among other changes).
Putting aside the fact that this introduces a large amount of unnecessary work rewriting SQL / data import code, SQL packages, my biggest conceptual problem is that I can no longer rely on a particular function call returning a particular class. In my example querying stock prices, about 5% of prices came back as factors and the remaining 95% as numeric, so we had random errors popping in throughout the morning.
Here's a short example showing us how the new behavior can be unreliable. I pass a character representation of a uniformly distributed random variable to type.convert. 90% of the time it is converted to "numeric" and 10% it is a "factor" (in R 3.1.0). In the 10% of cases in which type.convert converts to a factor the leading non-zero digit is always a 9. So if you were expecting a numeric value, then 1 in 10 times you may have a bug in your code that didn't exist before.
> options(digits=16)
> cl <- NULL; for (i in 1:10000) cl[i] <- class(type.convert(format(runif(1))))
> table(cl)
cl
 factor numeric
    990    9010

Cheers, Robert
1 day later
McGehee, Robert <Robert.McGehee at geodecapital.com>
on Thu, 17 Apr 2014 19:15:47 -0400 writes:
>> This is all application specific and
>> sort of beyond the scope of type.convert(), which now behaves as it
>> has been documented to behave.
> That's only a true statement because the documentation was changed to reflect the new behavior! The new feature in type.convert certainly does not behave according to the documentation as of R 3.0.3. Here's a snippit:
> The first type that can accept all the
> non-missing values is chosen (numeric and complex return values
> will represented approximately, of course).
> The key phrase is in parentheses, which reminds the user to expect a possible loss of precision. That important parenthetical was removed from the documentation in R 3.1.0 (among other changes).
> Putting aside the fact that this introduces a large amount of unnecessary work rewriting SQL / data import code, SQL packages, my biggest conceptual problem is that I can no longer rely on a particular function call returning a particular class. In my example querying stock prices, about 5% of prices came back as factors and the remaining 95% as numeric, so we had random errors popping in throughout the morning.
> Here's a short example showing us how the new behavior can be unreliable. I pass a character representation of a uniformly distributed random variable to type.convert. 90% of the time it is converted to "numeric" and 10% it is a "factor" (in R 3.1.0). In the 10% of cases in which type.convert converts to a factor the leading non-zero digit is always a 9. So if you were expecting a numeric value, then 1 in 10 times you may have a bug in your code that didn't exist before.
>> options(digits=16)
>> cl <- NULL; for (i in 1:10000) cl[i] <- class(type.convert(format(runif(1))))
>> table(cl)
> cl
> factor numeric
> 990 9010
Yes.
Murray's point is valid, too.
But in my view, with the reasoning we have seen here,
*and* with the well known software design principle of
"least surprise" in mind,
I also do think that the default for type.convert() should be what
it has been for > 10 years now.
Martin
On Apr 19, 2014, at 9:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
Yes. Murray's point is valid, too. But in my view, with the reasoning we have seen here, *and* with the well known software design principle of "least surprise" in mind, I also do think that the default for type.convert() should be what it has been for > 10 years now.
I think there should be two separate discussions:

a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this.

b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions.

But most importantly I think a) is better than the status quo - even if the discussion about b) drags out. Cheers, Simon
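A sketch of what option (a) might look like as an argument (this mirrors the `numerals` argument that R 3.1.1 eventually added to type.convert(); at the time of this thread the exact names are hypothetical):

```r
# Hypothetical control over lossy conversion, in the shape R 3.1.1
# later adopted: numerals = "allow.loss", "warn.loss", or "no.loss".
x <- "2.6999999999999999"
class(type.convert(x, as.is = TRUE, numerals = "allow.loss"))  # numeric, rounds
class(type.convert(x, as.is = TRUE, numerals = "no.loss"))     # kept as character
```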
On Sat, Apr 19, 2014 at 1:06 PM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
On Apr 19, 2014, at 9:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote: [...] I think there should be two separate discussions: a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. b) decide what the default for a) will be. [...]
Another possibility is:

(c) Return the column as factor/character but with a distinguishing class so that the user can reset its class later, e.g.

DF <- read.table(...)
DF[] <- lapply(DF, function(x) if (inherits(x, "special.class")) as.numeric(x) else x)

Personally I would go with (a) in both type.convert and read.table with a default that reflects the historical behavior rather than the current 3.1 behavior.
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
Yes, I'm also strongly in favor of having an option for this. If
there was an option in base R for controlling this we would just use
that and get rid of the separate RProtoBuf.int64AsString option we use
in the RProtoBuf package on CRAN to control whether 64-bit int types
from C++ are returned to R as numerics or character vectors.
I agree that reasonable people can disagree about the default, but I
found my original bug report about this, so I will counter Robert's
example with my favorite example of what was wrong with the previous
behavior:
tmp <- data.frame(n = c("72057594037927936", "72057594037927937"),
                  name = c("foo", "bar"))
length(unique(tmp$n))
# 2
write.csv(tmp, "/tmp/foo.csv", quote=FALSE, row.names=FALSE)
data <- read.csv("/tmp/foo.csv")
length(unique(data$n))
# 1
- Murray
On Sat, Apr 19, 2014 at 10:06 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
On Apr 19, 2014, at 9:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
McGehee, Robert <Robert.McGehee at geodecapital.com> on Thu, 17 Apr 2014 19:15:47 -0400 writes:
This is all application specific and sort of beyond the scope of type.convert(), which now behaves as it has been documented to behave.
That's only a true statement because the documentation was changed to reflect the new behavior! The new feature in type.convert certainly does not behave according to the documentation as of R 3.0.3. Here's a snippit:
The first type that can accept all the non-missing values is chosen (numeric and complex return values will represented approximately, of course).
The key phrase is in parentheses, which reminds the user to expect a possible loss of precision. That important parenthetical was removed from the documentation in R 3.1.0 (among other changes).
Putting aside the fact that this introduces a large amount of unnecessary work rewriting SQL / data import code, SQL packages, my biggest conceptual problem is that I can no longer rely on a particular function call returning a particular class. In my example querying stock prices, about 5% of prices came back as factors and the remaining 95% as numeric, so we had random errors popping in throughout the morning.
Here's a short example showing us how the new behavior can be unreliable. I pass a character representation of a uniformly distributed random variable to type.convert. 90% of the time it is converted to "numeric" and 10% it is a "factor" (in R 3.1.0). In the 10% of cases in which type.convert converts to a factor the leading non-zero digit is always a 9. So if you were expecting a numeric value, then 1 in 10 times you may have a bug in your code that didn't exist before.
options(digits=16)
cl <- NULL; for (i in 1:10000) cl[i] <- class(type.convert(format(runif(1))))
table(cl)
cl
 factor numeric
    990    9010
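The leading-9 pattern has a tidy arithmetic reading, though this is an inference from the reported numbers rather than anything quoted from R's sources: a 16-digit significand sits exactly in a double's 53-bit mantissa only if, read as an integer, it is at most 2^53 = 9007199254740992, so 16-digit numerals above 0.9007199254740992 are precisely the ones at risk, and P(runif > 0.90072) is about 0.099, matching the 990/10000 split. A sketch of that test (illustrative Python, not R's implementation):

```python
def would_lose_accuracy(numeral):
    """Flag a decimal numeral whose significant digits, read as an
    integer, exceed 2**53 and so cannot sit exactly in a double's
    mantissa."""
    digits = numeral.strip().lstrip("+-").split("e")[0].replace(".", "")
    digits = digits.lstrip("0").rstrip("0") or "0"
    return int(digits) > 2**53

# 16-digit numerals are flagged exactly when they exceed
# 0.9007199254740992 (= 2**53 / 10**16), i.e. the leading digit is 9:
print(would_lose_accuracy("0.9007199254740993"))  # True
print(would_lose_accuracy("0.8999999999999999"))  # False
```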
Yes. Murray's point is valid, too. But in my view, with the reasoning we have seen here, *and* with the well known software design principle of "least surprise" in mind, I also do think that the default for type.convert() should be what it has been for > 10 years now.
I think there should be two separate discussions: a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this. b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions But most importantly I think a) is better than the status quo - even if the discussion about b) drags out. Cheers, Simon
On 20/04/2014, 2:22 PM, Gábor Csárdi wrote:
How about using the quoting to decide what should be character, and what not? You do not need to quote numbers, logical values, only characters, so this would make sense imo.
That explicitly violates some of the CSV "standards". The quotes must have no effect on the interpretation. Duncan Murdoch
How about something like this:
- if it is quoted (and not specified otherwise in colClasses), then it is a character/factor
- if it is not quoted (and not specified otherwise in colClasses), then the type is automatically detected, according to the pre-3.1.x method, and a (suppressible) warning or error is given if information is lost when coercing to numbers.
Just an idea. Gabor
On Sun, Apr 20, 2014 at 3:24 AM, Murray Stokely <murray at stokely.org> wrote:
[...snip...]
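Gábor's quote-driven heuristic above is concrete enough to sketch. The function name and the crude "more than 15 significant digits" loss test below are inventions for illustration, and Python stands in for R so the sketch stays self-contained (its float is the same 64-bit double):

```python
import warnings

def convert_field(field, was_quoted):
    """Quoted fields stay character; unquoted fields are auto-typed,
    with a suppressible warning when digits may be lost."""
    if was_quoted:
        return field                      # quoted: always character
    try:
        value = float(field)
    except ValueError:
        return field                      # not numeric: keep as-is
    # crude loss-of-precision test: more significant digits than a
    # double can reliably carry
    digits = field.lstrip("+-").split("e")[0].replace(".", "")
    if len(digits.strip("0")) > 15:
        warnings.warn(f"converting {field!r} may lose accuracy")
    return value
```

Under this heuristic an unquoted "1.5" converts silently, an unquoted 17-digit ID converts with a warning, and anything quoted (or non-numeric) passes through as character.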
Agreed. Perhaps even a global option would make sense. We already have an option with a similar spirit: 'options("stringsAsFactors"=T/F)'. Perhaps 'options("exactNumericAsString"=T/F)' [or something else] would be desirable, with the option being the default value to the type.convert argument.
I also like Gabor's idea of a "distinguishing class". R doesn't natively support arbitrary precision numbers (AFAIK), but I think that's what Murray wants. I could imagine some kind of new class emerging here that initially looks just like a character/factor, but may evolve over time to accept arithmetic methods and act more like a number (e.g. knowing that "0.1", ".10" and "1e-1" are the same number, or that "-9" < "-0.2"). A class "bignum" perhaps?
Cheers, Robert
On 4/20/14, 3:24 AM, "Murray Stokely" <murray at stokely.org> wrote:
[...snip...]
McGehee, Robert <Robert.McGehee at geodecapital.com>
on Mon, 21 Apr 2014 09:24:13 -0400 writes:
> Agreed. Perhaps even a global option would make sense. We
> already have an option with a similar spirit:
> 'options("stringsAsFactors"=T/F)'. Perhaps
> 'options("exactNumericAsString"=T/F)' [or something else]
> would be desirable, with the option being the default
> value to the type.convert argument.
No, please, no, not a global option here!
Global options that influence the default behavior of basic
functions are too much against the principle of functional
programming, and my personal opinion has always been that
'stringsAsFactors' has been a mistake (as a global option, not
as an argument).
Note that with such global options, the output of sessionInfo()
would in principle have to contain all (such) global options in
addition to R and package versions in order to diagnose the
behavior of R functions.
I think we have more or less agreed that we'd like to have
a new function *argument* to type.convert();
passed "upstream" to read.table() and via ... the other
read.<foo>() that call read.table.
> I also like Gabor's idea of a "distinguishing class". R
> doesn't natively support arbitrary precision numbers
> (AFAIK), but I think that's what Murray wants. I could
> imagine some kind of new class emerging here that
> initially looks just like a character/factor, but may
> evolve over time to accept arithmetic methods and act more
> like a number (e.g. knowing that "0.1", ".10" and "1e-1"
> are the same number, or that "-9" < "-0.2"). A class
> "bignum" perhaps?
That's another interesting idea. As maintainer of CRAN package
'Rmpfr' and co-maintainer of 'gmp', I'm even biased about this
issue.
Martin
> Cheers, Robert
> [...snip...]
Simon Urbanek <simon.urbanek at r-project.org>
on Sat, 19 Apr 2014 13:06:15 -0400 writes:
> [...snip...]
> I think there should be two separate discussions:
> a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this.
In my (not committed) version of R-devel, I now have
> str(type.convert(format(1/3, digits=17), exact=TRUE))
Factor w/ 1 level "0.33333333333333331": 1
> str(type.convert(format(1/3, digits=17), exact=FALSE))
num 0.333
where the 'exact' argument name has been ``imported'' from the
underlying C code.
[ As we CRAN package writers know by now, arguments nowadays can
hardly be abbreviated anymore, so I am not open to longer
alternative argument names. As someone who likes touch typing,
I'm also not fond of camel case or other keyboard gymnastics (;-)
but if someone has a great idea for a better argument name.... ]
Instead of only TRUE/FALSE, we could consider NA with
semantics "FALSE + warning" or also "TRUE + warning".
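A toy model of the three proposed settings (an illustrative Python stand-in, with None playing NA; the digit-count loss test is a deliberate simplification, not R's C-level check):

```python
import warnings

def type_convert(s, exact=None):
    """exact=True keeps a lossy numeral as a string (the R 3.1.0
    behaviour), exact=False converts regardless (pre-3.1.0), and
    exact=None models NA: convert as with False, but warn."""
    try:
        value = float(s)
    except ValueError:
        return s                                  # non-numeric input
    digits = s.lstrip("+-").split("e")[0].replace(".", "").strip("0")
    lossy = len(digits) > 15                      # simplified accuracy test
    if not lossy:
        return value
    if exact is True:
        return s
    if exact is None:
        warnings.warn(f"accuracy lost converting {s!r}")
    return value
```

So `type_convert("0.33333333333333331", exact=True)` keeps the string, `exact=False` silently returns the nearest double, and the default warns while converting.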
> b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions
I think many have seen the good arguments in both directions.
I'm still strongly advocating that we value long term stability
higher here, and revert to more compatibility with the many
years of previous versions.
If we'd use a default of 'exact=NA', I'd like it to mean
FALSE + warning, but would not oppose much to TRUE + warning.
I agree that for the TRUE case, it may make more sense to return
string-like object of a new (simple) class such as "bignum"
that was mentioned in this thread.
OTOH, this functionality should make it into an R 3.1.1 in the
not so distant future, and thinking through consequences and
implementing the new class approach may just take a tad too much
time...
Martin
> But most importantly I think a) is better than the status quo - even if the discussion about b) drags out.
> Cheers,
> Simon
On Apr 26, 2014, at 4:59 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
I think there should be two separate discussions:
a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this.
In my (not committed) version of R-devel, I now have
str(type.convert(format(1/3, digits=17), exact=TRUE))
Factor w/ 1 level "0.33333333333333331": 1
str(type.convert(format(1/3, digits=17), exact=FALSE))
num 0.333 where the 'exact' argument name has been ``imported'' from the underlying C code.
Looks good to me!
<snip> Instead of only TRUE/FALSE, we could consider NA with semantics "FALSE + warning" or also "TRUE + warning".
b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions
I think many have seen the good arguments in both directions. I'm still strongly advocating that we value long term stability higher here, and revert to more compatibility with the many years of previous versions. If we'd use a default of 'exact=NA', I'd like it to mean FALSE + warning, but would not oppose much to TRUE + warning.
I vote for the default to be 'exact=NA' meaning "FALSE + warning". -Greg
On 27/04/2014, 10:16 AM, Hadley Wickham wrote:
Is there a reason it's a factor and not a string? A string would seem to be more appropriate to me (given that we know it's a number that can't be represented exactly by R)
The user asked that anything which can't be converted to a number should be converted to a factor. Yes, that's a bad default, but some people rely on it. Duncan Murdoch
Hadley
On Saturday, April 26, 2014, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
[...snip...]
Martin Maechler <maechler at stat.math.ethz.ch>
on Sat, 26 Apr 2014 22:59:17 +0200 writes:
[...snip...]
> If we'd use a default of 'exact=NA', I'd like it to mean
> FALSE + warning, but would not oppose much to TRUE +
> warning.
I have now committed svn rev 65507 --- to R-devel only for now ---
the above: exact = NA is the default
and it means "warning + FALSE".
Interestingly, I currently get 5 identical warnings for one
simple call, so there is clearly room for optimization, and
that is one main reason for this change not yet being migrated
to 'R 3.1.0 patched'.
Martin
On 28 Apr 2014, at 19:17 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
[...snip...]
I think there should be two separate discussions:
a) have an option (argument to type.convert and possibly read.table) to enable/disable this behavior. I'm strongly in favor of this.
In my (not committed) version of R-devel, I now have
str(type.convert(format(1/3, digits=17), exact=TRUE))
Factor w/ 1 level "0.33333333333333331": 1
str(type.convert(format(1/3, digits=17), exact=FALSE))
num 0.333
where the 'exact' argument name has been ``imported'' from the underlying C code.
[ As we CRAN package writers know by now, arguments nowadays can hardly be abbreviated anymore, and so I am not open to longer alternative argument names, as someone liking blind typing, I'm not fond of camel case or other keyboard gymnastics (;-) but if someone has a great idea for a better argument name.... ]
Instead of only TRUE/FALSE, we could consider NA with semantics "FALSE + warning" or also "TRUE + warning".
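The tri-state behaviour can be tried out in later releases: the `exact` argument shown above existed only in that R-devel revision, but released R (3.1.1 and later) exposes the same three choices through `numerals = c("allow.loss", "warn.loss", "no.loss")`. A small sketch of the semantics (argument names follow the released API; the mapping to TRUE/FALSE/NA is my reading of the thread):

```r
## Same precision-losing input as in the example above.
x <- format(1/3, digits = 17)        # "0.33333333333333331"

## exact = FALSE  ~ numerals = "allow.loss": silently parse as double
str(type.convert(x, numerals = "allow.loss", as.is = TRUE))

## exact = NA     ~ numerals = "warn.loss": parse as double, but warn
str(type.convert(x, numerals = "warn.loss", as.is = TRUE))

## exact = TRUE   ~ numerals = "no.loss": refuse to lose digits, keep the string
str(type.convert(x, numerals = "no.loss", as.is = TRUE))
```

With `as.is = TRUE` the fallback stays a character vector rather than a factor.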
b) decide what the default for a) will be. I have no strong opinion, I can see arguments in both directions
I think many have seen the good arguments in both directions. I'm still strongly advocating that we value long-term stability more highly here, and revert to more compatibility with the many years of previous versions.
If we'd use a default of 'exact=NA', I'd like it to mean FALSE + warning, but would not oppose much to TRUE + warning.
I have now committed svn rev 65507 --- to R-devel only for now --- the above: exact = NA is the default and it means "warning + FALSE". Interestingly, I currently get 5 identical warnings for one simple call, so there is clearly room for optimization, and that is one main reason for this change to not yet be migrated to 'R 3.1.0 patched'.
I actually think that the default should be the old behaviour. No warning, just potentially lose digits. If this gets a user in trouble, _then_ turn on the check for lost digits.
After all, I think we had about one single use case, where lost digits caused trouble (I cannot even dig up what the case was - someone had, like, 20-digit ID labels, I reckon). In contrast, we have seen umpteen cases where people have exported floating point data to slightly beyond machine precision, "just in case", and relied on read.table() to do the sensible thing.
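The "just in case" export described here is in fact numerically safe: IEEE 754 doubles survive a round trip through 17 significant decimal digits, even though the printed decimal string is not exactly representable in binary. A minimal check:

```r
## 17 significant digits are enough to round-trip any double exactly.
set.seed(42)
v <- runif(5)
s <- format(v, digits = 17)             # export "just in case" full precision
stopifnot(identical(as.numeric(s), v))  # re-import recovers the doubles bit-for-bit
```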
It's also an open question whether we really want to apply the same logic to doubles and integer inputs. The whole change went in as (r62327)
"force type.convert to read e.g. 64-bit integers as strings/factors"
I, for one, did not expect that "e.g." would include 0.12345678901234567. My eyes were on the upcoming 3.0.0 release at that point, so I might not have noticed it anyway, but apparently no one lifted an eyebrow. It seems that this was deliberately postponed for 3.1.0, but for more than a year, no one actually exercised the code.
-pd
BTW, "exact" is a horrible name for an option; how about digitloss=c("allow", "warn", "forbid")?
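The digitloss = c("allow", "warn", "forbid") spelling never made it into R, but it maps directly onto the `numerals` argument that was eventually released. A hypothetical wrapper (the names `type.convert2` and `digitloss` are illustrative only, not part of any R API):

```r
## Hypothetical wrapper: the proposed digitloss interface on top of the
## numerals= argument that released R (3.1.1+) actually provides.
type.convert2 <- function(x, digitloss = c("allow", "warn", "forbid"), ...) {
  digitloss <- match.arg(digitloss)
  numerals <- switch(digitloss,
                     allow  = "allow.loss",
                     warn   = "warn.loss",
                     forbid = "no.loss")
  type.convert(x, numerals = numerals, ...)
}

str(type.convert2(format(1/3, digits = 17), "allow", as.is = TRUE))   # a double
str(type.convert2(format(1/3, digits = 17), "forbid", as.is = TRUE))  # the string
```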
Martin
I agree that for the TRUE case, it may make more sense to return string-like object of a new (simple) class such as "bignum" that was mentioned in this thread.
OTOH, this functionality should make it into an R 3.1.1 in the not so distant future, and thinking through consequences and implementing the new class approach may just take a tad too much time...
Martin
But most importantly I think a) is better than the status quo - even if the discussion about b) drags out.
Cheers, Simon
Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com