Hi, We at 0xdata use Java and R together, and the new behavior for read.csv has made R unable to read the output of Java's Double.toString(). This, needless to say, is disruptive for us. (Actually, it was downright shocking.) +1 for restoring old behavior. Thanks, Tom
Please make Pre-3.1 read.csv (type.convert) behavior available
12 messages · Dirk Eddelbuettel, Duncan Murdoch, Tom Kraljevic +2 more
On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
Hi, We at 0xdata use Java and R together, and the new behavior for read.csv has made R unable to read the output of Java's Double.toString().
It may be less convenient, but it's certainly not "unable". Use colClasses.
This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
It wouldn't have been a shock if you had tested pre-release versions. Commercial users of R should be contributing to its development, and that's a really easy way to do so. Duncan Murdoch
+1 for restoring old behavior.
On 26 April 2014 at 07:28, Duncan Murdoch wrote:
| On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
| > Hi,
| >
| > We at 0xdata use Java and R together, and the new behavior for read.csv has
| > made R unable to read the output of Java's Double.toString().
|
| It may be less convenient, but it's certainly not "unable". Use colClasses.
|
| > This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
|
| It wouldn't have been a shock if you had tested pre-release versions.
| Commercial users of R should be contributing to its development, and
| that's a really easy way to do so.

Seconded.

For what it is worth, I made five pre-releases available within Debian. Testing these was just an apt-get away.

In any event, you can also farm out the old behaviour to a (local or even CRAN) package that provides the old behaviour if your life depends upon it. Or you could use real serialization rather than relying on the crutch that is csv.

Dirk
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
Hi Dirk, Thanks for taking the time to respond (both here and in other forums). Most of what I wanted to share I put in a followup response to Duncan (please read that thread if you're interested). I would like to comment on the last point you brought up, though, in case anyone else finds it beneficial. For data which is exchanged programmatically machine-to-machine, I was able to use Java's Double.toHexString() as a direct replacement for toString(). R is able to read this lossless (but still text) format. So this addresses some of the challenges we have with this change. Thanks, Tom
On Apr 26, 2014, at 5:26 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
On 26 April 2014 at 07:28, Duncan Murdoch wrote:
| On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
| > Hi,
| >
| > We at 0xdata use Java and R together, and the new behavior for read.csv has
| > made R unable to read the output of Java's Double.toString().
|
| It may be less convenient, but it's certainly not "unable". Use colClasses.
|
| > This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
|
| It wouldn't have been a shock if you had tested pre-release versions.
| Commercial users of R should be contributing to its development, and
| that's a really easy way to do so.

Seconded.

For what it is worth, I made five pre-releases available within Debian. Testing these was just an apt-get away.

In any event, you can also farm out the old behaviour to a (local or even CRAN) package that provides the old behaviour if your life depends upon it. Or you could use real serialization rather than relying on the crutch that is csv.

Dirk

--
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
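A side note on the hex-float workaround Tom describes: Python's float.hex() and float.fromhex() read and write the same C99-style hexadecimal notation as Java's Double.toHexString(), so the losslessness of the format is easy to check outside Java. A sketch of mine, not part of the original thread:

```python
import math
import random

# Round-trip doubles through the hex text representation.
# The format encodes the full 52-bit mantissa, so no value may change.
random.seed(1)
values = [0.1, -2.5e300, math.pi] + [random.uniform(-1e9, 1e9) for _ in range(1000)]
for v in values:
    text = v.hex()                      # e.g. 0.1 -> '0x1.999999999999ap-4'
    assert float.fromhex(text) == v, (v, text)
print("all", len(values), "round trips exact")
```

This is exactly why Tom could use the hex form as a drop-in replacement for Double.toString() in machine-to-machine exchange.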
On 26/04/2014, 12:28 PM, Tom Kraljevic wrote:
Hi Duncan, Please allow me to add a bit more context, which I probably should have added to my original message. We actually did see this in an R 3.1 beta (pulled in by an apt-get) and thought the behavior had been released accidentally. From my user perspective, the parsing of a string like "1.2345678901234567890" into a factor was so surprising, I actually assumed it was just a really bad bug that would be fixed before the "real" release. I didn't bother reporting it since I assumed beta users would be heavily impacted and there was no way it wouldn't be fixed. Apologies for that mistake on my part.
The beta stage is quite late. There's a non-zero risk that a bug detected during the beta stage will make it through to release, especially if the report doesn't arrive until after we've switched to release candidates. This change was made very early in the development cycle of 3.1.0, back in March 2013. If you are making serious use of R, I'd really recommend that you try out some of the R-devel versions early, when design decisions are being made. I suspect this feature would have been changed if we'd heard your complaints then. It'll likely still be changed, but it is harder now, because some users already depend on the new behaviour.
After discovering this new behavior really had been released GA, I went searching to see what was going on. I found this bug, which states "If you wish to express your opinion about the new behavior, please do so on the R-devel mailing list." https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15751
Actually it isn't the bug report that said that, it was Simon :-). If you look up some of his other posts on this topic here in the R-devel list, you'll see a couple of proposals for changes. Duncan Murdoch
So I'm sharing my opinion, as suggested. Thanks to all for the time
spent reading my opinion.
Let me also say, we are huge fans of R; many of our customers use R, and
we greatly appreciate the
efforts of the R core team. We are in the process of contributing an
H2O package back to the R
community; thanks to the CRAN moderators, as well, for their
assistance in this process.
CRAN is a fantastic resource.
I would like to share a little more insight on how this behavior affects
us, in particular. These merits
have probably already been debated, but let me state them here again to
provide the appropriate
context.
1. When dealing with larger and larger data, things become cumbersome.
Your comment that
specifying column types would work is true. But when there are
thousands+ of columns, specifying
them one by one becomes more and more of a burden, and it becomes easier
to make a mistake.
And when you do make a mistake, you can imagine a tool writer choosing
to just "do what it's told"
and swallowing the mistake. (Trying not to be smarter than the user.)
2. When working with datasets that have more and more rows, sometimes
there is a bad row.
Big data is messy. Having one bad value in one bad row contaminate the
entire dataset can be
undesirable for some. When you have millions of rows or more, each row
becomes less precious.
Many people would rather just ignore the effects of the bad row than try
to fix it. Especially in this
case, when "bad" means a bit of extra precision that likely won't have a
negative impact on the result.
(In our case, this extra precision was the output of Java's
Double.toString().)
Our users want to use R as a driver language and a reference tool.
Being able to interchange
data easily (even just snippets) between tools is very valuable.
Thanks,
Tom
Below is an example of how you can create a million row dataset which
works fine (parses as a
numeric), but then adding just one bad row (which still *looks*
numeric!) flips the entire column to
a factor. Finding that one row out of a million+ can be quite a challenge.
# Script to generate dataset.
$ cat genDataset.py
#!/usr/bin/env python
for x in range(0, 1000000):
    print(str(x) + ".1")
# Generate the dataset.
$ ./genDataset.py > million.csv
# R 3.1 thinks it's a numeric.
$ R
> df = read.csv("million.csv")
> str(df)
'data.frame': 999999 obs. of  1 variable:
 $ X0.1: num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 ...

# Add one more over-precision row.
$ echo "1.2345678901234567890" >> million.csv

# Now R 3.1 thinks it's a factor.
$ R
> df2 = read.csv("million.csv")
> str(df2)
'data.frame': 1000000 obs. of  1 variable:
 $ X0.1: Factor w/ 1000000 levels "1.1","1.2345678901234567890",..: 1 111113 222224 333335 444446 555557 666668 777779 888890 3 ...

On Apr 26, 2014, at 4:28 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
Hi, We at 0xdata use Java and R together, and the new behavior for read.csv has made R unable to read the output of Java's Double.toString().
It may be less convenient, but it's certainly not "unable". Use colClasses.
This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
It wouldn't have been a shock if you had tested pre-release versions. Commercial users of R should be contributing to its development, and that's a really easy way to do so. Duncan Murdoch
+1 for restoring old behavior.
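As a cross-check of why type.convert balks at the over-precision row in the million.csv example: a 64-bit double simply cannot carry all twenty digits of "1.2345678901234567890". A quick Python illustration of mine, not from the thread:

```python
s = "1.2345678901234567890"
x = float(s)  # nearest representable 64-bit double
# The round trip back to text loses the extra digits, which is exactly
# the signal the new type.convert reacts to by keeping the column as text.
assert str(x) != s
print(s, "->", x)
```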
Hi,
One additional follow-up here.
Unfortunately, I hit what looks like an R parsing bug that makes the Java Double.toHexString() output
unreliable for reading by R. (This is really unfortunate, because the format is intended to be lossless
and it looks like it's so close to fully working.)
You can see the spec for the conversion here:
http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#toHexString(double)
The last value in the list below is not parsed by R in the way I expected, and causes the column to flip
from numeric to factor.
-0x1.8ff831c7ffffdp-1
-0x1.aff831c7ffffdp-1
-0x1.bff831c7ffffdp-1
-0x1.cff831c7ffffdp-1
-0x1.dff831c7ffffdp-1
-0x1.eff831c7ffffdp-1
-0x1.fff831c7ffffdp-1 <<<<< this value is not parsed as a number and flips the column from numeric to factor.
Below is the R output from adding one row at a time to "bad.csv".
The last attempt results in a factor rather than a numeric column.
What's really odd about it is that the .a through .e cases work fine but the .f case doesn't.
Thanks,
Tom
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 1 obs. of  1 variable:
 $ V1: num -0.781
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 2 obs. of  1 variable:
 $ V1: num -0.781 -0.844
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 3 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 4 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 5 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906 -0.937
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 6 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906 -0.937 ...
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 7 obs. of  1 variable:
 $ V1: Factor w/ 7 levels "-0x1.8ff831c7ffffdp-1",..: 1 2 3 4 5 6 7
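For comparison, the failing value itself is a perfectly ordinary double. Python's float.fromhex(), which parses the same notation, accepts it and round-trips it exactly; my own check, not part of the original exchange:

```python
bad = "-0x1.fff831c7ffffdp-1"
x = float.fromhex(bad)
# 13 hex mantissa digits = 52 bits, so the value is exactly representable
# and the hex text survives a full round trip.
assert x.hex() == bad
assert -1.0 < x < -0.999
print(bad, "->", x)
```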
On 26/04/2014, 4:12 PM, Tom Kraljevic wrote:
Hi,
One additional follow-up here.
Unfortunately, I hit what looks like an R parsing bug that makes the Java Double.toHexString() output
unreliable for reading by R. (This is really unfortunate, because the format is intended to be lossless
and it looks like it's so close to fully working.)
You can see the spec for the conversion here:
http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#toHexString(double)
The last value in the list below is not parsed by R in the way I expected, and causes the column to flip
from numeric to factor.
-0x1.8ff831c7ffffdp-1
-0x1.aff831c7ffffdp-1
-0x1.bff831c7ffffdp-1
-0x1.cff831c7ffffdp-1
-0x1.dff831c7ffffdp-1
-0x1.eff831c7ffffdp-1
-0x1.fff831c7ffffdp-1 <<<<< this value is not parsed as a number and flips the column from numeric to factor.
That looks like a bug in the conversion code. It uses the same test for lack of accuracy for hex doubles as it uses for decimal ones, but hex doubles can be larger before they lose precision. I believe the largest integer that can be represented exactly is 2^53 - 1, i.e. 0x1.fffffffffffffp52 in this notation; can you confirm that your Java code reads it and writes the same string? This is about 1% bigger than the limit at which type.convert switches to strings or factors. Duncan Murdoch
Below is the R output from adding one row at a time to "bad.csv". The last attempt results in a factor rather than a numeric column. What's really odd about it is that the .a through .e cases work fine but the .f case doesn't. Thanks, Tom
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 1 obs. of  1 variable:
 $ V1: num -0.781
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 2 obs. of  1 variable:
 $ V1: num -0.781 -0.844
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 3 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 4 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 5 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906 -0.937
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 6 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906 -0.937 ...
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 7 obs. of  1 variable:
 $ V1: Factor w/ 7 levels "-0x1.8ff831c7ffffdp-1",..: 1 2 3 4 5 6 7
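Duncan's limit is easy to confirm numerically: 2^53 - 1 = 9007199254740991 is exactly representable, its hex-float spelling matches the string he gives (modulo the explicit '+' Python prints in the exponent), and just above 2^53 consecutive integers start to collide. A quick Python check of mine:

```python
n = 2**53 - 1
x = float(n)
assert int(x) == n                         # exactly representable
assert x.hex() == "0x1.fffffffffffffp+52"  # Duncan's 0x1.fffffffffffffp52
assert float(2**53 + 1) == float(2**53)    # precision runs out past 2^53
print(n, "->", x.hex())
```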
On 26/04/2014, 6:40 PM, Tom Kraljevic wrote:
Hi Duncan, This program and output should answer your question regarding Java behavior. Basically, the character toHexString() representation is shown to be lossless for this example (in Java). Please let me know if there is any way I can help further. I'd love for this to work! I would be happy to put all this into an R bug report if that is convenient for you.
This one has enough attention already that I don't think it will get lost, so no more bug reports are necessary. Martin Maechler (on another thread) is describing some changes that should address this. It would be really helpful if you tested it on your examples after he commits his changes. Duncan Murdoch
Thanks,
Tom
$ cat example.java
class example {
    public static void main(String[] args) {
        String value_as_string = "-0x1.fff831c7ffffdp-1";
        double value = Double.parseDouble(value_as_string);
        System.out.println("Starting string    : " + value_as_string);
        System.out.println("value toString()   : " + Double.toString(value));
        System.out.println("value toHexString(): " + Double.toHexString(value));
        long bits = Double.doubleToRawLongBits(value);
        boolean isNegative = (bits & 0x8000000000000000L) != 0;
        long biased_exponent = (bits & 0x7ff0000000000000L) >> 52;
        long exponent = biased_exponent - 1023;
        long mantissa = bits & 0x000fffffffffffffL;
        System.out.println("isNegative      : " + isNegative);
        System.out.println("biased exponent : " + biased_exponent);
        System.out.println("exponent        : " + exponent);
        System.out.println("mantissa        : " + mantissa);
        System.out.println("mantissa as hex : " + Long.toHexString(mantissa));
    }
}
$ javac example.java
$ java example
Starting string : -0x1.fff831c7ffffdp-1
value toString() : -0.999940448440611
value toHexString(): -0x1.fff831c7ffffdp-1
isNegative : true
biased exponent : 1022
exponent : -1
mantissa : 4503063234609149
mantissa as hex : fff831c7ffffd
$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
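The same field extraction can be reproduced without Java; a Python sketch of mine using the standard struct module recovers identical values for the problem number:

```python
import struct

value = float.fromhex("-0x1.fff831c7ffffdp-1")
# Reinterpret the IEEE-754 double as its raw 64-bit pattern.
bits, = struct.unpack("<Q", struct.pack("<d", value))

is_negative = (bits & 0x8000000000000000) != 0
biased_exponent = (bits & 0x7FF0000000000000) >> 52
exponent = biased_exponent - 1023
mantissa = bits & 0x000FFFFFFFFFFFFF

# Matches the Java program's output: True, 1022, -1, fff831c7ffffd.
print("isNegative      :", is_negative)
print("biased exponent :", biased_exponent)
print("exponent        :", exponent)
print("mantissa        :", mantissa)
print("mantissa as hex :", format(mantissa, "x"))
```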
On Apr 26, 2014, at 2:18 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 26/04/2014, 4:12 PM, Tom Kraljevic wrote:
Hi,

One additional follow-up here.

Unfortunately, I hit what looks like an R parsing bug that makes the Java Double.toHexString() output unreliable for reading by R. (This is really unfortunate, because the format is intended to be lossless and it looks like it's so close to fully working.)

You can see the spec for the conversion here:
http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#toHexString(double)

The last value in the list below is not parsed by R in the way I expected, and causes the column to flip from numeric to factor.

-0x1.8ff831c7ffffdp-1
-0x1.aff831c7ffffdp-1
-0x1.bff831c7ffffdp-1
-0x1.cff831c7ffffdp-1
-0x1.dff831c7ffffdp-1
-0x1.eff831c7ffffdp-1
-0x1.fff831c7ffffdp-1 <<<<< this value is not parsed as a number and flips the column from numeric to factor.
That looks like a bug in the conversion code. It uses the same test for lack of accuracy for hex doubles as it uses for decimal ones, but hex doubles can be larger before they lose precision. I believe the largest integer that can be represented exactly is 2^53 - 1, i.e. 0x1.fffffffffffffp52 in this notation; can you confirm that your Java code reads it and writes the same string? This is about 1% bigger than the limit at which type.convert switches to strings or factors. Duncan Murdoch
Hi Duncan,

I'm with Tom; I don't want to be redundant, but here's some extra info. This made me think that the problem is not a 'threshold'. Any thoughts?

Also, if the "bad" number strings are entered at the R command prompt, they are parsed correctly as the expected numbers (not factors).

thanks,
-kevin

This works: 0x1.ffadp-1
df = read.csv("bad1.csv", header=F)
str(df)
'data.frame': 1 obs. of  1 variable:
 $ V1: num 0.999

But this doesn't: 0x1.ffa000000000dp-1
df = read.csv("bad1.csv", header=F)
str(df)
'data.frame': 1 obs. of  1 variable:
 $ V1: Factor w/ 1 level "0x1.ffa000000000dp-1 ": 1

This also works (one less trailing zero): 0x1.ffa00000000dp-1
df = read.csv("bad1.csv", header=F)
str(df)
'data.frame': 1 obs. of  1 variable:
 $ V1: num 0.999
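For what it's worth, all three of the strings above parse cleanly with Python's float.fromhex(), which supports kevin's point that the problem is not a precision threshold. My own check, not from the thread:

```python
cases = [
    "0x1.ffadp-1",           # works in R
    "0x1.ffa000000000dp-1",  # fails in R
    "0x1.ffa00000000dp-1",   # works in R (one less trailing zero)
]
for s in cases:
    x = float.fromhex(s)
    assert 0.99 < x < 1.0    # all three are just below 1
    print(s, "->", x)
```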
On Fri, Apr 25, 2014 at 09:23:23PM -0700, Tom Kraljevic wrote:
This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
It WAS somewhat shocking. I trust the R core team to get things
right, and (AFAICT) they nearly always do. This was an exception, and
shocking mostly in that it was so obviously wrong to completely
discard all possibility of backwards compatibility.
The old type.convert() functionality worked fine and was very useful,
so the *obviously* right thing to do would be to at least retain the
old behavior as a (non-default) option.
Reproducing the old behavior in user R code is not simple. For
anybody else stuck with this, you can do it (probably inefficiently)
with the two functions below. Create your own version of read.table()
that calls the dtk.type.convert() below instead of the stock
type.convert(). It's not pretty, but that will do it.
dtk.type.convert <- function(xx, ..., ignore.signif.p=TRUE) {
  # Add backwards compatibility to R 3.1's "new feature":
  if (ignore.signif.p && all(dtk.can.be.numeric(xx, ignore.na.p=TRUE))) {
    if (all(is.na(xx))) type.convert(xx, ...)
    else methods::as(xx, "numeric")
  } else type.convert(xx, ...)
}

dtk.can.be.numeric <- function(xx, ignore.na.p=TRUE) {
  # Test whether a value can be converted to numeric without becoming NA.
  # AKA, can this value be usefully represented as numeric?
  # Optionally ignore NAs already present in the incoming data.
  old.warn <- options(warn = -1); on.exit(options(old.warn))
  aa <- !is.na(as.numeric(xx))
  if (ignore.na.p) (is.na(xx) | aa) else aa
}
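For readers following along outside R, the heart of Andrew's workaround is the "can every entry be read as a number?" test. A rough Python analogue (the function name and NA markers are my own, not from the post):

```python
def can_be_numeric(values, ignore_na=True):
    # True if every entry parses as a float; optionally let pre-existing
    # missing-value markers through, mirroring dtk.can.be.numeric(ignore.na.p=TRUE).
    for v in values:
        if ignore_na and v in (None, "NA", ""):
            continue
        try:
            float(v)
        except (TypeError, ValueError):
            return False
    return True

print(can_be_numeric(["1.1", "NA", "1.2345678901234567890"]))  # True
print(can_be_numeric(["1.1", "apple"]))                        # False
```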
Andrew Piskorski <atp at piskorski.com>