Hi, We at 0xdata use Java and R together, and the new behavior for read.csv has made R unable to read the output of Java's Double.toString(). This, needless to say, is disruptive for us. (Actually, it was downright shocking.) +1 for restoring old behavior. Thanks, Tom
Please make Pre-3.1 read.csv (type.convert) behavior available
12 messages · Dirk Eddelbuettel, Duncan Murdoch, Tom Kraljevic +2 more
On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
Hi, We at 0xdata use Java and R together, and the new behavior for read.csv has made R unable to read the output of Java's Double.toString().
It may be less convenient, but it's certainly not "unable". Use colClasses.
This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
It wouldn't have been a shock if you had tested pre-release versions. Commercial users of R should be contributing to its development, and that's a really easy way to do so. Duncan Murdoch
+1 for restoring old behavior.
On 26 April 2014 at 07:28, Duncan Murdoch wrote:
| On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
| > Hi,
| >
| > We at 0xdata use Java and R together, and the new behavior for read.csv has
| > made R unable to read the output of Java's Double.toString().
|
| It may be less convenient, but it's certainly not "unable". Use colClasses.
|
| > This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
|
| It wouldn't have been a shock if you had tested pre-release versions.
| Commercial users of R should be contributing to its development, and
| that's a really easy way to do so.

Seconded.

For what it is worth, I made five pre-releases available within Debian. Testing these was just an apt-get away.

In any event, you can also farm out the old behaviour to a (local or even CRAN) package that provides the old behaviour if your life depends upon it. Or you could use real serialization rather than relying on the crutch that is csv.

Dirk
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
Hi Dirk, Thanks for taking the time to respond (both here and in other forums). Most of what I wanted to share I put in a followup response to Duncan (please read that thread if you're interested). I would like to comment on the last point you brought up, though, in case anyone else finds it beneficial. For data which is exchanged programmatically machine-to-machine, I was able to use Java's Double.toHexString() as a direct replacement for toString(). R is able to read this lossless (but still text) format. So this addresses some of the challenges we have with this change. Thanks, Tom
On Apr 26, 2014, at 5:26 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
On 26 April 2014 at 07:28, Duncan Murdoch wrote:
| On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
| > Hi,
| >
| > We at 0xdata use Java and R together, and the new behavior for read.csv has
| > made R unable to read the output of Java's Double.toString().
|
| It may be less convenient, but it's certainly not "unable". Use colClasses.
|
| > This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
|
| It wouldn't have been a shock if you had tested pre-release versions.
| Commercial users of R should be contributing to its development, and
| that's a really easy way to do so.

Seconded.

For what it is worth, I made five pre-releases available within Debian. Testing these was just an apt-get away.

In any event, you can also farm out the old behaviour to a (local or even CRAN) package that provides the old behaviour if your life depends upon it. Or you could use real serialization rather than relying on the crutch that is csv.

Dirk

--
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
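A side note on the hex-float workaround Tom describes: Python's float.hex() and float.fromhex() read and write the same C99-style hexadecimal notation as Java's Double.toHexString(), so the losslessness of the format is easy to check outside Java. A sketch of mine, not part of the original thread:

```python
import math
import random

# Round-trip doubles through the hex text representation.
# The format encodes the full 52-bit mantissa, so no value may change.
random.seed(1)
values = [0.1, -2.5e300, math.pi] + [random.uniform(-1e9, 1e9) for _ in range(1000)]
for v in values:
    text = v.hex()                      # e.g. 0.1 -> '0x1.999999999999ap-4'
    assert float.fromhex(text) == v, (v, text)
print("all", len(values), "round trips exact")
```

This is exactly why Tom could use the hex form as a drop-in replacement for Double.toString() in machine-to-machine exchange.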
On 26/04/2014, 12:28 PM, Tom Kraljevic wrote:
Hi Duncan, Please allow me to add a bit more context, which I probably should have added to my original message. We actually did see this in an R 3.1 beta (pulled in by an apt-get) and thought the behavior had been released accidentally. From my user perspective, the parsing of a string like "1.2345678901234567890" into a factor was so surprising, I actually assumed it was just a really bad bug that would be fixed before the "real" release. I didn't bother reporting it since I assumed beta users would be heavily impacted and there was no way it wouldn't be fixed. Apologies for that mistake on my part.
The beta stage is quite late. There's a non-zero risk that a bug detected during the beta stage will make it through to release, especially if the report doesn't arrive until after we've switched to release candidates. This change was made very early in the development cycle of 3.1.0, back in March 2013. If you are making serious use of R, I'd really recommend that you try out some of the R-devel versions early, when design decisions are being made. I suspect this feature would have been changed if we'd heard your complaints then. It'll likely still be changed, but it is harder now, because some users already depend on the new behaviour.
After discovering this new behavior really had been released GA, I went searching to see what was going on. I found this bug, which states "If you wish to express your opinion about the new behavior, please do so on the R-devel mailing list." https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15751
Actually it isn't the bug report that said that, it was Simon :-). If you look up some of his other posts on this topic here in the R-devel list, you'll see a couple of proposals for changes. Duncan Murdoch
So I'm sharing my opinion, as suggested. Thanks to all for the time
spent reading my opinion.
Let me also say, we are huge fans of R; many of our customers use R, and
we greatly appreciate the
efforts of the R core team. We are in the process of contributing an
H2O package back to the R
community; thanks to the CRAN moderators, as well, for their
assistance in this process.
CRAN is a fantastic resource.
I would like to share a little more insight on how this behavior affects
us, in particular. These merits
have probably already been debated, but let me state them here again to
provide the appropriate
context.
1. When dealing with larger and larger data, things become cumbersome.
Your comment that
specifying column types would work is true. But when there are
thousands+ of columns, specifying
them one by one becomes more and more of a burden, and it becomes easier
to make a mistake.
And when you do make a mistake, you can imagine a tool writer choosing
to just "do what it's told"
and swallowing the mistake. (Trying not to be smarter than the user.)
2. When working with datasets that have more and more rows, sometimes
there is a bad row.
Big data is messy. Having one bad value in one bad row contaminate the
entire dataset can be
undesirable for some. When you have millions of rows or more, each row
becomes less precious.
Many people would rather just ignore the effects of the bad row than try
to fix it. Especially in this
case, when "bad" means a bit of extra precision that likely won't have a
negative impact on the result.
(In our case, this extra precision was the output of Java's
Double.toString().)
Our users want to use R as a driver language and a reference tool.
Being able to interchange
data easily (even just snippets) between tools is very valuable.
Thanks,
Tom
Below is an example of how you can create a million row dataset which
works fine (parses as a
numeric), but then adding just one bad row (which still *looks*
numeric!) flips the entire column to
a factor. Finding that one row out of a million+ can be quite a challenge.
# Script to generate dataset.
$ cat genDataset.py
#!/usr/bin/env python
for x in range(0, 1000000):
    print(str(x) + ".1")
# Generate the dataset.
$ ./genDataset.py > million.csv
# R 3.1 thinks it's a numeric.
$ R
> df = read.csv("million.csv")
> str(df)
'data.frame': 999999 obs. of  1 variable:
 $ X0.1: num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 ...

# Add one more over-precision row.
$ echo "1.2345678901234567890" >> million.csv

# Now R 3.1 thinks it's a factor.
$ R
> df2 = read.csv("million.csv")
> str(df2)
'data.frame': 1000000 obs. of  1 variable:
 $ X0.1: Factor w/ 1000000 levels "1.1","1.2345678901234567890",..: 1 111113 222224 333335 444446 555557 666668 777779 888890 3 ...

On Apr 26, 2014, at 4:28 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
Hi, We at 0xdata use Java and R together, and the new behavior for read.csv has made R unable to read the output of Java's Double.toString().
It may be less convenient, but it's certainly not "unable". Use colClasses.
This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
It wouldn't have been a shock if you had tested pre-release versions. Commercial users of R should be contributing to its development, and that's a really easy way to do so. Duncan Murdoch
+1 for restoring old behavior.
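As a cross-check of why type.convert balks at the over-precision row in the million.csv example: a 64-bit double simply cannot carry all twenty digits of "1.2345678901234567890". A quick Python illustration of mine, not from the thread:

```python
s = "1.2345678901234567890"
x = float(s)  # nearest representable 64-bit double
# The round trip back to text loses the extra digits, which is exactly
# the signal the new type.convert reacts to by keeping the column as text.
assert str(x) != s
print(s, "->", x)
```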
Hi,
One additional follow-up here.
Unfortunately, I hit what looks like an R parsing bug that makes the Java Double.toHexString() output
unreliable for reading by R. (This is really unfortunate, because the format is intended to be lossless
and it looks like it's so close to fully working.)
You can see the spec for the conversion here:
http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#toHexString(double)
The last value in the list below is not parsed by R in the way I expected, and causes the column to flip
from numeric to factor.
-0x1.8ff831c7ffffdp-1
-0x1.aff831c7ffffdp-1
-0x1.bff831c7ffffdp-1
-0x1.cff831c7ffffdp-1
-0x1.dff831c7ffffdp-1
-0x1.eff831c7ffffdp-1
-0x1.fff831c7ffffdp-1 <<<<< this value is not parsed as a number and flips the column from numeric to factor.
Below is the R output from adding one row at a time to "bad.csv".
The last attempt results in a factor rather than a numeric column.
What's really odd about it is that the .a through .e cases work fine but the .f case doesn't.
Thanks,
Tom
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 1 obs. of  1 variable:
 $ V1: num -0.781
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 2 obs. of  1 variable:
 $ V1: num -0.781 -0.844
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 3 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 4 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 5 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906 -0.937
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 6 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906 -0.937 ...
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 7 obs. of  1 variable:
 $ V1: Factor w/ 7 levels "-0x1.8ff831c7ffffdp-1",..: 1 2 3 4 5 6 7
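For comparison, the failing value itself is a perfectly ordinary double. Python's float.fromhex(), which parses the same notation, accepts it and round-trips it exactly; my own check, not part of the original exchange:

```python
bad = "-0x1.fff831c7ffffdp-1"
x = float.fromhex(bad)
# 13 hex mantissa digits = 52 bits, so the value is exactly representable
# and the hex text survives a full round trip.
assert x.hex() == bad
assert -1.0 < x < -0.999
print(bad, "->", x)
```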
On 26/04/2014, 4:12 PM, Tom Kraljevic wrote:
Hi,
One additional follow-up here.
Unfortunately, I hit what looks like an R parsing bug that makes the Java Double.toHexString() output
unreliable for reading by R. (This is really unfortunate, because the format is intended to be lossless
and it looks like it's so close to fully working.)
You can see the spec for the conversion here:
http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#toHexString(double)
The last value in the list below is not parsed by R in the way I expected, and causes the column to flip
from numeric to factor.
-0x1.8ff831c7ffffdp-1
-0x1.aff831c7ffffdp-1
-0x1.bff831c7ffffdp-1
-0x1.cff831c7ffffdp-1
-0x1.dff831c7ffffdp-1
-0x1.eff831c7ffffdp-1
-0x1.fff831c7ffffdp-1 <<<<< this value is not parsed as a number and flips the column from numeric to factor.
That looks like a bug in the conversion code. It uses the same test for lack of accuracy for hex doubles as it uses for decimal ones, but hex doubles can be larger before they lose precision. I believe the largest integer that can be represented exactly is 2^53 - 1, i.e. 0x1.fffffffffffffp52 in this notation; can you confirm that your Java code reads it and writes the same string? This is about 1% bigger than the limit at which type.convert switches to strings or factors. Duncan Murdoch
Below is the R output from adding one row at a time to "bad.csv". The last attempt results in a factor rather than a numeric column. What's really odd about it is that the .a through .e cases work fine but the .f case doesn't. Thanks, Tom
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 1 obs. of  1 variable:
 $ V1: num -0.781
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 2 obs. of  1 variable:
 $ V1: num -0.781 -0.844
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 3 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 4 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 5 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906 -0.937
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 6 obs. of  1 variable:
 $ V1: num -0.781 -0.844 -0.875 -0.906 -0.937 ...
bad.df = read.csv(file="/Users/tomk/bad.csv", header=F)
str(bad.df)
'data.frame': 7 obs. of  1 variable:
 $ V1: Factor w/ 7 levels "-0x1.8ff831c7ffffdp-1",..: 1 2 3 4 5 6 7
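Duncan's limit is easy to confirm numerically: 2^53 - 1 = 9007199254740991 is exactly representable, its hex-float spelling matches the string he gives (modulo the explicit '+' Python prints in the exponent), and just above 2^53 consecutive integers start to collide. A quick Python check of mine:

```python
n = 2**53 - 1
x = float(n)
assert int(x) == n                         # exactly representable
assert x.hex() == "0x1.fffffffffffffp+52"  # Duncan's 0x1.fffffffffffffp52
assert float(2**53 + 1) == float(2**53)    # precision runs out past 2^53
print(n, "->", x.hex())
```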
On 26/04/2014, 6:40 PM, Tom Kraljevic wrote:
Hi Duncan, This program and output should answer your question regarding Java behavior. Basically, the character toHexString() representation is shown to be lossless for this example (in Java). Please let me know if there is any way I can help further. I'd love for this to work! I would be happy to put all this into an R bug report if that is convenient for you.
This one has enough attention already that I don't think it will get lost, so no more bug reports are necessary. Martin Maechler (on another thread) is describing some changes that should address this. It would be really helpful if you tested it on your examples after he commits his changes. Duncan Murdoch
Thanks,
Tom
$ cat example.java
class example {
    public static void main(String[] args) {
        String value_as_string = "-0x1.fff831c7ffffdp-1";
        double value = Double.parseDouble(value_as_string);
        System.out.println("Starting string    : " + value_as_string);
        System.out.println("value toString()   : " + Double.toString(value));
        System.out.println("value toHexString(): " + Double.toHexString(value));
        long bits = Double.doubleToRawLongBits(value);
        boolean isNegative = (bits & 0x8000000000000000L) != 0;
        long biased_exponent = (bits & 0x7ff0000000000000L) >> 52;
        long exponent = biased_exponent - 1023;
        long mantissa = bits & 0x000fffffffffffffL;
        System.out.println("isNegative      : " + isNegative);
        System.out.println("biased exponent : " + biased_exponent);
        System.out.println("exponent        : " + exponent);
        System.out.println("mantissa        : " + mantissa);
        System.out.println("mantissa as hex : " + Long.toHexString(mantissa));
    }
}
$ javac example.java
$ java example
Starting string : -0x1.fff831c7ffffdp-1
value toString() : -0.999940448440611
value toHexString(): -0x1.fff831c7ffffdp-1
isNegative : true
biased exponent : 1022
exponent : -1
mantissa : 4503063234609149
mantissa as hex : fff831c7ffffd
$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
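The same field extraction can be reproduced without Java; a Python sketch of mine using the standard struct module recovers identical values for the problem number:

```python
import struct

value = float.fromhex("-0x1.fff831c7ffffdp-1")
# Reinterpret the IEEE-754 double as its raw 64-bit pattern.
bits, = struct.unpack("<Q", struct.pack("<d", value))

is_negative = (bits & 0x8000000000000000) != 0
biased_exponent = (bits & 0x7FF0000000000000) >> 52
exponent = biased_exponent - 1023
mantissa = bits & 0x000FFFFFFFFFFFFF

# Matches the Java program's output: True, 1022, -1, fff831c7ffffd.
print("isNegative      :", is_negative)
print("biased exponent :", biased_exponent)
print("exponent        :", exponent)
print("mantissa        :", mantissa)
print("mantissa as hex :", format(mantissa, "x"))
```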
On Apr 26, 2014, at 2:18 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 26/04/2014, 4:12 PM, Tom Kraljevic wrote:
Hi,

One additional follow-up here.

Unfortunately, I hit what looks like an R parsing bug that makes the Java Double.toHexString() output unreliable for reading by R. (This is really unfortunate, because the format is intended to be lossless and it looks like it's so close to fully working.)

You can see the spec for the conversion here:
http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#toHexString(double)

The last value in the list below is not parsed by R in the way I expected, and causes the column to flip from numeric to factor.

-0x1.8ff831c7ffffdp-1
-0x1.aff831c7ffffdp-1
-0x1.bff831c7ffffdp-1
-0x1.cff831c7ffffdp-1
-0x1.dff831c7ffffdp-1
-0x1.eff831c7ffffdp-1
-0x1.fff831c7ffffdp-1 <<<<< this value is not parsed as a number and flips the column from numeric to factor.
That looks like a bug in the conversion code. It uses the same test for lack of accuracy for hex doubles as it uses for decimal ones, but hex doubles can be larger before they lose precision. I believe the largest integer that can be represented exactly is 2^53 - 1, i.e. 0x1.fffffffffffffp52 in this notation; can you confirm that your Java code reads it and writes the same string? This is about 1% bigger than the limit at which type.convert switches to strings or factors. Duncan Murdoch
Hi Duncan,

I'm with Tom; I don't want to be redundant, but here's some extra info. This made me think that the problem is not a 'threshold'. Any thoughts?

Also, if the "bad" number strings are entered at the R command prompt, they are parsed correctly as the expected numbers (not factors).

thanks,
-kevin

This works: 0x1.ffadp-1
df = read.csv("bad1.csv", header=F)
str(df)
'data.frame': 1 obs. of  1 variable:
 $ V1: num 0.999

But this doesn't: 0x1.ffa000000000dp-1
df = read.csv("bad1.csv", header=F)
str(df)
'data.frame': 1 obs. of  1 variable:
 $ V1: Factor w/ 1 level "0x1.ffa000000000dp-1 ": 1

This also works (one less trailing zero): 0x1.ffa00000000dp-1
df = read.csv("bad1.csv", header=F)
str(df)
'data.frame': 1 obs. of  1 variable:
 $ V1: num 0.999
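For what it's worth, all three of the strings above parse cleanly with Python's float.fromhex(), which supports kevin's point that the problem is not a precision threshold. My own check, not from the thread:

```python
cases = [
    "0x1.ffadp-1",           # works in R
    "0x1.ffa000000000dp-1",  # fails in R
    "0x1.ffa00000000dp-1",   # works in R (one less trailing zero)
]
for s in cases:
    x = float.fromhex(s)
    assert 0.99 < x < 1.0    # all three are just below 1
    print(s, "->", x)
```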
On Fri, Apr 25, 2014 at 09:23:23PM -0700, Tom Kraljevic wrote:
This, needless to say, is disruptive for us. (Actually, it was downright shocking.)
It WAS somewhat shocking. I trust the R core team to get things
right, and (AFAICT) they nearly always do. This was an exception, and
shocking mostly in that it was so obviously wrong to completely
discard all possibility of backwards compatibility.
The old type.convert() functionality worked fine and was very useful,
so the *obviously* right thing to do would be to at least retain the
old behavior as a (non-default) option.
Reproducing the old behavior in user R code is not simple. For
anybody else stuck with this, you can do it (probably inefficiently)
with the two functions below. Create your own version of read.table()
that calls the dtk.type.convert() below instead of the stock
type.convert(). It's not pretty, but that will do it.
dtk.type.convert <- function(xx, ..., ignore.signif.p=TRUE) {
  # Add backwards compatibility to R 3.1's "new feature":
  if (ignore.signif.p && all(dtk.can.be.numeric(xx, ignore.na.p=TRUE))) {
    if (all(is.na(xx))) type.convert(xx, ...)
    else methods::as(xx, "numeric")
  } else type.convert(xx, ...)
}

dtk.can.be.numeric <- function(xx, ignore.na.p=TRUE) {
  # Test whether a value can be converted to numeric without becoming NA.
  # AKA, can this value be usefully represented as numeric?
  # Optionally ignore NAs already present in the incoming data.
  old.warn <- options(warn = -1); on.exit(options(old.warn))
  aa <- !is.na(as.numeric(xx))
  if (ignore.na.p) (is.na(xx) | aa) else aa
}
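For readers following along outside R, the heart of Andrew's workaround is the "can every entry be read as a number?" test. A rough Python analogue (the function name and NA markers are my own, not from the post):

```python
def can_be_numeric(values, ignore_na=True):
    # True if every entry parses as a float; optionally let pre-existing
    # missing-value markers through, mirroring dtk.can.be.numeric(ignore.na.p=TRUE).
    for v in values:
        if ignore_na and v in (None, "NA", ""):
            continue
        try:
            float(v)
        except (TypeError, ValueError):
            return False
    return True

print(can_be_numeric(["1.1", "NA", "1.2345678901234567890"]))  # True
print(can_be_numeric(["1.1", "apple"]))                        # False
```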
Andrew Piskorski <atp at piskorski.com>