Please make Pre-3.1 read.csv (type.convert) behavior available

12 messages · Dirk Eddelbuettel, Duncan Murdoch, Tom Kraljevic +2 more

#
Hi,

We at 0xdata use Java and R together, and the new behavior for read.csv has
made R unable to read the output of Java's Double.toString().

This, needless to say, is disruptive for us.  (Actually, it was downright shocking.)

+1 for restoring old behavior.
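For concreteness, the clash comes from Double.toString() emitting the shortest decimal string that round-trips to the same double, which can run to 17 significant digits; R 3.1's type.convert() then keeps such over-precise strings as character data instead of rounding. A minimal sketch (the class name is illustrative):

```java
public class ShortestRoundTrip {
    public static void main(String[] args) {
        // Double.toString() emits the shortest decimal string that
        // round-trips to the same double -- here, 17 significant digits.
        double d = 0.1 + 0.2;
        String s = Double.toString(d);
        System.out.println(s);  // 0.30000000000000004
        // The string round-trips exactly on the Java side:
        System.out.println(Double.parseDouble(s) == d);  // true
    }
}
```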

Thanks,
Tom
#
On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
It may be less convenient, but it's certainly not "unable".  Use colClasses.
It wouldn't have been a shock if you had tested pre-release versions. 
Commercial users of R should be contributing to its development, and 
that's a really easy way to do so.

Duncan Murdoch
#
On 26 April 2014 at 07:28, Duncan Murdoch wrote:
| On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
| >
| > Hi,
| >
| > We at 0xdata use Java and R together, and the new behavior for read.csv has
| > made R unable to read the output of Java's Double.toString().
| 
| It may be less convenient, but it's certainly not "unable".  Use colClasses.
| 
| 
| >
| > This, needless to say, is disruptive for us.  (Actually, it was downright shocking.)
| 
| It wouldn't have been a shock if you had tested pre-release versions. 
| Commercial users of R should be contributing to its development, and 
| that's a really easy way to do so.

Seconded. For what it is worth, I made five pre-releases available within
Debian. Testing each of these was just an apt-get away.

In any event, you can also farm out the old behaviour to a (local or even
CRAN) package if your life depends upon it.

Or you could use real serialization rather than relying on the crutch that is csv.

Dirk
#
Hi Dirk,


Thanks for taking the time to respond (both here and in other forums).

Most of what I wanted to share I put in a follow-up response to Duncan (please read
that thread if you're interested).

I would like to comment on the last point you brought up, though, in case anyone else
finds it beneficial.

For data which is exchanged programmatically machine-to-machine, I was able to
use Java's Double.toHexString() as a direct replacement for toString().  R is able
to read this lossless (but still text) format.  So this addresses some of the challenges
we have with this change.
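The substitution is a one-line change on the Java side: toHexString() writes the exact bits of the double, so there is no decimal rounding to trip over. A quick sketch (the class name is illustrative):

```java
public class HexRoundTrip {
    public static void main(String[] args) {
        double d = 0.1 + 0.2;
        // toHexString() encodes the exact significand and exponent,
        // so no decimal rounding is involved at all.
        String hex = Double.toHexString(d);
        System.out.println(hex);  // 0x1.3333333333334p-2
        // parseDouble() reads the hex form back to the identical double:
        System.out.println(Double.parseDouble(hex) == d);  // true
    }
}
```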


Thanks,
Tom
On Apr 26, 2014, at 5:26 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
#
On 26/04/2014, 12:28 PM, Tom Kraljevic wrote:
The beta stage is quite late.  There's a non-zero risk that a bug 
detected during the beta stage will make it through to release, 
especially if the report doesn't arrive until after we've switched to 
release candidates.

This change was made very early in the development cycle of 3.1.0, back 
in March 2013.  If you are making serious use of R, I'd really recommend 
that you try out some of the R-devel versions early, when design 
decisions are being made.  I suspect this feature would have been 
changed if we'd heard your complaints then.  It'll likely still be 
changed, but it is harder now, because some users already depend on the 
new behaviour.
Actually it isn't the bug that said that, it was Simon :-).  If you look
up some of his other posts on this topic here in the R-devel list,
you'll see a couple of proposals for changes.

Duncan Murdoch
#
Hi,


One additional follow-up here.

Unfortunately, I hit what looks like an R parsing bug that makes the Java Double.toHexString() output
unreliable for reading by R.  (This is really unfortunate, because the format is intended to be lossless
and it looks like it's so close to fully working.)

You can see the spec for the conversion here:
    http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#toHexString(double)

The last value in the list below is not parsed by R in the way I expected, and causes the column to flip 
from numeric to factor.


-0x1.8ff831c7ffffdp-1
-0x1.aff831c7ffffdp-1
-0x1.bff831c7ffffdp-1
-0x1.cff831c7ffffdp-1
-0x1.dff831c7ffffdp-1
-0x1.eff831c7ffffdp-1
-0x1.fff831c7ffffdp-1           <<<<< this value is not parsed as a number and flips the column from numeric to factor.


Below is the R output from adding one row at a time to 'bad.csv'.
The last attempt results in a factor rather than a numeric column.

What's really odd about it is that the .a through .e cases work fine but the .f case doesn't.


Thanks,
Tom
'data.frame':	1 obs. of  1 variable:
 $ V1: num -0.781
'data.frame':	2 obs. of  1 variable:
 $ V1: num  -0.781 -0.844
'data.frame':	3 obs. of  1 variable:
 $ V1: num  -0.781 -0.844 -0.875
'data.frame':	4 obs. of  1 variable:
 $ V1: num  -0.781 -0.844 -0.875 -0.906
'data.frame':	5 obs. of  1 variable:
 $ V1: num  -0.781 -0.844 -0.875 -0.906 -0.937
'data.frame':	6 obs. of  1 variable:
 $ V1: num  -0.781 -0.844 -0.875 -0.906 -0.937 ...
'data.frame':	7 obs. of  1 variable:
 $ V1: Factor w/ 7 levels "-0x1.8ff831c7ffffdp-1",..: 1 2 3 4 5 6 7
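For reference, all seven strings are valid hexadecimal floating-point literals on the Java side; Java parses every one of them, including the .fff variant that trips R. A quick check (the class name is illustrative):

```java
public class ParseAll {
    public static void main(String[] args) {
        String[] vals = {
            "-0x1.8ff831c7ffffdp-1", "-0x1.aff831c7ffffdp-1",
            "-0x1.bff831c7ffffdp-1", "-0x1.cff831c7ffffdp-1",
            "-0x1.dff831c7ffffdp-1", "-0x1.eff831c7ffffdp-1",
            "-0x1.fff831c7ffffdp-1"  // the value R refuses to parse
        };
        for (String v : vals) {
            // parseDouble() accepts hex floating-point literals
            double d = Double.parseDouble(v);
            System.out.println(v + " -> " + d);
        }
    }
}
```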
#
On 26/04/2014, 4:12 PM, Tom Kraljevic wrote:
That looks like a bug in the conversion code.  It uses the same test for 
lack of accuracy for hex doubles as it uses for decimal ones, but hex 
doubles can be larger before they lose precision.  I believe the largest 
integer that can be represented exactly is 2^53 - 1, i.e.

0x1.fffffffffffffp52

in this notation; can you confirm that your Java code reads it and 
writes the same string?  This is about 1% bigger than the limit at which 
type.convert switches to strings or factors.
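A quick Java check of that round trip would look like this (a sketch; the class name is illustrative):

```java
public class MaxExactInt {
    public static void main(String[] args) {
        double max = 9007199254740991.0;  // 2^53 - 1
        String hex = Double.toHexString(max);
        System.out.println(hex);  // 0x1.fffffffffffffp52
        // parseDouble() reads the hex form back to the identical double:
        System.out.println(Double.parseDouble(hex) == max);  // true
    }
}
```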

Duncan Murdoch
#
On 26/04/2014, 6:40 PM, Tom Kraljevic wrote:
This one has enough attention already that I don't think it will get 
lost, so no more bug reports are necessary.  Martin Maechler (on another 
thread) is describing some changes that should address this.  It would 
be really helpful if you tested it on your examples after he commits his 
changes.

Duncan Murdoch
#
Hi Duncan,
I'm with Tom; I don't want to be redundant, but here's some extra info.
This made me think that the problem is not a 'threshold'. Any thoughts?
Also, if the "bad" number strings are entered at the R command prompt, they
are parsed correctly as the expected numbers (not factors).

thanks,
-kevin


this works

0x1.ffadp-1
'data.frame':    1 obs. of  1 variable:
 $ V1: num 0.999


but this doesn't

0x1.ffa000000000dp-1
'data.frame':    1 obs. of  1 variable:
 $ V1: Factor w/ 1 level "0x1.ffa000000000dp-1 ": 1


this also works, which is one less trailing zero.

0x1.ffa00000000dp-1
'data.frame':    1 obs. of  1 variable:
 $ V1: num 0.999 



--
View this message in context: http://r.789695.n4.nabble.com/Please-make-Pre-3-1-read-csv-type-convert-behavior-available-tp4689507p4689553.html
Sent from the R devel mailing list archive at Nabble.com.
#
On Fri, Apr 25, 2014 at 09:23:23PM -0700, Tom Kraljevic wrote:
It WAS somewhat shocking.  I trust the R core team to get things
right, and (AFAICT) they nearly always do.  This was an exception, and
shocking mostly in that it was so obviously wrong to completely
discard all possibility of backwards compatibility.

The old type.convert() functionality worked fine and was very useful,
so the *obviously* right thing to do would be to at least retain the
old behavior as a (non-default) option.

Reproducing the old behavior in user R code is not simple.  For
anybody else stuck with this, you can do it (probably inefficiently)
with the two functions below.  Create your own version of read.table()
that calls the dtk.type.convert() below instead of the stock
type.convert().  It's not pretty, but that will do it.


dtk.type.convert <- function(xx, ..., ignore.signif.p = TRUE) {
   # Add backwards compatibility to R 3.1's "new feature":
   if (ignore.signif.p && all(dtk.can.be.numeric(xx, ignore.na.p = TRUE))) {
      if (all(is.na(xx))) type.convert(xx, ...)
      else methods::as(xx, "numeric")
   } else type.convert(xx, ...)
}

dtk.can.be.numeric <- function(xx, ignore.na.p = TRUE) {
   # Test whether a value can be converted to numeric without becoming NA.
   # AKA, can this value be usefully represented as numeric?
   # Optionally ignore NAs already present in the incoming data.

   # Suppress the "NAs introduced by coercion" warning during the test.
   old.warn <- options(warn = -1); on.exit(options(old.warn))
   aa <- !is.na(as.numeric(xx))
   if (ignore.na.p) (is.na(xx) | aa) else aa
}