On Wed, Mar 16, 2016 at 03:18:27PM -0400, Duncan Murdoch wrote:
On 16/03/2016 1:40 PM, Jan Kim wrote:
Barry: that's an interesting hack.
I do feel compelled to make two comments, though, regarding the
general issue rather than the scraping idea:
(1) If your situation is that that image (.RData file) is the only
copy of the data, you'll need to rescue the data from that as soon as
possible anyway. Something like
load(".RData");
write.csv(mydataframe, file = "mydata.csv");
should do the trick. It will be slow, but you'll need to do it just
once, so you might as well enjoy your coffee while you wait. From that
point on, work with the mydata.csv file for getting at the colnames
(and anything else as well).
(2) If there's any chance / risk that scraping data off images is not
a one-off, the time to prevent that from catching on is now. If data is
of any value at all, it should be handled in a sane, portable, textual
format. For tabular data, csv is normally adequate or at least good
enough, but .RData images are never a good idea.
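
The rescue step in (1) can be sketched a little more defensively by loading the image into its own environment, so nothing in the current workspace gets overwritten. This is only a sketch under assumptions: "image.RData" and "mydataframe" are hypothetical names, and the first few lines build a small stand-in image in a temporary directory so the example runs on its own.

```r
# Hypothetical stand-in for the existing image; in the real case the
# .RData file already exists and these four lines are not needed.
img <- file.path(tempdir(), "image.RData")
mydataframe <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
save(mydataframe, file = img)
rm(mydataframe)

# Load into a separate environment instead of the global workspace.
rescue <- new.env()
loaded <- load(img, envir = rescue)  # load() returns the names of the loaded objects
print(loaded)

# Write the data frame of interest out as text, without row names.
csv <- file.path(tempdir(), "mydata.csv")
write.csv(get("mydataframe", envir = rescue), file = csv, row.names = FALSE)

# From here on the column names can be read from the CSV alone.
cols <- colnames(read.csv(csv))
print(cols)
```

Inspecting `loaded` first is useful when it is not known in advance which objects the image contains.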
I agree with the sentiment, but not with the choice of .csv as a
"sane, portable, textual format". CSV has no type information
included, so strings that contain only digits can turn into numbers
(and get rounded in the process), things that look like
dates can get converted to different formats, etc.
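
A minimal illustration of the type-information problem, assuming a hypothetical column `zip` of digit-only strings: a plain write.csv() / read.csv() round trip turns the strings into integers unless the reader is told the column class explicitly.

```r
# Hypothetical example data: identifiers that happen to consist of digits.
df <- data.frame(zip = c("01234", "98765"), stringsAsFactors = FALSE)

csv <- file.path(tempdir(), "types.csv")
write.csv(df, file = csv, row.names = FALSE)

# Default read: the column type is guessed from the text.
back <- read.csv(csv)
print(class(df$zip))    # character in the original
print(class(back$zip))  # integer after the round trip; the leading zero is lost

# The type has to be supplied from outside the file to get the strings back.
back2 <- read.csv(csv, colClasses = c(zip = "character"))
print(identical(back2$zip, df$zip))
```

The `colClasses` argument works, but it is exactly the kind of metadata the CSV file itself does not carry.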
I entirely agree. In hindsight, I should have stated that the .RData files,
as well as the R code to load and extract stuff from them, should be stored
permanently and documented.
The .RData format has the disadvantage of being hard to use outside
R, but at least it is usable in R.
yes -- that's why I thought it's a good idea to use R to pluck out the
valuable data, so (1) they can still be accessed even if the .RData
format changes and (2) they're in their own file, separated from the
(potentially humongous, see my P.S.) amount of other stuff caught up
in the image.
But to reiterate, the .RData file should be secured as well if that's
the only remaining primary / original source of the data.
I don't know what I'd recommend if I wanted a portable textual
format. JSON is close, but it can't handle the full
range of data that R can handle (e.g. no Inf). dput() on a
dataframe is text, but nothing but R can read it.
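
As a small sketch of the dput() option, assuming a hypothetical data frame with a digit-only string column and an Inf value: the text written by dput() can be read back with dget(), and both the column types and the Inf survive, though only R can parse the result.

```r
# Hypothetical example data with the two problem cases discussed above.
df <- data.frame(id = c("007", "042"), x = c(1.5, Inf),
                 stringsAsFactors = FALSE)

txt <- file.path(tempdir(), "df.R")
dput(df, file = txt)   # plain-text R expression that rebuilds the object

back <- dget(txt)      # parse and evaluate the text representation

print(class(back$id))  # still character, leading zeros intact
print(back$id)
print(back$x)          # Inf is preserved
```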
yes, that's the problem with "JSON": it's JavaScript, but not really
an object notation, as it doesn't store class structure metadata.
So again, the best bet is to secure multiple levels: the .RData
image to preserve the R types, the R script to be able to identify
the relevant variable(s), and the text version to avoid depending on
availability of R / an R version still able to read the image format.
Best regards, Jan