Q: Suggestions for long-term data/program storage policy?

7 messages · Alexander Ploner, paul sorenson, Duncan Murdoch +3 more

#
Alexander Ploner wrote:
I am coming more from a software development angle, but you might want to 
take a look at Subversion for versioning your projects.  For non-geeky 
types, TortoiseSVN has a point-and-click interface.

It handles binary files efficiently and you can easily go back and get 
earlier versions of your projects.

http://subversion.tigris.org/
#
Alexander Ploner wrote:
I think sources will be; binaries, much less reliably.  (I just 
discovered that one or two of the old Windows binaries are corrupted; 
I'm not sure I'll be able to find good copies.)
I think the intention is that it will be supported in future versions of 
R, but storing data in a binary format is risky.  What if you don't use 
R in 5 years?  You would find it a lot easier to decode text format 
files in another package than .RData format.

The other advantage of text format is that it works very well with 
version control systems like Subversion or CVS.  You can see several 
versions of the file, see comments on why changes were made, etc.

Duncan Murdoch
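
The point about decoding text files outside R can be illustrated with a minimal sketch. It is written in Python on purpose, to show a second tool reading the data with nothing but its standard library. The file contents, column names, and the `read_table` helper are invented for illustration; they stand in for whatever a plain-text export (for example, from R's `write.csv`) would contain.

```python
import csv
import io

# Hypothetical contents of a data set exported as plain text (CSV).
# Column names and values are invented for this example.
EXPORTED = """id,dose,response
1,0.5,12.1
2,1.0,15.4
3,2.0,19.8
"""

def read_table(text):
    """Parse CSV text into a list of dicts, converting fields to numbers."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "id": int(record["id"]),
            "dose": float(record["dose"]),
            "response": float(record["response"]),
        })
    return rows

rows = read_table(EXPORTED)
mean_response = sum(r["response"] for r in rows) / len(rows)
print(len(rows), "rows read; mean response:", round(mean_response, 2))
```

Because the file is line-oriented text, a version control system can also show exactly which rows changed between two revisions, which is not possible with an opaque binary image.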
#
On Tue, 11 Oct 2005, Alexander Ploner wrote:

            
You are intending to retain copies of the OS used and hardware too?
The results depend far more on those than you apparently realize.
I think you will find your OS changes as fast: all those security updates 
potentially affect your results.
Not binaries.  The intention is that source files be available, but they 
could become corrupted (as it seems the Windows binary has for a past 
version).
I would say not, as it is almost impossible to recover from any corruption 
in such a file.  We like to have long-term data in a human-readable 
printout, with a print copy, and also store some checksums.
You need to consider the medium on which you are going to store the 
archive.  We currently use CD-R (and not tapes as those are less 
compatible across drives -- we have two identical drives currently but do 
not expect either to last 10 years), and check them annually -- I guess we 
will re-write to another medium after much less than 10 years.
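
The practice of storing checksums and re-checking the archive periodically might look roughly like the sketch below. It is an illustration under stated assumptions, not the procedure actually used in the message above: SHA-256 is an arbitrary choice of checksum, and the manifest file name `MANIFEST.sha256` and the function names are hypothetical.

```python
import hashlib
import os

MANIFEST_NAME = "MANIFEST.sha256"  # hypothetical manifest file name

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(directory):
    """Record one 'digest  name' line per regular file in the directory."""
    names = sorted(
        n for n in os.listdir(directory)
        if n != MANIFEST_NAME and os.path.isfile(os.path.join(directory, n))
    )
    manifest_path = os.path.join(directory, MANIFEST_NAME)
    with open(manifest_path, "w", encoding="utf-8") as out:
        for name in names:
            digest = file_digest(os.path.join(directory, name))
            out.write(f"{digest}  {name}\n")
    return manifest_path

def verify_manifest(directory):
    """Return the names of files that are missing or no longer match."""
    failed = []
    manifest_path = os.path.join(directory, MANIFEST_NAME)
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            digest, name = line.rstrip("\n").split("  ", 1)
            path = os.path.join(directory, name)
            if not os.path.isfile(path) or file_digest(path) != digest:
                failed.append(name)
    return failed
```

Running `write_manifest` once when the archive is burned and `verify_manifest` at each annual check would flag corrupted files while a good copy may still exist elsewhere. The manifest itself is plain text, so it can also be printed and kept with the paper copy.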
#
On 10/11/05 6:54 AM, "Duncan Murdoch" <murdoch at stats.uwo.ca> wrote:

            
I would also consider a relational database (such as MySQL or PostgreSQL) for
your data warehousing.  These products (particularly PostgreSQL) are designed
with data integrity first and foremost.  Data formats can change over time,
but the data can be easily extracted from the database to match the needs at
hand.  Data generated at different times can be easily mined and combined as
needed.  The data backup process is fairly straightforward.  R already
integrates with several relational database systems, so an integrated
solution can be defined if one so desires.  Look at RMySQL, Rdbi, and
RdbiPgSQL for how to integrate R with MySQL and Postgres.

Sean
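
As a rough sketch of the "store in a database, extract as needed" idea: the example below uses Python's built-in `sqlite3` module purely as a stand-in, because it needs no server. It is not MySQL or PostgreSQL, and the table name, column names, and `export_csv` helper are hypothetical. It shows a constraint rejecting a bad row and a query result being dumped back to plain text.

```python
import csv
import io
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a
# server-based system such as MySQL or PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurement (
        id          INTEGER PRIMARY KEY,
        recorded_on TEXT NOT NULL,
        dose        REAL NOT NULL CHECK (dose >= 0),
        response    REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO measurement (recorded_on, dose, response) VALUES (?, ?, ?)",
    [("2004-03-01", 0.5, 12.1), ("2005-10-11", 1.0, 15.4)],
)
conn.commit()

# The CHECK constraint refuses a physically impossible negative dose.
try:
    conn.execute(
        "INSERT INTO measurement (recorded_on, dose, response) VALUES (?, ?, ?)",
        ("2005-10-12", -1.0, 9.9),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

def export_csv(connection, query):
    """Run a query and return its result as CSV text with a header row."""
    cursor = connection.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(col[0] for col in cursor.description)
    writer.writerows(cursor)
    return buffer.getvalue()

text = export_csv(
    conn,
    "SELECT recorded_on, dose, response FROM measurement ORDER BY recorded_on",
)
print("bad row rejected:", rejected)
print(text)
```

Exporting query results to text in this way also combines with the earlier advice in the thread: the database handles integrity and querying, while periodic text dumps remain readable without the database software.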
#
A general comment. 

As usual, Brian is right on target. Indeed, this has been written,
conferenced, agonized, kvetched, etc. about extensively in the computer
science community (and no doubt, among many others ... like accountants). I
seem to remember reading a Scientific American Magazine article (or was it
Science) about 10-15 years ago. As Brian says, it's not only application
versions, applications, OSes -- but even hardware that goes obsolete. Do you
have any data on 5 1/4" floppies from applications written for CP/M running
on an Intel 8080? Think of poor banks, drug companies -- or the census
bureau -- who have to keep their data forever. I sometimes wonder if all
these bits and bytes will fill up all the earth's storage eventually? :-)

Anyway, you might try researching this in the CS literature to see what the
strategy du jour is for this.

Cheers,

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
#
Berton Gunter wrote:
Now that journals are becoming electronic, librarians are also very 
concerned with this problem, and they tend to have very long term 
storage goals.

Duncan Murdoch