Risk of readRDS() not detecting race conditions with parallel saveRDS()?

5 messages · Henrik Bengtsson, Simon Urbanek, William Dunlap

#
I hardly know anything about the format used in (non-compressed)
serialization/RDS, but hoping someone with more knowledge could give
me some feedback;

Consider two R processes running in parallel on the same unknown file
system.  Both of them write and read to the same RDS file foo.rds
(without compression) at random times using saveRDS(object,
file="foo.rds", compress=FALSE) and object2 <-
readRDS(file="foo.rds").  This happens frequently enough such that
there is a risk for the two processes to write to the same "foo.rds"
file at the same time (here one needs to acknowledge that file updates
are not atomic nor instant).

To simulate the event that two processes write to the same file at
the same time (and non-atomically), resulting in an interweaved/appended
"foo.rds" file, I manually corrupted "foo.rds" by
inserting/dropping/replacing a single random byte.  It appears that
readRDS() will detect this simple event, by throwing an error on
"unknown input format", which is what I want.  My question is now, is
it reasonable to assume that if two or more processes happen to write
to the same RDS file at the same time, it is extremely unlikely (*)
that they would generate a file that would pass as valid by readRDS()?
 (*) extremely unlikely = if all of us ran this toy example, we
would not end up with an undetected but still corrupt "foo.rds" file
in, say, 10000 years.
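
The experiment described above can be reproduced roughly as follows. This is a minimal sketch, not the exact test that was run: the temporary file path, the choice of the first byte as the one to damage, and the replacement value 0x00 are all assumptions made for illustration.

```r
# Write an uncompressed RDS file to a temporary location (assumed path)
f <- tempfile(fileext = ".rds")
saveRDS(1:10, file = f, compress = FALSE)

# Read the raw bytes, overwrite the first header byte, and write them back
bytes <- readBin(f, what = "raw", n = file.size(f))
bytes[1] <- as.raw(0x00)
writeBin(bytes, f)

# readRDS() now fails; capture the error instead of stopping the script
res <- try(readRDS(f), silent = TRUE)
detected <- inherits(res, "try-error")
print(detected)
```

Note that this only damages the format marker at the very start of the file; it says nothing about bytes further in.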

Background: The R.cache package allows memoization (caching of
results) to file such that the cache is persistent across R sessions.
The persistent part is achieved by writing cache files to the same
file directory.  This is safe when you run a single process, and even
if readRDS() would fail to read a cache file it is no big deal; the
memoization will just fail and the results will be recalculated and be
resaved.  The question is what happens if you run this in parallel
and push it to the extreme; is there a risk that the memoization will
return normally but with invalid results?  I prefer not having to
synchronize this with a mutex/semaphore/common server, but instead
rely on this try-and-see approach (cf. the Ethernet protocol on shared
medium).  My guess (and hope) is that the risk is extremely unlikely
(*), but I'd like to hear if someone else thinks otherwise.
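
The try-and-see pattern described above might look roughly like this. It is only a sketch and not the R.cache implementation: the helper name memoize_rds(), its arguments, and the list wrapper around the cached value are hypothetical. It treats any readRDS() error as a cache miss, so it protects only against corruption that readRDS() actually reports.

```r
# Hypothetical helper: return the cached value if it can be read, otherwise
# recompute it with 'compute' (a zero-argument function) and save it again.
memoize_rds <- function(cache_file, compute) {
  if (file.exists(cache_file)) {
    cached <- tryCatch(readRDS(cache_file), error = function(e) NULL)
    if (!is.null(cached)) return(cached$value)
  }
  value <- compute()
  # Wrap the value in a list so that a legitimately NULL result is not
  # mistaken for a failed read
  saveRDS(list(value = value), file = cache_file, compress = FALSE)
  value
}

cache_file <- tempfile(fileext = ".rds")
a <- memoize_rds(cache_file, function() sum(1:100))          # computed and saved
b <- memoize_rds(cache_file, function() stop("not called"))  # read from the cache
```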

Thanks,

Henrik
#
On Sep 15, 2012, at 1:21 PM, Henrik Bengtsson wrote:

            
It's actually very probable that it will go undetected. In fact, the probability is very high if you have large vectors, because you can corrupt almost the entire file and there will be no sign of corruption: there is no checksum, so you can change the whole vector payload without any consequence. Just try saveRDS(rep(0L,100), "foo.rds", compress=T) and you can mess with anything after byte 21 and it will result in no error.
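
A rough way to see this for the uncompressed case from the original question is sketched below. The temporary path and the choice of the final byte are assumptions for illustration; the final byte is used because the exact header length can differ between serialization versions, while the integer data sits at the end of the file for a plain vector like this one.

```r
f <- tempfile(fileext = ".rds")
x <- rep(0L, 100)
saveRDS(x, file = f, compress = FALSE)

bytes <- readBin(f, what = "raw", n = file.size(f))
# Overwrite the last byte of the file, which belongs to the vector payload
bytes[length(bytes)] <- as.raw(0x01)
writeBin(bytes, f)

y <- readRDS(f)              # completes without an error
changed <- !identical(x, y)  # the object read back differs from the original
print(changed)
```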

Cheers,
S
#
Why not write the RDS file more atomically - write it to a
temporary file and rename that file to its final name when
it is completely written?  E.g.,

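saveRDS.atomically <- function(object, file, ...)
{
    # Write to a uniquely named temporary file in the target directory,
    # so that the final rename stays on the same file system.
    tfile <- tempfile(basename(file), dirname(file))
    # Clean up the temporary file if anything fails before the rename.
    on.exit(if (file.exists(tfile)) unlink(tfile))
    retval <- saveRDS(object, tfile, ...)
    if (!file.rename(tfile, file)) { # perhaps want an if(file.exists(file))unlink(file) first
        stop("Cannot rename temporary file ", tfile, " to ",
            file)
    }
    invisible(retval)
}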

(The file.rename may be tripped up by an overeager virus checker looking
at the newly created tfile.  I don't know the best way to deal with that.)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
On Sat, Sep 15, 2012 at 12:17 PM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
Wow, I guess my "random" tests were modifying the header, because
your example clearly shows that readRDS() is not detecting "mutations"
of the data section itself.  This is exactly the type of feedback I
was looking for.

I guess I should enhance my cache file format with checksums.
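
One way such a checksum could be added with base R alone is sketched below. This is not the R.cache file format: the function names md5_of_raw(), save_checked() and read_checked(), and the layout (a 32-character MD5 hex digest followed by the serialized bytes), are hypothetical. Since tools::md5sum() works on files, the sketch hashes the payload through a temporary file; a package such as digest could hash the raw vector directly instead.

```r
# Hypothetical helper: MD5 hex digest of a raw vector, returned as 32 raw bytes
md5_of_raw <- function(bytes) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(bytes, tmp)
  charToRaw(unname(tools::md5sum(tmp)))
}

# Hypothetical writer: digest first, then the serialized object
save_checked <- function(object, file) {
  payload <- serialize(object, connection = NULL)
  writeBin(c(md5_of_raw(payload), payload), file)
  invisible(NULL)
}

# Hypothetical reader: refuse to unserialize if the digest does not match
read_checked <- function(file) {
  bytes <- readBin(file, what = "raw", n = file.size(file))
  if (length(bytes) <= 32L) stop("cache file too short: ", file)
  stored <- bytes[1:32]
  payload <- bytes[-(1:32)]
  if (!identical(stored, md5_of_raw(payload))) {
    stop("checksum mismatch in cache file: ", file)
  }
  unserialize(payload)
}

f <- tempfile()
save_checked(rep(0L, 100), f)
ok <- identical(read_checked(f), rep(0L, 100))
```

MD5 is used here only to catch accidental corruption; it offers no protection against deliberate tampering.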

Thanks Simon

/Henrik
#
Hi Bill,

yes, emulating atomic writing by writing to a temporary file and then
renaming definitely lowers the risk of corruption.  I actually take
a similar approach in the Aroma Project (aroma.affymetrix et al.),
R.utils::saveObject(), R.utils::downloadFile() and more, and it
provides great protection against user interrupts, power failures
and so on.  I've been considering adding it to R.cache as well.

However, I'm not sure that it is guaranteed to be truly atomic.  I'm
saying this because ~8 years ago I was running batch jobs on 50 computers
on a shared file system.  Each R process was looking for a remaining
"job" directory (=one job) and if found, it renamed/moved it
immediately so no other process would find/grab the same job.
However, it turned out that occasionally two separate R processes (on
different machines) could grab and move that same directory at the
"same" time (holding on to the same file target), proceed with the
analysis and write the results to file (which then would contain
interweaved results from the two parallel runs).  From that I learned
that on certain NFS file systems, it can take up to 30 seconds(!)
before file updates are seen by all computers.  Of course, what you're
proposing is somewhat different - first creating a unique temporary
file for each process which is then renamed to a common file.  The
question is how this is affected by above file system delays etc.

So to summarize my strategy, I'd like to add all possible layers of
protection (that are not too expensive) against race conditions in
order to minimize any risks for errors and if errors still occur I'd
like to be able to detect them, and all this without assuming too much
about the file systems.  It's only as a last resort I want to turn to
coordinated approaches via a main server (mutex handler; TCP/IP is
guaranteed to be truly atomic everywhere) ...and I don't want to reinvent
cluster OSes.

Thanks for your feedback

/Henrik
On Sat, Sep 15, 2012 at 12:44 PM, William Dunlap <wdunlap at tibco.com> wrote: