
Comments requested on "changedFiles" function

22 messages · Karl Millar, Scott Kostyshak, Duncan Murdoch +2 more

#
In a number of places internal to R, we need to know which files have 
changed (e.g. after building a vignette).  I've just written a general 
purpose function "changedFiles" that I'll probably commit to R-devel.  
Comments on the design (or bug reports) would be appreciated.

The source for the function and the Rd page for it are inline below.

----- changedFiles.R:
changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
                         file.info = NULL, md5sum = FALSE,
                         full.names = FALSE, ...) {
     dosnapshot <- function(args) {
         fullnames <- do.call(list.files, c(full.names = TRUE, args))
         names <- do.call(list.files, c(full.names = full.names, args))
         if (isTRUE(file.info) ||
             (is.character(file.info) && length(file.info))) {
             info <- file.info(fullnames)
             rownames(info) <- names
             if (isTRUE(file.info))
                 file.info <- c("size", "isdir", "mode", "mtime")
         } else
             info <- data.frame(row.names = names)
         if (md5sum)
             info <- data.frame(info, md5sum = tools::md5sum(fullnames))
         list(info = info, timestamp = timestamp, file.info = file.info,
              md5sum = md5sum, full.names = full.names, args = args)
     }
     if (missing(snapshot) || !inherits(snapshot, "changedFilesSnapshot")) {
         if (length(timestamp) == 1)
             file.create(timestamp)
         if (missing(snapshot)) snapshot <- "."
         pre <- dosnapshot(list(path = snapshot, ...))
         pre$pre <- pre$info
         pre$info <- NULL
         pre$wd <- getwd()
         class(pre) <- "changedFilesSnapshot"
         return(pre)
     }

     if (missing(timestamp)) timestamp <- snapshot$timestamp
     if (missing(file.info) || isTRUE(file.info))
         file.info <- snapshot$file.info
     if (identical(file.info, FALSE)) file.info <- NULL
     if (missing(md5sum))    md5sum <- snapshot$md5sum
     if (missing(full.names)) full.names <- snapshot$full.names

     pre <- snapshot$pre
     savewd <- getwd()
     on.exit(setwd(savewd))
     setwd(snapshot$wd)

     args <- snapshot$args
     newargs <- list(...)
     args[names(newargs)] <- newargs
     post <- dosnapshot(args)$info
     prenames <- rownames(pre)
     postnames <- rownames(post)

     added <- setdiff(postnames, prenames)
     deleted <- setdiff(prenames, postnames)
     common <- intersect(prenames, postnames)

     if (length(file.info)) {
         preinfo <- pre[common, file.info]
         postinfo <- post[common, file.info]
         changes <- preinfo != postinfo
     }
     else changes <- matrix(logical(0), nrow = length(common), ncol = 0,
                            dimnames = list(common, character(0)))
     if (length(timestamp))
         changes <- cbind(changes,
                          Newer = file_test("-nt", common, timestamp))
     if (md5sum) {
         premd5 <- pre[common, "md5sum"]
         postmd5 <- post[common, "md5sum"]
         changes <- cbind(changes, md5sum = premd5 != postmd5)
     }
     changes1 <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop = FALSE]
     changed <- rownames(changes1)
     structure(list(added = added, deleted = deleted, changed = changed,
                    unchanged = setdiff(common, changed), changes = changes),
               class = "changedFiles")
}

print.changedFilesSnapshot <- function(x, ...) {
     cat("changedFiles snapshot:\n timestamp = \"", x$timestamp,
         "\"\n file.info = ",
         if (length(x$file.info))
             paste(paste0('"', x$file.info, '"'), collapse = ","),
         "\n md5sum = ", x$md5sum,
         "\n args = ", deparse(x$args, control = NULL), "\n", sep = "")
     # Return invisibly, as print methods conventionally do.
     invisible(x)
}

print.changedFiles <- function(x, ...) {
     if (length(x$added))
         cat("Files added:\n", paste0("  ", x$added, collapse = "\n"),
             "\n", sep = "")
     if (length(x$deleted))
         cat("Files deleted:\n", paste0("  ", x$deleted, collapse = "\n"),
             "\n", sep = "")
     changes <- x$changes
     changes <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop=FALSE]
     changes <- changes[, colSums(changes, na.rm = TRUE) > 0, drop=FALSE]
     if (nrow(changes)) {
         cat("Files changed:\n")
         print(changes)
     }
     # Return invisibly, as print methods conventionally do.
     invisible(x)
}
----------------------

--- changedFiles.Rd:
\name{changedFiles}
\alias{changedFiles}
\alias{print.changedFiles}
\alias{print.changedFilesSnapshot}
\title{
Detect which files have changed
}
\description{
On the first call, \code{changedFiles} takes a snapshot of a selection
of files.  In subsequent calls, it takes another snapshot, and returns
an object containing data on the differences between the two snapshots.
The snapshots need not be of the same directory; this could be used to
compare two directories.
}
\usage{
changedFiles(snapshot, timestamp = tempfile("timestamp"), file.info = NULL,
              md5sum = FALSE, full.names = FALSE, ...)
}
\arguments{
   \item{snapshot}{
The path to record, or a previous snapshot.  See the Details.
}
   \item{timestamp}{
The name of a file to write at the time the initial snapshot
is taken.  In subsequent calls, modification times of files will be 
compared to
this file, and newer files will be reported as changed.  Set to \code{NULL}
to skip this test.
}
   \item{file.info}{
A vector of columns from the result of the \code{file.info} function, or 
a logical value.  If
\code{TRUE}, columns \code{c("size", "isdir", "mode", "mtime")} will be 
used.  Set to
\code{FALSE} or \code{NULL} to skip this test.  See the Details.
}
   \item{md5sum}{
A logical value indicating whether MD5 summaries should be taken as part 
of the snapshot.
}
   \item{full.names}{
A logical value indicating whether full names (as in 
\code{\link{list.files}}) should be
recorded.
}
   \item{\dots}{
Additional parameters to pass to \code{\link{list.files}} to control the 
set of files
in the snapshots.
}
}
\details{
This function works in two modes.  If the \code{snapshot} argument is
missing or is not of S3 class \code{"changedFilesSnapshot"}, it is used
as the \code{path} argument to \code{\link{list.files}} to obtain a list
of files.  If it is of class \code{"changedFilesSnapshot"}, then it is
taken to be the baseline snapshot, and a new snapshot is taken and
compared with it.  In the latter case, missing arguments default to
match those from the initial snapshot.

If the \code{timestamp} argument is length 1, a file with that name is 
created
in the current directory during the initial snapshot, and 
\code{\link{file_test}}
is used to compare the age of all files to it during subsequent calls.

If the \code{file.info} argument is \code{TRUE} or it contains a non-empty
character vector, the indicated columns from the result of a call to
\code{\link{file.info}} will be recorded and compared.

If \code{md5sum} is \code{TRUE}, the \code{tools::\link{md5sum}} function
will be called to record the 32-character hexadecimal MD5 checksum for
each file, and these values will be compared.
}
\value{
In the initial snapshot phase, an object of class 
\code{"changedFilesSnapshot"} is returned.  This
is a list containing the fields
\item{pre}{a dataframe whose rownames are the filenames, and whose 
columns contain the
requested snapshot data}
\item{timestamp, file.info, md5sum, full.names}{a record of the 
arguments in the initial call}
\item{args}{other arguments passed via \code{...} to 
\code{\link{list.files}}.}

In the comparison phase, an object of class \code{"changedFiles"} is
returned.  This is a list containing
\item{added, deleted, changed, unchanged}{character vectors of filenames 
from the before
and after snapshots, with obvious meanings}
\item{changes}{a logical matrix with a row for each common file, and a 
column for each
comparison test.  \code{TRUE} indicates a change in that test.}

\code{\link{print}} methods are defined for each of these types. The
\code{\link{print}} method for \code{"changedFilesSnapshot"} objects
displays the arguments used to produce it, while the one for
\code{"changedFiles"} displays the \code{added}, \code{deleted}
and \code{changed} fields if non-empty, and a submatrix of the 
\code{changes}
matrix containing all of the \code{TRUE} values.
}
\author{
Duncan Murdoch
}
\seealso{
\code{\link{file.info}}, \code{\link{file_test}}, \code{\link{md5sum}}.
}
\examples{
# Create some files in a temporary directory
dir <- tempfile()
dir.create(dir)
writeBin(1, file.path(dir, "file1"))
writeBin(2, file.path(dir, "file2"))
dir.create(file.path(dir, "dir"))

# Take a snapshot
snapshot <- changedFiles(dir, file.info=TRUE, md5sum=TRUE)

# Change one of the files
writeBin(3, file.path(dir, "file2"))

# Display the detected changes
changedFiles(snapshot)
changedFiles(snapshot)$changes
}
\keyword{utilities}
\keyword{file}
#
On 13-09-04 8:02 PM, Karl Millar wrote:
Yes, that's another possibility.  Some more comments below...


  In addition, the 'timestamp' functionality
You can do that, using file.info = "mtime", but the file.info snapshots 
are quite a bit slower than using the timestamp file (when looking at a 
big recursive directory of files).

I don't want to add too many new functions.  The general R style is to 
have functions that do a lot, rather than have a lot of different 
functions to achieve different parts of related tasks.  This is better 
for interactive use (fewer functions to remember, a simpler help system 
to navigate), though it probably results in less readable code.

I can see an argument for two functions (a get and a compare), but I 
don't think there are many cases where doing two gets and comparing the 
snapshots would be worth the extra runtime.  (It's extra because 
file.info is only a little faster than list.files, and it would be 
unavoidable to call both twice in that version.  Using the timestamp 
file avoids one of those calls, and replaces the other with file_test, 
which takes a similar amount of time.  So overall it's about 20-25% 
faster.)  It also makes the code a bit more complicated, i.e. three 
calls (get, get, compare) instead of two (get, compare).

Thanks for your comments.

Duncan Murdoch
#
On Wed, Sep 4, 2013 at 1:53 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
This looks like a useful function. Thanks for writing it. I have only
one (picky) comment below.
Should a different name than 'dir' be used since 'dir' is a base function?

Further, if someone is not very familiar with R (or just not in "R
mode" at the time of reading), they might think that 'dir.create' is
calling the create member of the object named 'dir' that you just
made.

Scott
--
Scott Kostyshak
Economics PhD Candidate
Princeton University
#
On 13-09-04 11:36 PM, Scott Kostyshak wrote:
Such as?

dir.create is an existing function.  I wouldn't have named it that, but 
that's its name.

Duncan Murdoch
#
Dear Duncan,

This certainly looks useful. Might you consider adding the ability to 
supply an alternative digest function? Details below.

I often use a homemade "make" type function which starts by looking at 
modification times e.g. in a private package

https://github.com/jefferis/nat.utils/blob/master/R/make.r

For some of my work, I use hash functions. However because I typically 
work with many large files I often use a special digest process e.g. 
using the crc checksum embedded in a gzip file directly or hashing only 
the part of a large file that is (almost) certain to change.

Perhaps (code unchecked) along the lines of:

changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
	file.info = NULL, digest = FALSE, digestfun = NULL,
	full.names = FALSE, ...)

if(digest){
	if(is.null(digestfun)) digestfun=tools::md5sum
	else digestfun=match.fun(digestfun)
	info <- data.frame(info, digest = digestfun(fullnames))
}

etc

OR alternatively using only one argument:

changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
	file.info = NULL, digest = FALSE, full.names = FALSE, ...)

if(is.logical(digest)){
	if(digest) digestfun=tools::md5sum
} else {
	# Assume that digest specifies a function that we want to use
	digestfun=match.fun(digest)
	digest=TRUE
}

if(digest)
	info <- data.frame(info, digest = digestfun(fullnames))

etc
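
The single-argument variant above could be sketched as a small standalone helper. This is only an illustration: the names resolveDigest and headSignature are hypothetical and appear nowhere in the posted code, and headSignature is a cheap stand-in for the kind of partial-file digest described above, not a real hash.

```r
# Hypothetical helper: turn a `digest` argument (logical or function)
# into a checksum function, or NULL when no digest is requested.
resolveDigest <- function(digest = FALSE) {
    if (is.logical(digest)) {
        if (isTRUE(digest)) tools::md5sum else NULL
    } else {
        match.fun(digest)
    }
}

# Hypothetical cheap "digest": the first n bytes of each file, as hex.
headSignature <- function(files, n = 16L) {
    vapply(files, function(f) {
        bytes <- readBin(f, what = "raw", n = n)
        paste(as.character(bytes), collapse = "")
    }, character(1))
}

f <- tempfile("digest-demo")
writeBin(charToRaw("hello"), f)

default_fun <- resolveDigest(TRUE)           # falls back to tools::md5sum
custom_fun  <- resolveDigest(headSignature)  # user-supplied function
md5_value   <- unname(default_fun(f))
head_value  <- unname(custom_fun(f))
```

Keeping the dispatch in one place means the snapshot code only ever sees "a function or NULL", whichever way the caller spelled the argument.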

Many thanks,

Greg.
--
Gregory Jefferis, PhD                   Tel: 01223 267048
Division of Neurobiology
MRC Laboratory of Molecular Biology
Francis Crick Avenue
Cambridge Biomedical Campus
Cambridge, CB2 OQH, UK

http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis
http://jefferislab.org
http://flybrain.stanford.edu
#
On 05/09/2013 12:32 PM, Dr Gregory Jefferis wrote:
Thanks, that's a good idea.

Duncan Murdoch
#
FYI I implemented that approach in testthat:
https://github.com/hadley/testthat/blob/master/R/watcher.r - it's a
bit more general, because it just sits in the background and listens
for changes, dispatching to a callback.

Hadley
#
On Thu, Sep 5, 2013 at 6:48 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
'dir_', 'dir1', 'temp_dir', none of which is a base function. I
thought that it was not recommended to create objects with the same
name as functions, but perhaps this recommended practice is not agreed
on.

I meant that if the object is called, e.g. 'temp_dir', one will not
think that 'dir.create' is a call to the 'create' member of 'dir'
because there is no 'dir' object apart from the base function. But
anyone with experience in R would know that this is not how R parses
'dir.create'.

In any case, I shouldn't waste your time on such a minor and subjective thing.

Scott
#
Comments inline:
On Wed, Sep 4, 2013 at 6:10 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
Sorry, I completely failed to explain what I was thinking here.  There
are a number of issues here, but the biggest one is that you're
implicitly assuming that files that get modified will have mtimes that
come after the timestamp file was created.  This isn't always true,
with the most notable exception being if you download a package from
CRAN and untar it, the mtimes are usually well in the past (at least
with GNU tar on a linux system), so this code won't notice that the
files have changed.
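
A minimal sketch of that failure mode, assuming a scratch directory and using Sys.setFileTime to imitate an archive that restores old modification times (the file names and dates are made up for illustration):

```r
d <- tempfile("mtime-demo")
dir.create(d)
stamp  <- file.path(d, "timestamp")
target <- file.path(d, "DESCRIPTION")

writeLines("Version: 1.0", target)
file.create(stamp)
# Pin the timestamp file to a known time so the demo is deterministic.
now <- as.POSIXct("2013-09-06 12:00:00", tz = "UTC")
Sys.setFileTime(stamp, now)
md5_before <- tools::md5sum(target)

# Imitate unpacking an archive: new contents, but an mtime a year in the past.
writeLines("Version: 2.0", target)
Sys.setFileTime(target, now - 365 * 24 * 3600)

# The timestamp test misses the change; a content check catches it.
newer_than_stamp <- utils::file_test("-nt", target, stamp)
content_changed  <- unname(md5_before != tools::md5sum(target))
```

Here newer_than_stamp is FALSE even though the contents differ, which is the false negative described above.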

It may be a good idea to store the file sizes as well, which would
help prevent false negatives in the (rare IIRC) cases where the
contents have changed but the mtime values have not.  Since you
already need to call file.info() to get the mtime, this shouldn't
increase the runtime, and the extra memory needed is fairly modest.

This is somewhat more nuanced and not particular to interactive use
IMHO.  Having functions that do a lot is good, _as long as the
semantics are always consistent_.  For example, lm() does a huge
amount and has a wide variety of ways that you can specify your data,
but it basically does the same thing no matter how you use it.  On the
other hand, if you have a function that does different things
depending on how you call it (e.g. reshape()) then it's easy to
remember the function name, but much harder to remember how to call it
correctly, harder to understand the documentation and less readable.

I think a 'snapshotDirectory' and 'compareDirectoryToSnapshot'
combination might work well.

Thanks,

Karl
#
On 13-09-06 2:46 AM, Karl Millar wrote:
If we need to use file.info(), then I store the complete result, so I 
have size if I have mtime.

I have split it into two functions.  The compare function has two 
snapshot arguments, but if only the "before" is given, it will compute 
the "after" from the current file system.  This makes a cleaner design, 
thanks for the suggestion.

About the function names:  selection of files for the snapshot is done 
by list.files, and that function's "path" argument can be a vector, so 
multiple directories can be recorded at once.  I've chosen 
"fileSnapshot" and "changedFiles" so far, but those aren't perfect.

I need to do a little more cleanup and testing, then I'll put the new 
version online somewhere.

Duncan Murdoch
#
I have now put the code into a temporary package for testing; if anyone 
is interested, for a few days it will be downloadable from

fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz

This uses two functions:

fileSnapshot -- takes a snapshot
changedFiles -- compares two snapshots, or one snapshot to the current 
file system
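
For orientation, here is a hedged usage sketch of the two-function design. It assumes the interface as it later shipped in the utils package (fileSnapshot and changedFiles in R 3.1.0 and later); the temporary test package mentioned above may have differed in details.

```r
d <- tempfile("snapshot-demo")
dir.create(d)
writeLines("one", file.path(d, "file1"))
writeLines("two", file.path(d, "file2"))

# Take the "before" snapshot, including MD5 checksums.
before <- utils::fileSnapshot(d, md5sum = TRUE)

# Modify one file and add another.
writeLines("two, edited", file.path(d, "file2"))
writeLines("three", file.path(d, "file3"))

# With only "before" supplied, the "after" state is read from the file system.
delta <- utils::changedFiles(before)
delta$added
delta$changed
```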


Duncan Murdoch
#
On 06/09/2013 2:20 PM, Duncan Murdoch wrote:
Sorry, error in the URL.  It should be

http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz

(This time I tested it!  Thanks Scott for the heads-up.)

Duncan Murdoch
#
On Fri, Sep 6, 2013 at 3:46 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
Works well. A couple of things I noticed:

(1)
md5sum is being called on directories, which causes warnings. (If this
is not viewed as undesirable, please ignore the rest of this comment.)
Should this be the responsibility of the user (by passing arguments to
list.files)? In the example, changing
fileSnapshot(dir, file.info=TRUE, md5sum=TRUE)
to
fileSnapshot(dir, file.info=TRUE, md5sum=TRUE, include.dirs=FALSE,
recursive=TRUE)

gets rid of the warnings. But perhaps the user just wants to exclude
directories for the md5sum calculations. This can't be controlled from
fileSnapshot.

Or, should the "if (md5sum)" chunk subset "fullnames" using file_test
or file.info to exclude directories (and then fill in the directories
with NA)?

(2)
If I run example(changedFiles) several times, sometimes I get:

chngdF> changedFiles(snapshot)
File changes:
      mtime md5sum
file2  TRUE   TRUE

and other times I get:

chngdF> changedFiles(snapshot)
File changes:
      md5sum
file2   TRUE

I wonder why.

Scott
R Under development (unstable) (2013-08-31 r63780)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] testpkg_1.0

loaded via a namespace (and not attached):
[1] tools_3.1.0
#
On Fri, Sep 6, 2013 at 7:40 PM, Scott Kostyshak <skostysh at princeton.edu> wrote:
Putting the following in-between snapshot and writeBin in the example
leads to consistent output:

# allow for mtime to change
Sys.sleep(.1)

Scott
#
On 13-09-06 7:40 PM, Scott Kostyshak wrote:
I don't see the warnings, I just get NA values.  I'll try to see why 
there's a difference.  (One possibility is my platform (Windows); 
another is that I'm generally testing in R-patched and R-devel rather 
than the 3.0.1 release version.)  I would rather suppress the warnings 
than make the user avoid them.

Sometimes the example runs so quickly that the new version has exactly 
the same modification time as the original.  That's the risk of the 
mtime check.  If you put a delay between, you'll get consistent results.
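
The effect can be reproduced deterministically, without relying on timing. In this sketch Sys.setFileTime pins both versions of the file to the same (arbitrary, made-up) modification time, standing in for a rewrite that happens within the clock resolution:

```r
f <- tempfile("same-mtime")
pinned <- as.POSIXct("2013-09-06 19:40:00", tz = "UTC")

writeLines("first version", f)
Sys.setFileTime(f, pinned)
old_info <- file.info(f)[, c("size", "mtime")]
old_md5  <- tools::md5sum(f)

# Rewrite with different content of the same length, then restore the
# same mtime to imitate a rewrite within one clock tick.
writeLines("other version", f)
Sys.setFileTime(f, pinned)
new_info <- file.info(f)[, c("size", "mtime")]

same_meta   <- old_info$size == new_info$size &&
               old_info$mtime == new_info$mtime
md5_changed <- unname(old_md5 != tools::md5sum(f))
```

Size and mtime agree, so only the checksum column reports the change, which matches the second output shown earlier in the thread.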

Duncan Murdoch
#
Hi Duncan,

I like the interface of this version a lot better, but there's still a
bunch of implementation details that need fixing:

* As previously mentioned, there are important cases where the mtime
values change in ways that this code doesn't detect.
* If the timestamp file (which is usually in the temp directory) gets
deleted (which can happen after a moderate amount of time of
inactivity on some systems), then the file_test('-nt', ...) will
always return false, even if the file has changed.
* If files get added or deleted between the two calls to list.files in
fileSnapshot, it will fail with an error.
* If the path is on a remote file system, tempdir is local, and
there's significant clock skew, then you can get incorrect results.

Unfortunately, these aren't just theoretical scenarios -- I've had the
misfortune to run up against all of them in the past.

I've attached code that's loosely based on your implementation that
solves these problems AFAICT.  Alternatively, Hadley's code handles
all of these correctly, with the exception that compare_state doesn't
handle the case where safe_digest returns NA very well.

Regards,

Karl
#
On 13-09-06 9:21 PM, Karl Millar wrote:
If that happened without user intervention, I think it would break other 
things in R -- the temp directory is supposed to last for the whole 
session.  But I should be checking anyway.

Yours won't work if path contains more than one directory.  This is 
probably a reasonable restriction, but it's inconsistent with 
list.files, so I'd like to avoid it if I can find a way.

Duncan Murdoch
#
On Fri, Sep 6, 2013 at 7:03 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
Yes, it does break other things in R -- my experience has been that
the help system seems to be the one that is impacted the most by this.
 FWIW, I've never seen the entire R temp directory deleted, just
individual files and subdirectories in it, but even that probably
depends on how the machine is configured.  I suspect only a few users
ever notice this, but my R use is probably somewhat anomalous and I
think it only happens to R sessions that I haven't used for a few
days.

I'm currently unsure what the behaviour when comparing snapshots with
multiple directories should be.

Presumably we should have the property that (horribly abusing notation
for succinctness):
      compareSnapshots(c(a1, a2),  c(a1, a2))
is the same as concatenating (in some form)
      compareSnapshots(a1, a1) and compareSnapshots(a2, a2)
and there's a bunch of ways we could concatenate -- we could return a
list of results, or a single result where each of the 'added, deleted,
modified' fields are a list, or where we concatenate the 'added,
deleted, modified' fields together into three simple vectors.
Concatenating the vectors together like this is appealing, but unless
you're using the full names, it doesn't include the information of
which directory the changes are in, and using the full names doesn't
work in the case where you're comparing different sets of directories,
e.g. compareSnapshots(c(a1, a2), c(b1, b2)), where there is no
sensible choice for a full name.  The list options don't have this
problem, but are harder to work with, particularly for the common case
where there's only a single directory.  You'd also have to be somewhat
careful with filenames that occur in both directories.

Maybe I'm just being dense, but I don't see a way to do this that's
clear, easy to use and wouldn't confuse users at the moment.

Karl
#
On 13-09-06 11:07 PM, Karl Millar wrote:
I use Windows and never see this; deleting temp files is up to me, not 
to the system.  But my understanding was that *nix systems should only 
clean up /tmp on restart, and I don't think an R session will survive a 
restart.

However, you have convinced me that the use of the timestamp file is not 
beneficial enough to be the default.  I'll leave it as an option, but 
add warnings that it might be unreliable.

The way I've done this is to require full.names when multiple dirs are 
on the path.  I've reduced it to one list.files() call per dir, by 
iterating over the path variable and using your approach of calling it 
with full.names = FALSE, then adding the dir if necessary.
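
One way to read that description is the sketch below. The helper name listOnce and its return shape are hypothetical, chosen only to show the idea of a single list.files() call per directory with the full path derived afterwards, so the short and full names cannot get out of step.

```r
# Hypothetical helper: one list.files() call per directory.
listOnce <- function(path, full.names = length(path) > 1, ...) {
    if (length(path) > 1 && !full.names)
        stop("full.names must be TRUE when 'path' has several directories")
    per_dir <- lapply(path, function(p) {
        short <- list.files(p, full.names = FALSE, ...)
        full  <- file.path(p, short)
        data.frame(name = if (full.names) full else short,
                   full = full,
                   stringsAsFactors = FALSE)
    })
    do.call(rbind, per_dir)
}

d1 <- tempfile("dir1"); dir.create(d1)
d2 <- tempfile("dir2"); dir.create(d2)
file.create(file.path(d1, "a"), file.path(d2, "a"))

both   <- listOnce(c(d1, d2))   # names are full paths, so "a" is unambiguous
single <- listOnce(d1)          # names are relative to the directory
```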

I haven't adopted your change that forces comparison of only size and 
mtime from file.info.  I don't see a big cost in storing whatever 
file.info returns (which is system dependent; on Windows I don't see the 
user and group related columns; on Unix I don't see the exe column).
Users might want to detect changes to anything there, and I shouldn't 
make it harder for them.

I've also kept the special-casing of md5sum; it really needs to be 
wrapped in suppressWarnings() (on Unix only).  And I've kept the options 
to specify what changedFiles checks among the file.info columns; I can 
see that you might want a snapshot with everything, but sometimes only 
want to be told about changes in a subset of the attributes.
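
A hedged sketch of the directory-safe checksum discussed above. The helper md5NoDirs is hypothetical; it combines the two ideas raised in the thread (skip directories and fill them with NA, and silence warnings) rather than reproducing the committed code.

```r
# Hypothetical helper: checksum regular files only; directories and
# missing paths get NA instead of triggering a warning.
md5NoDirs <- function(paths) {
    isdir  <- file.info(paths)$isdir
    isfile <- !is.na(isdir) & !isdir
    out <- rep(NA_character_, length(paths))
    names(out) <- paths
    out[isfile] <- suppressWarnings(tools::md5sum(paths[isfile]))
    out
}

d <- tempfile("md5-demo")
dir.create(d)
f <- file.path(d, "file1")
writeLines("x", f)
sub <- file.path(d, "dir")
dir.create(sub)

res <- md5NoDirs(c(f, sub))   # checksum for the file, NA for the directory
```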

I've uploaded 
<http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.1.tar.gz> if 
anyone is interested.

Duncan Murdoch
#
On Sun, Sep 8, 2013 at 10:55 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
Works well.

Scott


#
Thanks for everyone's comments on this.  I have now committed a version 
to R-devel.  I don't plan to backport it to 3.0.2 (coming out in a 
couple of weeks), but it should appear in 3.1.0 in the spring, and it's 
conceivable it could make it into 3.0.3 (not yet scheduled).

Duncan Murdoch