In a number of places internal to R, we need to know which files have
changed (e.g. after building a vignette). I've just written a
general-purpose function "changedFiles" that I'll probably commit to R-devel.
Comments on the design (or bug reports) would be appreciated.
The source for the function and the Rd page for it are inline below.
----- changedFiles.R:
changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
                         file.info = NULL,
                         md5sum = FALSE, full.names = FALSE, ...) {
    # Helper: list the files and collect the requested per-file data
    dosnapshot <- function(args) {
        fullnames <- do.call(list.files, c(full.names = TRUE, args))
        names <- do.call(list.files, c(full.names = full.names, args))
        if (isTRUE(file.info) || (is.character(file.info) &&
                                  length(file.info))) {
            info <- file.info(fullnames)
            rownames(info) <- names
            if (isTRUE(file.info))
                file.info <- c("size", "isdir", "mode", "mtime")
        } else
            info <- data.frame(row.names=names)
        if (md5sum)
            info <- data.frame(info, md5sum = tools::md5sum(fullnames))
        list(info = info, timestamp = timestamp, file.info = file.info,
             md5sum = md5sum, full.names = full.names, args = args)
    }
    # First mode: take the initial snapshot and return it
    if (missing(snapshot) || !inherits(snapshot, "changedFilesSnapshot")) {
        if (length(timestamp) == 1)
            file.create(timestamp)
        if (missing(snapshot)) snapshot <- "."
        pre <- dosnapshot(list(path = snapshot, ...))
        pre$pre <- pre$info
        pre$info <- NULL
        pre$wd <- getwd()
        class(pre) <- "changedFilesSnapshot"
        return(pre)
    }
    # Second mode: missing arguments default to those of the snapshot
    if (missing(timestamp)) timestamp <- snapshot$timestamp
    if (missing(file.info) || isTRUE(file.info))
        file.info <- snapshot$file.info
    if (identical(file.info, FALSE)) file.info <- NULL
    if (missing(md5sum)) md5sum <- snapshot$md5sum
    if (missing(full.names)) full.names <- snapshot$full.names
    pre <- snapshot$pre
    # Retake the snapshot from the original working directory
    savewd <- getwd()
    on.exit(setwd(savewd))
    setwd(snapshot$wd)
    args <- snapshot$args
    newargs <- list(...)
    args[names(newargs)] <- newargs
    post <- dosnapshot(args)$info
    prenames <- rownames(pre)
    postnames <- rownames(post)
    added <- setdiff(postnames, prenames)
    deleted <- setdiff(prenames, postnames)
    common <- intersect(prenames, postnames)
    # One column of 'changes' per comparison test, one row per common file
    if (length(file.info)) {
        preinfo <- pre[common, file.info]
        postinfo <- post[common, file.info]
        changes <- preinfo != postinfo
    } else changes <- matrix(logical(0), nrow = length(common), ncol = 0,
                             dimnames = list(common, character(0)))
    if (length(timestamp))
        changes <- cbind(changes, Newer = file_test("-nt", common,
                                                    timestamp))
    if (md5sum) {
        premd5 <- pre[common, "md5sum"]
        postmd5 <- post[common, "md5sum"]
        changes <- cbind(changes, md5sum = premd5 != postmd5)
    }
    changes1 <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop = FALSE]
    changed <- rownames(changes1)
    structure(list(added = added, deleted = deleted, changed = changed,
                   unchanged = setdiff(common, changed), changes = changes),
              class = "changedFiles")
}
print.changedFilesSnapshot <- function(x, ...) {
    cat("changedFiles snapshot:\n timestamp = \"", x$timestamp,
        "\"\n file.info = ",
        if (length(x$file.info))
            paste(paste0('"', x$file.info, '"'), collapse=","),
        "\n md5sum = ", x$md5sum,
        "\n args = ", deparse(x$args, control = NULL), "\n", sep="")
    x
}
print.changedFiles <- function(x, ...) {
    if (length(x$added))
        cat("Files added:\n", paste0(" ", x$added, collapse="\n"),
            "\n", sep="")
    if (length(x$deleted))
        cat("Files deleted:\n", paste0(" ", x$deleted, collapse="\n"),
            "\n", sep="")
    changes <- x$changes
    changes <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop=FALSE]
    changes <- changes[, colSums(changes, na.rm = TRUE) > 0, drop=FALSE]
    if (nrow(changes)) {
        cat("Files changed:\n")
        print(changes)
    }
    x
}
----------------------
--- changedFiles.Rd:
\name{changedFiles}
\alias{changedFiles}
\alias{print.changedFiles}
\alias{print.changedFilesSnapshot}
\title{
Detect which files have changed
}
\description{
On the first call, \code{changedFiles} takes a snapshot of a selection
of files. In subsequent calls, it takes another snapshot, and returns
an object containing data on the differences between the two snapshots.
The snapshots need not be of the same directory; this could be used to
compare two directories.
}
\usage{
changedFiles(snapshot, timestamp = tempfile("timestamp"), file.info = NULL,
md5sum = FALSE, full.names = FALSE, ...)
}
\arguments{
\item{snapshot}{
The path to record, or a previous snapshot. See the Details.
}
\item{timestamp}{
The name of a file to write at the time the initial snapshot
is taken. In subsequent calls, modification times of files will be
compared to this file, and newer files will be reported as changed.
Set to \code{NULL} to skip this test.
}
\item{file.info}{
A vector of columns from the result of the \code{file.info} function,
or a logical value. If \code{TRUE}, columns
\code{c("size", "isdir", "mode", "mtime")} will be used. Set to
\code{FALSE} or \code{NULL} to skip this test. See the Details.
}
\item{md5sum}{
A logical value indicating whether MD5 summaries should be taken as part
of the snapshot.
}
\item{full.names}{
A logical value indicating whether full names (as in
\code{\link{list.files}}) should be recorded.
}
\item{\dots}{
Additional parameters to pass to \code{\link{list.files}} to control the
set of files in the snapshots.
}
}
\details{
This function works in two modes. If the \code{snapshot} argument is
missing or is not of S3 class \code{"changedFilesSnapshot"}, it is used
as the \code{path} argument to \code{\link{list.files}} to obtain a list
of files. If it is of class \code{"changedFilesSnapshot"}, then it is
taken to be the baseline snapshot, and a new snapshot is taken and
compared with it. In the latter case, missing arguments default to match
those from the initial snapshot.

If the \code{timestamp} argument is length 1, a file with that name is
created in the current directory during the initial snapshot, and
\code{\link{file_test}} is used to compare the age of all files to it
during subsequent calls.

If the \code{file.info} argument is \code{TRUE} or it contains a non-empty
character vector, the indicated columns from the result of a call to
\code{\link{file.info}} will be recorded and compared.

If \code{md5sum} is \code{TRUE}, the \code{tools::\link{md5sum}} function
will be called to record the 32-character hexadecimal MD5 checksum for
each file, and these values will be compared.
}
\value{
In the initial snapshot phase, an object of class
\code{"changedFilesSnapshot"} is returned. This is a list containing
the fields
\item{pre}{a data frame whose row names are the filenames, and whose
columns contain the requested snapshot data}
\item{timestamp, file.info, md5sum, full.names}{a record of the
arguments in the initial call}
\item{args}{other arguments passed via \code{...} to
\code{\link{list.files}}.}

In the comparison phase, an object of class \code{"changedFiles"} is
returned. This is a list containing
\item{added, deleted, changed, unchanged}{character vectors of filenames
from the before and after snapshots, with obvious meanings}
\item{changes}{a logical matrix with a row for each common file, and a
column for each comparison test. \code{TRUE} indicates a change in that
test.}

\code{\link{print}} methods are defined for each of these types. The
\code{\link{print}} method for \code{"changedFilesSnapshot"} objects
displays the arguments used to produce it, while the one for
\code{"changedFiles"} displays the \code{added}, \code{deleted}
and \code{changed} fields if non-empty, and a submatrix of the
\code{changes} matrix containing all of the \code{TRUE} values.
}
\author{
Duncan Murdoch
}
\seealso{
\code{\link{file.info}}, \code{\link{file_test}}, \code{\link{md5sum}}.
}
\examples{
# Create some files in a temporary directory
dir <- tempfile()
dir.create(dir)
writeBin(1, file.path(dir, "file1"))
writeBin(2, file.path(dir, "file2"))
dir.create(file.path(dir, "dir"))
# Take a snapshot
snapshot <- changedFiles(dir, file.info=TRUE, md5sum=TRUE)
# Change one of the files
writeBin(3, file.path(dir, "file2"))
# Display the detected changes
changedFiles(snapshot)
changedFiles(snapshot)$changes
}
\keyword{utilities}
\keyword{file}
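
The timestamp test described in the Details can be tried in isolation. The following is a minimal sketch, not part of the posted function: it assumes only base R plus utils::file_test, uses made-up file names, and sets modification times explicitly with Sys.setFileTime so that the outcome does not depend on the clock resolution of the file system.

```r
# Sketch of the timestamp test, outside of changedFiles().
# All file names below are illustrative assumptions.
workdir <- tempfile("snapshot-demo")
dir.create(workdir)

# A file last modified one minute ago
old <- file.path(workdir, "old.txt")
writeLines("unchanged", old)
Sys.setFileTime(old, Sys.time() - 60)

# The timestamp file, dated 30 seconds ago
stamp <- tempfile("timestamp")
file.create(stamp)
Sys.setFileTime(stamp, Sys.time() - 30)

# A file written "now", i.e. after the timestamp file
new <- file.path(workdir, "new.txt")
writeLines("modified", new)

# "-nt" is TRUE where the file is newer than the timestamp file
files <- list.files(workdir, full.names = TRUE)
newer <- utils::file_test("-nt", files, stamp)
names(newer) <- basename(files)
print(newer)
```

Only new.txt should be flagged as newer. Note that the test is given full paths here; relative names would be resolved against the current working directory.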
Comments requested on "changedFiles" function
On 13-09-04 8:02 PM, Karl Millar wrote:
> Hi Duncan,
>
> I think this functionality would be much easier to use and understand
> if you split up the functionality of taking snapshots and comparing
> them into separate functions.

Yes, that's another possibility. Some more comments below...

> In addition, the 'timestamp' functionality seems both confusing and
> brittle to me. I think it would be better to store file modification
> times in the snapshot and use those instead of an external file. Maybe:

You can do that, using file.info = "mtime", but the file.info snapshots
are quite a bit slower than using the timestamp file (when looking at a
big recursive directory of files).

> # Take a snapshot of the files.
> takeFileSnapshot(directory, file.info = TRUE, md5sum = FALSE,
>                  full.names = FALSE, recursive = TRUE, ...)
>
> # Take a snapshot using the same options as used for snapshot.
> retakeFileSnapshot(snapshot, directory = snapshot$directory) {
>     takeFileSnapshot(directory, file.info = snapshot$file.info,
>                      md5sum = snapshot$md5sum, etc)
> }
>
> compareFileSnapshots(snapshot1, snapshot2)
> - or -
> getNewFiles(snapshot1, snapshot2)  # These names are probably too generic
> getDeletedFiles(snapshot1, snapshot2)
> getUpdatedFiles(snapshot1, snapshot2)
> - or -
> setdiff(snapshot1, snapshot2)  # Unclear how this should treat updated files
>
> This approach does have the difficulty that users could attempt to
> compare snapshots that were taken with different options and that
> can't be compared, but that should be an easy error to detect.

I don't want to add too many new functions. The general R style is to
have functions that do a lot, rather than have a lot of different
functions to achieve different parts of related tasks. This is better
for interactive use (fewer functions to remember, a simpler help system
to navigate), though it probably results in less readable code.

I can see an argument for two functions (a get and a compare), but I
don't think there are many cases where doing two gets and comparing the
snapshots would be worth the extra runtime. (It's extra because
file.info is only a little faster than list.files, and it would be
unavoidable to call both twice in that version. Using the timestamp
file avoids one of those calls, and replaces the other with file_test,
which takes a similar amount of time. So overall it's about 20-25%
faster.) It also makes the code a bit more complicated, i.e. three
calls (get, get, compare) instead of two (get, compare).

Thanks for your comments.

Duncan Murdoch

> Karl
> On Wed, Sep 4, 2013 at 10:53 AM, Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
>> [...]
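
For comparison, the two-function design discussed in this exchange (one "get", one "compare", with modification times stored in the snapshot rather than in an external timestamp file) might look roughly like the sketch below. The function names are borrowed from the proposal; the bodies, the "fileSnapshot" class name and the choice of the size and mtime columns are assumptions made here for illustration, and the sketch assumes that full.names is not passed through the dots.

```r
# Hypothetical sketch of a get/compare pair; not the posted changedFiles().
takeFileSnapshot <- function(directory, ...) {
    # Names relative to 'directory'; extra arguments go to list.files()
    names <- list.files(directory, ...)
    info <- file.info(file.path(directory, names))[, c("size", "mtime"),
                                                   drop = FALSE]
    rownames(info) <- names
    structure(list(directory = directory, info = info),
              class = "fileSnapshot")
}

compareFileSnapshots <- function(before, after) {
    pre <- rownames(before$info)
    post <- rownames(after$info)
    common <- intersect(pre, post)
    # A common file counts as changed if its size or mtime differs
    differs <- before$info[common, "size"] != after$info[common, "size"] |
        before$info[common, "mtime"] != after$info[common, "mtime"]
    list(added = setdiff(post, pre),
         deleted = setdiff(pre, post),
         changed = common[differs],
         unchanged = common[!differs])
}

# Usage: three calls (get, get, compare)
demo <- tempfile("demo")
dir.create(demo)
writeLines("a", file.path(demo, "file1"))
writeLines("b", file.path(demo, "file2"))
before <- takeFileSnapshot(demo)
writeLines("a longer line", file.path(demo, "file2"))  # size changes
writeLines("c", file.path(demo, "file3"))              # new file
unlink(file.path(demo, "file1"))                       # deleted file
after <- takeFileSnapshot(demo)
result <- compareFileSnapshots(before, after)
print(result)
```

The usage part shows the trade-off raised above: file.info is called for every file in both snapshots, whereas the timestamp approach replaces the second round with file_test.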
On Wed, Sep 4, 2013 at 1:53 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> In a number of places internal to R, we need to know which files have
> changed (e.g. after building a vignette). I've just written a general
> purpose function "changedFiles" that I'll probably commit to R-devel.
> Comments on the design (or bug reports) would be appreciated.
> The source for the function and the Rd page for it are inline below.
This looks like a useful function. Thanks for writing it. I have only one (picky) comment below.
> [...]
> \examples{
> # Create some files in a temporary directory
> dir <- tempfile()
> dir.create(dir)

Should a different name than 'dir' be used, since 'dir' is a base function? Further, if someone is not very familiar with R (or just not in "R mode" at the time of reading), they might think that 'dir.create' is calling the create member of the object named 'dir' that you just made.

Scott

> writeBin(1, file.path(dir, "file1"))
> writeBin(2, file.path(dir, "file2"))
> dir.create(file.path(dir, "dir"))
> # Take a snapshot
> snapshot <- changedFiles(dir, file.info=TRUE, md5sum=TRUE)
> # Change one of the files
> writeBin(3, file.path(dir, "file2"))
> # Display the detected changes
> changedFiles(snapshot)
> changedFiles(snapshot)$changes
> }
> \keyword{utilities}
> \keyword{file}
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Scott Kostyshak
Economics PhD Candidate
Princeton University
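
To make the naming point concrete: a variable called dir does not prevent R from finding the dir() function, because a call looks up function objects only, so the posted example runs as written. A distinct name may still read more clearly. The sketch below is a minimal illustration; tmpdir is a name chosen here, not one taken from the posted example.

```r
# A character variable that shares its name with base::dir()
dir <- tempfile()
dir.create(dir)                       # base::dir.create, applied to the path
writeBin(1, file.path(dir, "file1"))
listing <- dir(dir)                   # the call still finds the function dir()
print(listing)

# The same steps with a non-clashing (hypothetical) variable name
tmpdir <- tempfile()
dir.create(tmpdir)
writeBin(1, file.path(tmpdir, "file1"))
print(list.files(tmpdir))
```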
On 13-09-04 11:36 PM, Scott Kostyshak wrote:
On Wed, Sep 4, 2013 at 1:53 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
In a number of places internal to R, we need to know which files have changed (e.g. after building a vignette). I've just written a general purpose function "changedFiles" that I'll probably commit to R-devel. Comments on the design (or bug reports) would be appreciated. The source for the function and the Rd page for it are inline below.
This looks like a useful function. Thanks for writing it. I have only one (picky) comment below.
----- changedFiles.R:
changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
file.info = NULL,
md5sum = FALSE, full.names = FALSE, ...) {
dosnapshot <- function(args) {
fullnames <- do.call(list.files, c(full.names = TRUE, args))
names <- do.call(list.files, c(full.names = full.names, args))
if (isTRUE(file.info) || (is.character(file.info) &&
length(file.info))) {
info <- file.info(fullnames)
rownames(info) <- names
if (isTRUE(file.info))
file.info <- c("size", "isdir", "mode", "mtime")
} else
info <- data.frame(row.names=names)
if (md5sum)
info <- data.frame(info, md5sum = tools::md5sum(fullnames))
list(info = info, timestamp = timestamp, file.info = file.info,
md5sum = md5sum, full.names = full.names, args = args)
}
if (missing(snapshot) || !inherits(snapshot, "changedFilesSnapshot")) {
if (length(timestamp) == 1)
file.create(timestamp)
if (missing(snapshot)) snapshot <- "."
pre <- dosnapshot(list(path = snapshot, ...))
pre$pre <- pre$info
pre$info <- NULL
pre$wd <- getwd()
class(pre) <- "changedFilesSnapshot"
return(pre)
}
if (missing(timestamp)) timestamp <- snapshot$timestamp
if (missing(file.info) || isTRUE(file.info)) file.info <-
snapshot$file.info
if (identical(file.info, FALSE)) file.info <- NULL
if (missing(md5sum)) md5sum <- snapshot$md5sum
if (missing(full.names)) full.names <- snapshot$full.names
pre <- snapshot$pre
savewd <- getwd()
on.exit(setwd(savewd))
setwd(snapshot$wd)
args <- snapshot$args
newargs <- list(...)
args[names(newargs)] <- newargs
post <- dosnapshot(args)$info
prenames <- rownames(pre)
postnames <- rownames(post)
added <- setdiff(postnames, prenames)
deleted <- setdiff(prenames, postnames)
common <- intersect(prenames, postnames)
if (length(file.info)) {
preinfo <- pre[common, file.info]
postinfo <- post[common, file.info]
changes <- preinfo != postinfo
}
else changes <- matrix(logical(0), nrow = length(common), ncol = 0,
dimnames = list(common, character(0)))
if (length(timestamp))
changes <- cbind(changes, Newer = file_test("-nt", common,
timestamp))
if (md5sum) {
premd5 <- pre[common, "md5sum"]
postmd5 <- post[common, "md5sum"]
changes <- cbind(changes, md5sum = premd5 != postmd5)
}
changes1 <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop = FALSE]
changed <- rownames(changes1)
structure(list(added = added, deleted = deleted, changed = changed,
unchanged = setdiff(common, changed), changes = changes), class =
"changedFiles")
}
print.changedFilesSnapshot <- function(x, ...) {
cat("changedFiles snapshot:\n timestamp = \"", x$timestamp, "\"\n
file.info = ",
if (length(x$file.info)) paste(paste0('"', x$file.info, '"'),
collapse=","),
"\n md5sum = ", x$md5sum, "\n args = ", deparse(x$args, control =
NULL), "\n", sep="")
x
}
print.changedFiles <- function(x, ...) {
if (length(x$added)) cat("Files added:\n", paste0(" ", x$added,
collapse="\n"), "\n", sep="")
if (length(x$deleted)) cat("Files deleted:\n", paste0(" ", x$deleted,
collapse="\n"), "\n", sep="")
changes <- x$changes
changes <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop=FALSE]
changes <- changes[, colSums(changes, na.rm = TRUE) > 0, drop=FALSE]
if (nrow(changes)) {
cat("Files changed:\n")
print(changes)
}
x
}
----------------------
--- changedFiles.Rd:
\name{changedFiles}
\alias{changedFiles}
\alias{print.changedFiles}
\alias{print.changedFilesSnapshot}
\title{
Detect which files have changed
}
\description{
On the first call, \code{changedFiles} takes a snapshot of a selection of
files. In subsequent calls, it takes another snapshot, and returns an object
containing data on the differences between the two snapshots. The snapshots
need not be the same directory; this could be used to compare two directories.
}
\usage{
changedFiles(snapshot, timestamp = tempfile("timestamp"), file.info = NULL,
md5sum = FALSE, full.names = FALSE, ...)
}
\arguments{
\item{snapshot}{
The path to record, or a previous snapshot. See the Details.
}
\item{timestamp}{
The name of a file to write at the time the initial snapshot is taken. In
subsequent calls, modification times of files will be compared to this file,
and newer files will be reported as changed. Set to \code{NULL} to skip this
test.
}
\item{file.info}{
A vector of columns from the result of the \code{file.info} function, or a
logical value. If \code{TRUE}, columns
\code{c("size", "isdir", "mode", "mtime")} will be used. Set to \code{FALSE}
or \code{NULL} to skip this test. See the Details.
}
\item{md5sum}{
A logical value indicating whether MD5 summaries should be taken as part of
the snapshot.
}
\item{full.names}{
A logical value indicating whether full names (as in
\code{\link{list.files}}) should be recorded.
}
\item{\dots}{
Additional parameters to pass to \code{\link{list.files}} to control the set
of files in the snapshots.
}
}
\details{
This function works in two modes. If the \code{snapshot} argument is missing
or is not of S3 class \code{"changedFilesSnapshot"}, it is used as the
\code{path} argument to \code{\link{list.files}} to obtain a list of files.
If it is of class \code{"changedFilesSnapshot"}, then it is taken to be the
baseline snapshot, and a new snapshot is taken and compared with it. In the
latter case, missing arguments default to match those from the initial
snapshot.
If the \code{timestamp} argument is length 1, a file with that name is created
in the current directory during the initial snapshot, and
\code{\link{file_test}} is used to compare the age of all files to it during
subsequent calls.
If the \code{file.info} argument is \code{TRUE} or it contains a non-empty
character vector, the indicated columns from the result of a call to
\code{\link{file.info}} will be recorded and compared.
If \code{md5sum} is \code{TRUE}, the \code{tools::\link{md5sum}} function
will be called to record the 32-character hexadecimal MD5 checksum for each
file, and these values will be compared.
}
\value{
In the initial snapshot phase, an object of class
\code{"changedFilesSnapshot"} is returned. This is a list containing the
fields
\item{pre}{a data frame whose row names are the filenames, and whose columns
contain the requested snapshot data}
\item{timestamp, file.info, md5sum, full.names}{a record of the arguments in
the initial call}
\item{args}{other arguments passed via \code{...} to
\code{\link{list.files}}.}
In the comparison phase, an object of class \code{"changedFiles"} is
returned. This is a list containing
\item{added, deleted, changed, unchanged}{character vectors of filenames from
the before and after snapshots, with obvious meanings}
\item{changes}{a logical matrix with a row for each common file, and a column
for each comparison test. \code{TRUE} indicates a change in that test.}
\code{\link{print}} methods are defined for each of these types. The
\code{\link{print}} method for \code{"changedFilesSnapshot"} objects
displays the arguments used to produce it, while the one for
\code{"changedFiles"} displays the \code{added}, \code{deleted}
and \code{changed} fields if non-empty, and a submatrix of the \code{changes}
matrix containing all of the \code{TRUE} values.
}
\author{
Duncan Murdoch
}
\seealso{
\code{\link{file.info}}, \code{\link{file_test}}, \code{\link{md5sum}}.
}
\examples{
# Create some files in a temporary directory
dir <- tempfile()
dir.create(dir)
Should a different name than 'dir' be used since 'dir' is a base function?
Such as?
Further, if someone is not very familiar with R (or just not in "R mode" at the time of reading), they might think that 'dir.create' is calling the create member of the object named 'dir' that you just made.
dir.create is an existing function. I wouldn't have named it that, but that's its name. Duncan Murdoch
Scott
writeBin(1, file.path(dir, "file1"))
writeBin(2, file.path(dir, "file2"))
dir.create(file.path(dir, "dir"))
# Take a snapshot
snapshot <- changedFiles(dir, file.info=TRUE, md5sum=TRUE)
# Change one of the files
writeBin(3, file.path(dir, "file2"))
# Display the detected changes
changedFiles(snapshot)
changedFiles(snapshot)$changes
}
\keyword{utilities}
\keyword{file}
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Scott Kostyshak Economics PhD Candidate Princeton University
Dear Duncan,
This certainly looks useful. Might you consider adding the ability to supply an alternative digest function? Details below.
I often use a homemade "make" type function which starts by looking at modification times, e.g. in a private package:
https://github.com/jefferis/nat.utils/blob/master/R/make.r
For some of my work, I use hash functions. However, because I typically work with many large files, I often use a special digest process, e.g. using the CRC checksum embedded in a gzip file directly, or hashing only the part of a large file that is (almost) certain to change.
Perhaps (code unchecked) along the lines of:
changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
    file.info = NULL, digest = FALSE, digestfun=NULL,
    full.names = FALSE, ...)
if(digest){
    if(is.null(digestfun)) digestfun=tools::md5sum
    else digestfun=match.fun(digestfun)
    info <- data.frame(info, digest = digestfun(fullnames))
}
etc
OR alternatively using only one argument:
changedFiles <- function(snapshot, timestamp = tempfile("timestamp"),
    file.info = NULL, digest = FALSE, full.names = FALSE, ...)
if(is.logical(digest)){
    if(digest) digestfun=tools::md5sum
} else {
    # Assume that digest specifies a function that we want to use
    digestfun=match.fun(digest)
    digest=TRUE
}
if(digest)
    info <- data.frame(info, digest = digestfun(fullnames))
etc
Many thanks,
Greg.
-- Gregory Jefferis, PhD Tel: 01223 267048 Division of Neurobiology MRC Laboratory of Molecular Biology Francis Crick Avenue Cambridge Biomedical Campus Cambridge, CB2 OQH, UK http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis http://jefferislab.org http://flybrain.stanford.edu
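The single-argument variant suggested above can be sketched as follows. This is only an illustration under stated assumptions: `headDigest` and `resolveDigest` are hypothetical names that do not appear in the thread, and hashing the first 1024 bytes stands in for "the part of a large file that is (almost) certain to change".

```r
# Hypothetical helper (not from the thread): MD5 of only the first `n` bytes
# of each file, computed by copying that prefix to a temporary file.
headDigest <- function(files, n = 1024L) {
  vapply(files, function(f) {
    prefix <- readBin(f, what = "raw", n = n)
    tmp <- tempfile("prefix")
    on.exit(unlink(tmp))
    writeBin(prefix, tmp)
    unname(tools::md5sum(tmp))
  }, character(1))
}

# Hypothetical helper: turn a `digest` argument that is either a logical or a
# function (or the name of one) into the function to call, or NULL for "none".
resolveDigest <- function(digest = FALSE) {
  if (is.logical(digest)) {
    if (isTRUE(digest)) tools::md5sum else NULL
  } else {
    match.fun(digest)
  }
}

# Two files that share their first 2048 bytes and differ only in the last byte
dir <- tempfile("digestdemo")
dir.create(dir)
f1 <- file.path(dir, "a.bin")
f2 <- file.path(dir, "b.bin")
writeBin(as.raw(c(rep(1, 2048), 7)), f1)
writeBin(as.raw(c(rep(1, 2048), 9)), f2)

head_digests <- resolveDigest(headDigest)(c(f1, f2))  # prefix-only digests
full_digests <- resolveDigest(TRUE)(c(f1, f2))        # full-file MD5
```

Here the prefix digests agree while the full-file digests differ, which shows the trade-off: a partial digest is cheaper on large files but only detects changes in the region it reads.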
On 05/09/2013 12:32 PM, Dr Gregory Jefferis wrote:
Dear Duncan, This certainly looks useful. Might you consider adding the ability to supply an alternative digest function? Details below.
Thanks, that's a good idea. Duncan Murdoch
This approach does have the difficulty that users could attempt to compare snapshots that were taken with different options and that can't be compared, but that should be an easy error to detect.
FYI I implemented that approach in testthat: https://github.com/hadley/testthat/blob/master/R/watcher.r - it's a bit more general, because it just sits in the background and listens for changes, dispatching to a callback. Hadley
Chief Scientist, RStudio http://had.co.nz/
On Thu, Sep 5, 2013 at 6:48 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 13-09-04 11:36 PM, Scott Kostyshak wrote:
On Wed, Sep 4, 2013 at 1:53 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
In a number of places internal to R, we need to know which files have changed (e.g. after building a vignette). I've just written a general purpose function "changedFiles" that I'll probably commit to R-devel. Comments on the design (or bug reports) would be appreciated. The source for the function and the Rd page for it are inline below.
This looks like a useful function. Thanks for writing it. I have only one (picky) comment below.
\examples{
# Create some files in a temporary directory
dir <- tempfile()
dir.create(dir)
Should a different name than 'dir' be used since 'dir' is a base function?
Such as?
'dir_', 'dir1', 'temp_dir', none of which is a base function. I thought that it was not recommended to create objects with the same name as functions, but perhaps this recommended practice is not agreed on.
Further, if someone is not very familiar with R (or just not in "R mode" at the time of reading), they might think that 'dir.create' is calling the create member of the object named 'dir' that you just made.
dir.create is an existing function. I wouldn't have named it that, but that's its name.
I meant that if the object is called, e.g. 'temp_dir', one will not think that 'dir.create' is a call to the 'create' member of 'dir' because there is no 'dir' object apart from the base function. But anyone with experience in R would know that this is not how R parses 'dir.create'. In any case, I shouldn't waste your time on such a minor and subjective thing. Scott
Duncan Murdoch
Scott
writeBin(1, file.path(dir, "file1"))
writeBin(2, file.path(dir, "file2"))
dir.create(file.path(dir, "dir"))
# Take a snapshot
snapshot <- changedFiles(dir, file.info=TRUE, md5sum=TRUE)
# Change one of the files
writeBin(3, file.path(dir, "file2"))
# Display the detected changes
changedFiles(snapshot)
changedFiles(snapshot)$changes
}
\keyword{utilities}
\keyword{file}
-- Scott Kostyshak Economics PhD Candidate Princeton University
Comments inline:
On Wed, Sep 4, 2013 at 6:10 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 13-09-04 8:02 PM, Karl Millar wrote:
Hi Duncan, I think this functionality would be much easier to use and understand if you split the functionality of taking snapshots and comparing them into separate functions.
Yes, that's another possibility. Some more comment below...
In addition, the 'timestamp' functionality seems both confusing and brittle to me. I think it would be better to store file modification times in the snapshot and use those instead of an external file. Maybe:
You can do that, using file.info = "mtime", but the file.info snapshots are quite a bit slower than using the timestamp file (when looking at a big recursive directory of files).
Sorry, I completely failed to explain what I was thinking here. There are a number of issues here, but the biggest one is that you're implicitly assuming that files that get modified will have mtimes that come after the timestamp file was created. This isn't always true, with the most notable exception being if you download a package from CRAN and untar it, the mtimes are usually well in the past (at least with GNU tar on a linux system), so this code won't notice that the files have changed. It may be a good idea to store the file sizes as well, which would help prevent false negatives in the (rare IIRC) cases where the contents have changed but the mtime values have not. Since you already need to call file.info() to get the mtime, this shouldn't increase the runtime, and the extra memory needed is fairly modest.
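A minimal sketch of the failure mode described above, assuming a writable temporary directory. `Sys.setFileTime` is used here only to imitate the old modification times that untarring an archive can produce; the file names are made up for illustration.

```r
dir <- tempfile("mtimedemo")
dir.create(dir)
stamp  <- file.path(dir, "timestamp")
target <- file.path(dir, "file1")

writeBin(1, target)                          # one double: 8 bytes
size_before <- file.info(target)$size
file.create(stamp)                           # timestamp file written "now"

writeBin(c(1, 2), target)                    # contents change: 16 bytes
Sys.setFileTime(target, Sys.time() - 3600)   # backdate the mtime by an hour

# The timestamp test misses the change, while the size comparison catches it
newer        <- utils::file_test("-nt", target, stamp)
size_changed <- file.info(target)$size != size_before
```

In this setup `newer` is FALSE even though the file was rewritten after the timestamp file was created, while `size_changed` is TRUE.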
# Take a snapshot of the files.
takeFileSnapshot(directory, file.info = TRUE, md5sum = FALSE, full.names = FALSE, recursive = TRUE, ...)
# Take a snapshot using the same options as used for snapshot.
retakeFileSnapshot(snapshot, directory = snapshot$directory) {
    takeFileSnapshot(directory, file.info = snapshot$file.info, md5sum = snapshot$md5sum, etc)
}
compareFileSnapshots(snapshot1, snapshot2)
- or -
getNewFiles(snapshot1, snapshot2) # These names are probably too generic
getDeletedFiles(snapshot1, snapshot2)
getUpdatedFiles(snapshot1, snapshot2)
- or -
setdiff(snapshot1, snapshot2) # Unclear how this should treat updated files
This approach does have the difficulty that users could attempt to compare snapshots that were taken with different options and that can't be compared, but that should be an easy error to detect.
I don't want to add too many new functions. The general R style is to have functions that do a lot, rather than have a lot of different functions to achieve different parts of related tasks. This is better for interactive use (fewer functions to remember, a simpler help system to navigate), though it probably results in less readable code.
This is somewhat more nuanced and not particular to interactive use IMHO. Having functions that do a lot is good, _as long as the semantics are always consistent_. For example, lm() does a huge amount and has a wide variety of ways that you can specify your data, but it basically does the same thing no matter how you use it. On the other hand, if you have a function that does different things depending on how you call it (e.g. reshape()) then it's easy to remember the function name, but much harder to remember how to call it correctly, harder to understand the documentation and less readable.
I can see an argument for two functions (a get and a compare), but I don't think there are many cases where doing two gets and comparing the snapshots would be worth the extra runtime. (It's extra because file.info is only a little faster than list.files, and it would be unavoidable to call both twice in that version. Using the timestamp file avoids one of those calls, and replaces the other with file_test, which takes a similar amount of time. So overall it's about 20-25% faster.) It also makes the code a bit more complicated, i.e. three calls (get, get, compare) instead of two (get, compare).
I think a 'snapshotDirectory' and 'compareDirectoryToSnapshot' combination might work well. Thanks, Karl
On 13-09-06 2:46 AM, Karl Millar wrote:
Sorry, I completely failed to explain what I was thinking here. There are a number of issues here, but the biggest one is that you're implicitly assuming that files that get modified will have mtimes that come after the timestamp file was created. This isn't always true, with the most notable exception being if you download a package from CRAN and untar it, the mtimes are usually well in the past (at least with GNU tar on a linux system), so this code won't notice that the files have changed. It may be a good idea to store the file sizes as well, which would help prevent false negatives in the (rare IIRC) cases where the contents have changed but the mtime values have not. Since you already need to call file.info() to get the mtime, this shouldn't increase the runtime, and the extra memory needed is fairly modest.
If we need to use file.info(), then I store the complete result, so I have size if I have mtime.
I think a 'snapshotDirectory' and 'compareDirectoryToSnapshot' combination might work well.
I have split it into two functions. The compare function has two snapshot arguments, but if only the "before" is given, it will compute the "after" from the current file system. This makes a cleaner design, thanks for the suggestion. About the function names: selection of files for the snapshot is done by list.files, and that function's "path" argument can be a vector, so multiple directories can be recorded at once. I've chosen "fileSnapshot" and "changedFiles" so far, but those aren't perfect. I need to do a little more cleanup and testing, then I'll put the new version online somewhere. Duncan Murdoch
I have now put the code into a temporary package for testing; if anyone is interested, for a few days it will be downloadable from
fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz
This uses two functions:
fileSnapshot -- takes a snapshot
changedFiles -- compares two snapshots, or one snapshot to the current file system
Duncan Murdoch
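For readers who want to try the two-function design, here is a hedged usage sketch. It assumes an R version in which `fileSnapshot` and `changedFiles` are available in the utils package; the temporary testpkg mentioned above may differ in its details.

```r
dir <- tempfile("snapdemo")
dir.create(dir)
writeBin(1, file.path(dir, "file1"))
writeBin(2, file.path(dir, "file2"))

# "Before" snapshot, recording file.info columns and MD5 sums
before <- utils::fileSnapshot(dir, file.info = TRUE, md5sum = TRUE)

writeBin(3, file.path(dir, "file2"))   # modify an existing file
writeBin(4, file.path(dir, "file3"))   # add a new file

# With only the "before" snapshot given, the "after" state is taken
# from the current file system
changes <- utils::changedFiles(before)
print(changes)
```

The result is a list with `added`, `deleted`, `changed` and `unchanged` components, plus the logical `changes` matrix described in the Rd page.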
On 06/09/2013 2:20 PM, Duncan Murdoch wrote:
I have now put the code into a temporary package for testing; if anyone is interested, for a few days it will be downloadable from fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz
Sorry, error in the URL. It should be http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz (This time I tested it! Thanks Scott for the heads-up.) Duncan Murdoch
On Fri, Sep 6, 2013 at 3:46 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 06/09/2013 2:20 PM, Duncan Murdoch wrote:
I have now put the code into a temporary package for testing; if anyone is interested, for a few days it will be downloadable from fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz
Sorry, error in the URL. It should be http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz
Works well. A couple of things I noticed:
(1)
md5sum is being called on directories, which causes warnings. (If this
is not viewed as undesirable, please ignore the rest of this comment.)
Should this be the responsibility of the user (by passing arguments to
list.files)? In the example, changing
fileSnapshot(dir, file.info=TRUE, md5sum=TRUE)
to
fileSnapshot(dir, file.info=TRUE, md5sum=TRUE, include.dirs=FALSE,
recursive=TRUE)
gets rid of the warnings. But perhaps the user just wants to exclude
directories for the md5sum calculations. This can't be controlled from
fileSnapshot.
Or, should the "if (md5sum)" chunk subset "fullnames" using file_test
or file.info to exclude directories (and then fill in the directories
with NA)?
(2)
If I run example(changedFiles) several times, sometimes I get:
chngdF> changedFiles(snapshot)
File changes:
      mtime md5sum
file2  TRUE   TRUE
and other times I get:
chngdF> changedFiles(snapshot)
File changes:
      md5sum
file2   TRUE
I wonder why.
Scott
sessionInfo()
R Under development (unstable) (2013-08-31 r63780)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] testpkg_1.0
loaded via a namespace (and not attached):
[1] tools_3.1.0
-- Scott Kostyshak Economics PhD Candidate Princeton University
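For reference, the second option raised in (1), computing checksums only for regular files and filling in NA for directories, could look roughly like the sketch below. It is only an illustration: md5_files_only is a made-up helper name, not part of testpkg or of R, and it assumes that reporting NA for directories is the desired behaviour.

```r
# Hypothetical helper (not from testpkg): checksum regular files only,
# leaving NA for directories so that tools::md5sum() is never called on them.
md5_files_only <- function(fullnames) {
  sums <- rep(NA_character_, length(fullnames))
  names(sums) <- fullnames
  # file_test("-f", x) is TRUE for paths that exist and are not directories
  is_file <- utils::file_test("-f", fullnames)
  sums[is_file] <- tools::md5sum(fullnames[is_file])
  sums
}

# Small demonstration in a scratch directory holding one file and one subdirectory
d <- tempfile("snap")
dir.create(d)
dir.create(file.path(d, "sub"))
writeLines("hello", file.path(d, "f.txt"))
res <- md5_files_only(list.files(d, full.names = TRUE))
print(res)
```

The directory entry comes back as NA and no warning is raised, because only the regular file is passed to tools::md5sum().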
On Fri, Sep 6, 2013 at 7:40 PM, Scott Kostyshak <skostysh at princeton.edu> wrote:
If I run example(changedFiles) several times, sometimes I get:
chngdF> changedFiles(snapshot)
File changes:
      mtime md5sum
file2  TRUE   TRUE
and other times I get:
chngdF> changedFiles(snapshot)
File changes:
      md5sum
file2   TRUE
I wonder why.
Putting the following in-between snapshot and writeBin in the example leads to consistent output:
# allow for mtime to change
Sys.sleep(.1)
Scott
On 13-09-06 7:40 PM, Scott Kostyshak wrote:
Works well. A couple of things I noticed:
(1)
md5sum is being called on directories, which causes warnings. (If this is not viewed as undesirable, please ignore the rest of this comment.) Should this be the responsibility of the user (by passing arguments to list.files)? In the example, changing
fileSnapshot(dir, file.info=TRUE, md5sum=TRUE)
to
fileSnapshot(dir, file.info=TRUE, md5sum=TRUE, include.dirs=FALSE, recursive=TRUE)
gets rid of the warnings. But perhaps the user just wants to exclude directories for the md5sum calculations. This can't be controlled from fileSnapshot.
I don't see the warnings, I just get NA values. I'll try to see why there's a difference. (One possibility is my platform (Windows); another is that I'm generally testing in R-patched and R-devel rather than the 3.0.1 release version.) I would rather suppress the warnings than make the user avoid them.
Or, should the "if (md5sum)" chunk subset "fullnames" using file_test
or file.info to exclude directories (and then fill in the directories
with NA)?
(2)
If I run example(changedFiles) several times, sometimes I get:
chngdF> changedFiles(snapshot)
File changes:
      mtime md5sum
file2  TRUE   TRUE
and other times I get:
chngdF> changedFiles(snapshot)
File changes:
      md5sum
file2   TRUE
I wonder why.
Sometimes the example runs so quickly that the new version has exactly the same modification time as the original. That's the risk of the mtime check. If you put a delay between, you'll get consistent results. Duncan Murdoch
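The effect can be reproduced without the package. The sketch below is only an illustration under assumed names (the file and its contents are made up): it rewrites a file immediately after creating it and compares what mtime and md5sum each report. Whether the mtime differs depends on the timestamp resolution of the file system, which is why no result is claimed for it here; the checksum changes whenever the contents do.

```r
# Illustration only: the file name and contents are invented for this sketch.
f <- tempfile("file2")
writeBin(1:4, f)
before <- list(mtime = file.info(f)$mtime,
               md5   = unname(tools::md5sum(f)))

# Uncomment to make an mtime difference likely even on file systems
# with coarse timestamp resolution:
# Sys.sleep(1)
writeBin(5:8, f)
after <- list(mtime = file.info(f)$mtime,
              md5   = unname(tools::md5sum(f)))

# May be TRUE or FALSE, depending on how quickly the second write happened
mtime_changed <- !identical(before$mtime, after$mtime)
# The contents differ, so the checksums differ
md5_changed <- !identical(before$md5, after$md5)

print(c(mtime = mtime_changed, md5sum = md5_changed))
```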
Hi Duncan,
I like the interface of this version a lot better, but there's still a
bunch of implementation details that need fixing:
* As previously mentioned, there are important cases where the mtime
values change in ways that this code doesn't detect.
* If the timestamp file (which is usually in the temp directory) gets
deleted (which can happen after a moderate amount of time of
inactivity on some systems), then the file_test('-nt', ...) will
always return false, even if the file has changed.
* If files get added or deleted between the two calls to list.files in
fileSnapshot, it will fail with an error.
* If the path is on a remote file system, tempdir is local, and
there's significant clock skew, then you can get incorrect results.
Unfortunately, these aren't just theoretical scenarios -- I've had the
misfortune to run up against all of them in the past.
I've attached code that's loosely based on your implementation that
solves these problems AFAICT. Alternatively, Hadley's code handles
all of these correctly, with the exception that compare_state doesn't
handle the case where safe_digest returns NA very well.
Regards,
Karl
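To illustrate the second point above (a deleted timestamp file makes the "-nt" test report FALSE for everything), here is a small sketch of a guard. newer_than_timestamp is a hypothetical helper written for this note, not the attached code and not part of R; it assumes that returning NA ("unknown") with a warning is preferable to silently reporting "unchanged".

```r
# Hypothetical guard around utils::file_test("-nt", ...).
newer_than_timestamp <- function(files, timestamp) {
  if (length(timestamp) != 1L || !file.exists(timestamp)) {
    warning("timestamp file is missing; cannot tell which files are newer")
    return(rep(NA, length(files)))
  }
  utils::file_test("-nt", files, timestamp)
}

stamp <- tempfile("timestamp")
file.create(stamp)
unlink(stamp)                 # simulate the system cleaning up the temp file

f <- tempfile("data")
writeLines("x", f)            # written after the timestamp was created

# Unguarded test: FALSE, even though f was created after the timestamp
unguarded <- utils::file_test("-nt", f, stamp)
# Guarded test: NA, i.e. "unknown" rather than "unchanged"
guarded <- suppressWarnings(newer_than_timestamp(f, stamp))
print(c(unguarded = unguarded, guarded = guarded))
```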
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On 13-09-06 9:21 PM, Karl Millar wrote:
Hi Duncan,
I like the interface of this version a lot better, but there's still a
bunch of implementation details that need fixing:
* As previously mentioned, there are important cases where the mtime
values change in ways that this code doesn't detect.
* If the timestamp file (which is usually in the temp directory) gets
deleted (which can happen after a moderate amount of time of
inactivity on some systems), then the file_test('-nt', ...) will
always return false, even if the file has changed.
If that happened without user intervention, I think it would break other things in R -- the temp directory is supposed to last for the whole session. But I should be checking anyway.
* If files get added or deleted between the two calls to list.files in fileSnapshot, it will fail with an error.
Yours won't work if path contains more than one directory. This is probably a reasonable restriction, but it's inconsistent with list.files, so I'd like to avoid it if I can find a way. Duncan Murdoch
On Fri, Sep 6, 2013 at 7:03 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
If that happened without user intervention, I think it would break other things in R -- the temp directory is supposed to last for the whole session. But I should be checking anyway.
Yes, it does break other things in R -- my experience has been that the help system seems to be the one that is impacted the most by this. FWIW, I've never seen the entire R temp directory deleted, just individual files and subdirectories in it, but even that probably depends on how the machine is configured. I suspect only a few users ever notice this, but my R use is probably somewhat anomalous and I think it only happens to R sessions that I haven't used for a few days.
* If files get added or deleted between the two calls to list.files in fileSnapshot, it will fail with an error.
Yours won't work if path contains more than one directory. This is probably a reasonable restriction, but it's inconsistent with list.files, so I'd like to avoid it if I can find a way.
I'm currently unsure what the behaviour when comparing snapshots with
multiple directories should be.
Presumably we should have the property that (horribly abusing notation
for succinctness):
compareSnapshots(c(a1, a2), c(a1, a2))
is the same as concatenating (in some form)
compareSnapshots(a1, a1) and compareSnapshots(a2, a2)
and there's a bunch of ways we could concatenate -- we could return a
list of results, or a single result where each of the 'added, deleted,
modified' fields are a list, or where we concatenate the 'added,
deleted, modified' fields together into three simple vectors.
Concatenating the vectors together like this is appealing, but unless
you're using the full names, it doesn't include the information of
which directory the changes are in, and using the full names doesn't
work in the case where you're comparing different sets of directories,
e.g. compareSnapshots(c(a1, a2), c(b1, b2)), where there is no
sensible choice for a full name. The list options don't have this
problem, but are harder to work with, particularly for the common case
where there's only a single directory. You'd also have to be somewhat
careful with filenames that occur in both directories.
Maybe I'm just being dense, but at the moment I don't see a way to do this that's
clear, easy to use, and wouldn't confuse users.
Karl
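As a concrete version of the notation above, the sketch below shows the single-directory comparison and the "list of results" way of combining several directories. Everything in it is an assumption made for illustration: diff_snapshots and diff_many are invented names, and a snapshot is reduced to a named character vector mapping file names to checksums, which is not how any of the posted implementations store it.

```r
# Hypothetical helper: 'before' and 'after' are named character vectors
# (file name -> checksum) describing ONE directory.
diff_snapshots <- function(before, after) {
  common <- intersect(names(before), names(after))
  b <- before[common]
  a <- after[common]
  # NA-safe inequality: NA vs. a value counts as a change, NA vs. NA does not
  changed <- xor(is.na(b), is.na(a)) | (!is.na(b) & !is.na(a) & b != a)
  list(added    = setdiff(names(after), names(before)),
       deleted  = setdiff(names(before), names(after)),
       modified = common[unname(changed)])
}

# The "list of results" option: one result per directory, matched by position
diff_many <- function(before_list, after_list)
  Map(diff_snapshots, before_list, after_list)

a1_before <- c(x.R = "aaa", y.R = "bbb", z.R = NA)
a1_after  <- c(y.R = "ccc", z.R = NA, w.R = "ddd")
res <- diff_snapshots(a1_before, a1_after)
str(res)
str(diff_many(list(a1 = a1_before), list(a1 = a1_after)))
```

Keeping one result per directory avoids the naming problem described above, at the cost of an extra level of list indexing in the common single-directory case.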
1 day later
On 13-09-06 11:07 PM, Karl Millar wrote:
Yes, it does break other things in R -- my experience has been that the help system seems to be the one that is impacted the most by this. FWIW, I've never seen the entire R temp directory deleted, just individual files and subdirectories in it, but even that probably depends on how the machine is configured. I suspect only a few users ever notice this, but my R use is probably somewhat anomalous and I think it only happens to R sessions that I haven't used for a few days.
I use Windows and never see this; deleting temp files is up to me, not to the system. But my understanding was that *nix systems should only clean up /tmp on restart, and I don't think an R session will survive a restart. However, you have convinced me that the use of the timestamp file is not beneficial enough to be the default. I'll leave it as an option, but add warnings that it might be unreliable.
* If files get added or deleted between the two calls to list.files in fileSnapshot, it will fail with an error.
Yours won't work if path contains more than one directory. This is probably a reasonable restriction, but it's inconsistent with list.files, so I'd like to avoid it if I can find a way.
I'm currently unsure what the behaviour when comparing snapshots with
multiple directories should be.
Presumably we should have the property that (horribly abusing notation
for succinctness):
compareSnapshots(c(a1, a2), c(a1, a2))
is the same as concatenating (in some form)
compareSnapshots(a1, a1) and compareSnapshots(a2, a2)
and there's a bunch of ways we could concatenate -- we could return a
list of results, or a single result where each of the 'added, deleted,
modified' fields are a list, or where we concatenate the 'added,
deleted, modified' fields together into three simple vectors.
Concatenating the vectors together like this is appealing, but unless
you're using the full names, it doesn't include the information of
which directory the changes are in, and using the full names doesn't
work in the case where you're comparing different sets of directories,
e.g. compareSnapshots(c(a1, a2), c(b1, b2)), where there is no
sensible choice for a full name. The list options don't have this
problem, but are harder to work with, particularly for the common case
where there's only a single directory. You'd also have to be somewhat
careful with filenames that occur in both directories.
Maybe I'm just being dense, but at the moment I don't see a way to do this that's
clear, easy to use, and wouldn't confuse users.
The way I've done this is to require full.names when multiple dirs are on the path. I've reduced it to one list.files() call per dir, by iterating over the path variable and using your approach of calling it with full.names = FALSE, then adding the dir if necessary.
I haven't adopted your change that forces comparison of only size and mtime from file.info. I don't see a big cost in storing whatever file.info returns (which is system dependent; on Windows I don't see the user and group related columns; on Unix I don't see the exe column). Users might want to detect changes to anything there, and I shouldn't make it harder for them.
I've also kept the special-casing of md5sum; it really needs to be wrapped in suppressWarnings() (on Unix only). And I've kept the options to specify what changedFiles checks among the file.info columns; I can see that you might want a snapshot with everything, but sometimes only want to be told about changes in a subset of the attributes.
I've uploaded <http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.1.tar.gz> if anyone is interested.
Duncan Murdoch
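For readers without the tarball, the "one list.files() call per dir" idea might be sketched as follows. This is not the code in testpkg_1.1: list_snapshot_files is an invented name, and the returned data frame layout (a display name plus a full path) is an assumption made for the example.

```r
# Hypothetical sketch: list each directory exactly once with short names,
# then build the full paths ourselves, so the short and full names can
# never disagree even if files appear or vanish while we are listing.
list_snapshot_files <- function(path = ".", full.names = length(path) > 1, ...) {
  if (length(path) > 1 && !full.names)
    stop("'full.names' must be TRUE when 'path' has more than one directory")
  per_dir <- lapply(path, function(dir) {
    short <- list.files(dir, full.names = FALSE, ...)
    full  <- file.path(dir, short)
    data.frame(name = if (full.names) full else short,
               fullname = full,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, per_dir)
}

# Two scratch directories that both contain a file called x.txt
d1 <- tempfile("a"); dir.create(d1)
d2 <- tempfile("b"); dir.create(d2)
file.create(file.path(d1, "x.txt"), file.path(d2, "x.txt"))

res <- list_snapshot_files(c(d1, d2))
print(res)
```

Requiring full names for multiple directories keeps the two x.txt entries distinct, which is the clash between equally named files that came up earlier in the thread.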
On Sun, Sep 8, 2013 at 10:55 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
The way I've done this is to require full.names when multiple dirs are on the path. I've reduced it to one list.files() call per dir, by iterating over the path variable and using your approach of calling it with full.names = FALSE, then adding the dir if necessary. I haven't adopted your change that forces comparison of only size and mtime from file.info. I don't see a big cost in storing whatever file.info returns (which is system dependent; on Windows I don't see the user and group related columns; on Unix I don't see the exe column). Users might want to detect changes to anything there, and I shouldn't make it harder for them. I've also kept the special-casing of md5sum; it really needs to be wrapped in suppressWarnings() (on Unix only). And I've kept the options to specify what changedFiles checks among the file.info columns; I can see that you might want a snapshot with everything, but sometimes only want to be told about changes in a subset of the attributes. I've uploaded <http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.1.tar.gz> if anyone is interested.
Works well. Scott -- Scott Kostyshak Economics PhD Candidate Princeton University
Thanks for everyone's comments on this. I have now committed a version to R-devel. I don't plan to backport it to 3.0.2 (coming out in a couple of weeks), but it should appear in 3.1.0 in the spring, and it's conceivable it could make it into 3.0.3 (not yet scheduled). Duncan Murdoch