match and unique
Hi Terry,
On 03/16/2016 08:03 AM, Therneau, Terry M., Ph.D. wrote:
Is the phrase "index <- match(x, sort(unique(x)))" reliable, in the sense that it will never return NA?
This is assuming that match() and unique() will never disagree on
equality between 2 floating point values. I believe they share some
code internally (same hashing routine?), so maybe it's reliable.
Anyway, it's always preferable to not rely on this kind of assumption.
A safer thing to do is to use rank():
r <- rank(x, ties.method="min") # could use "max"
Think of 'r' as a unique ID assigned to each value in 'x'. This ID takes
its values in the (1,length(x)) range but we want it to take its values
in the (1,length(unique(x))) range:
ID_remapping <- cumsum(tabulate(r, nbins=length(r)) != 0L)
index <- ID_remapping[r]
'index' will be the same as 'match(x, sort(unique(x))' but doesn't rely
on the assumption that match() and unique() agree on equality between
2 floating point values.
Unfortunately rank() is very slow, much slower than sort(). Here is a
faster solution based on sort.list(x, na.last=NA, method="quick"):
assignID <- function(x)
{
oo <- sort.list(x, na.last=NA, method="quick")
sorted <- x[oo]
is_unique <- c(TRUE, sorted[-length(sorted)] != sorted[-1L])
sorted_ID <- cumsum(is_unique)
ID <- integer(length(x))
ID[oo] <- sorted_ID
ID
}
'assignID(x)' is also slightly faster than 'match(x, sort(unique(x)))':
x <- runif(5000000)
system.time(index1 <- match(x, sort(unique(x))))
# user system elapsed
# 2.170 0.552 2.725
system.time(index2 <- assignID(x))
# user system elapsed
# 0.885 0.032 0.917
identical(index1, index2)
# [1] TRUE
Cheers,
H.
Context: Calculation of survival curves involves the concept of unique
death times. I've had reported cases in the past where survfit failed,
and it was due to the fact that two "differ by machine precision" values
would sometimes match and sometimes not, depending on how I compared
them. I've dealt with those piecemeal in the past, but am going to do a
code review and make sure that I do things consistently throughout the
survival package. The basic plan will be to change time to an integer,
do all the work, then restore labels at the end. The above line is one
candidate for the first step.
An alternative is index <- as.numeric(factor(x)), with
as.numeric(levels(factor(x))) as the final labeling step. This is a
more severe rounding, is it not? But perhaps it is preferable? The KM
branch of the current survfit routine does this, and I've had one user
report a bug in that
x <- runif(20)
fit <- survfit(Surv(x) ~1)
summary(fit, times=x)
will produce lines with 0, 1 or 2 events when "they all should be 1".
The same issue just came up in an rpart example, sent to me. For coxph
models is may only be a matter of time.
Suggestions and opinions are welcome.
Terry Therneau
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319