Here are 2 functions, which.just.above and which.just.below,
which may help you. They will tell which element in a reference
dataset is the first just above (or just below) each element
in the main dataset (x). They return NA if there is no reference
element above (or below) an element of x. The strict argument
lets you say if the inequalities are strict or if equality is
acceptable.
They are vectorized so are pretty quick.
E.g.,
> which.just.below(c(14,14.5), 11:15, strict=TRUE)
[1] 3 4
> which.just.above(c(14,14.5), 11:15, strict=FALSE)
[1] 4 5
They should work with any class of data that order() and sort()
work on. In particular, POSIXct times work. The attached file
has a demonstration function called 'test' with some examples.
In your case the 'reference' data would be the times at which your
backup measurements were taken and the 'x' data would be the
times of the pings. You can look at the elements of 'reference' just
before and just after each ping (or just the pings that are missing
locations) and decide how to combine the data from the bracketing
reference elements to inpute a location for the ping.
Here are the functions, in case the attachment doesn't make it
through. I'm sure some mailer will throw in some newlines so
it will be corrupted.
"which.just.above" <-
function(x, reference, strict = T)
{
# output[k] will be index of smallest value in reference vector
# larger than x[k]. If strict=F, replace 'larger than' by
# 'larger than or equal to'.
# We should allow NA's in x (but we don't). NA's in reference
# should not be allowed.
if(any(is.na(x)) || any(is.na(reference))) stop("NA's in input")
if(strict)
i <- c(rep(T, length(reference)), rep(F,
length(x)))[order(
c(reference, x))]
else i <- c(rep(F, length(x)), rep(T,
length(reference)))[order(c(
x, reference))]
i <- cumsum(i)[!i] + 1.
i[i > length(reference)] <- NA
# i is length of x and has values in range 1:length(reference)
or NA
# following needed if reference is not sorted
i <- order(reference)[i]
# following needed if x is not sorted
i[order(order(x))]
}
"which.just.below" <-
function(x, reference, strict = T)
{
# output[k] will be index of largest value in reference vector
# less than x[k]. If strict=F, replace 'less than' by
# 'less than or equal to'. Neither x nor reference need be
# sorted, although they should not have NA's (in theory, NA's
# in x are ok, but not in reference).
if(any(is.na(x)) || any(is.na(reference))) stop("NA's in input")
if(!strict)
i <- c(rep(T, length(reference)), rep(F,
length(x)))[order(
c(reference, x))]
else i <- c(rep(F, length(x)), rep(T,
length(reference)))[order(c(
x, reference))]
i <- cumsum(i)[!i]
i[i <= 0] <- NA
# i is length of x and has values in range 1:length(reference)
or NA
# following needed if reference is not sorted
i <- order(reference)[i]
# following needed if x is not sorted
i[order(order(x))]
}
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Tim Clark
Sent: Thursday, May 21, 2009 9:45 PM
To: r-help at r-project.org
Subject: [R] Need a faster function to replace missing data
Dear List,
I need some help in coming up with a function that will take
two data sets, determine if a value is missing in one, find a
value in the second that was taken at about the same time,
and substitute the second value in for where the first should
have been. My problem is from a fish tracking study. We put
acoustic tags in fish and track them for several days.
Location data is supposed to be automatically recorded every
time we detect a "ping" from the fish. Unfortunately the GPS
had some problems and sometimes the fishes depth was recorded
but not its location. I fortunately had a back-up GPS that
was taking location data every five minutes. I would like to
merge the two files, replacing the missing value in the vscan
(automatic) file with the location from the garmin file.
Since we were getting vscan records every 1-2 seconds and
garmin records every 5 minutes, I need to find the right
place in the vscan file to place the garmin record - i.e. the
closest in time, but not greater than 5 minutes. I have
written a function that does this. However, it works with my
test data but locks up my computer with my real data. I have
several million vscan records and several thousand garmin
records. Is there a better way to do this?
My function and test data:
myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00",
"12:20:00")))
names(myvscan)<-c("Latitude","DateTime")
mygarmin<-data.frame(c(20,30,40),times(("12:00:00","12:10:00",
"12:15:00")))
names(mygarmin)<-c("Latitude","DateTime")
minute.diff<-1/24/12 #Time diff is in days, so this is 5 minutes
for (k in 1:nrow(myvscan))
{
if (is.na(myvscan$Latitude[k]))
{
if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) < minute.diff )
{
index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k]))
myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date]
}}}
I appreciate your help and advice.
Aloha,
Tim
Tim Clark
Department of Zoology
University of Hawaii