Skip to content

Best way to do temporal joins in R?

6 messages · Jonathan Greenberg, Gabor Grothendieck

#
I've been playing with zoo a bit, and it seems ok except it doesn't 
support non-unique time stamps when performing joins.  I have two 
databases which contain a dataframe of a Date object (with the time, not 
just MM/DD/YY), e.g.:

DB 1:
UniqueID,Date1,Data 1,Data 2

DB 2:
Date2, Station, Data 3

We'll say Station can contain three values: A,B and C

DB 1 may have some repeat times, and DB 2 definitely has them -- 
although each Date, Station combo is unique (this DB contains weather 
data collected on the half-hour or fifteen minute interval at a set of 
stations).  I'd like DB2's station and Data3 to be joined with DB1 based 
on the nearest time stamp (interpolating Data3 or not).

Ideally, I'd like a fused database such that I get for each uniqueID in DB1:

UniqueID,Date,Data1,Data2,Station,Data3

Thoughts?  Hints?

--j
#
Sorry for the immediate follow-up, but Phil Spector correctly reminded 
me this is a lot easier for the community I provide some sample data, so 
I'm attaching 3 small CSVs to this email:

species_data_Rexample.csv contains the "field data" (which species was 
ID'd and what time it was ID'd),
temperature_data_Rexample.csv contains the date, time, station ID and 
the temperature "value"

I'd like a dataframe which contains for each unique line in 
species_data_Rexample.csv, a series of lines, one per station, and the 
temperature of the nearest time stamp, or an interpolated value 
(weighted average would be fine, but so would just grabbing the nearest 
value), so for this example I'd like something that looks like the csv 
"fused_data_Rexample.csv"

Thanks!

--j
Jonathan Greenberg wrote:
#
There was nothing attached.

On Mon, Mar 16, 2009 at 3:11 PM, Jonathan Greenberg
<greenberg at ucdavis.edu> wrote:
#
Weird -- the  email was sent through my gmail account, looks like the 
.csvs got intercepted somewhere along the way.  At any rate, I placed 
them on a website:

http://cstars.ucdavis.edu/~jongreen/temp/temporal_join_R/

--j
Gabor Grothendieck wrote:

  
    
#
I am assuming that for each species and station_id that you want
the value in that temp record whose date/time is closest to the
date/time in the species record along with the identifying information
(species, station_id) and date/time of the species.   That interpretation
does give the same answer as in the fused data set you posted.

First we read in temp and use chron to convert the date/times to
chron.  Similarly we do that for species.

Then we define a function which measures the "distance" between
two date/times and we define another function f which takes a
species rowname and merges that row with temp.  Finally
we call lapply that function to species.

library(chron)

temp <- read.csv("temperature_data_Rexample.csv")
temp$dt <- as.chron(paste(temp$date, temp$hour), "%Y%m%d %H%M")

species <- read.csv("species_data_Rexample.csv")
ds <- species$Date_Sampled
species$dt <- chron(sub(" .*", "", ds), gsub("[apm]+$|^.* ", "", ds)) +
	(regexpr("pm", ds) > 0)/2  # add half a day if its pm

mydist <- function(x, y) abs(as.numeric(x-y))

f <- function(r) {
	s <- species[r, ]
	out <- by(temp, temp$station_id, function(x) {
		imin <- which.min(mydist(x$dt, s$dt))
		data.frame(Species = s$Species, Date = s$dt,
			station_id = x[imin, "station_id"], value = x[imin, "value"])
	})
	do.call(rbind, out)
}

do.call(rbind, lapply(rownames(species), f))

Result of last line is:

   Species                Date station_id value
1 SpeciesB (06/23/08 13:55:11)        ANH  2.25
2 SpeciesA (06/23/08 13:43:11)        ANH  2.25
3 SpeciesC (06/23/08 13:55:11)        ANH  2.25
4 SpeciesB (06/23/08 13:55:11)        BDT  3.82
5 SpeciesA (06/23/08 13:43:11)        BDT  3.90
6 SpeciesC (06/23/08 13:55:11)        BDT  3.82



On Mon, Mar 16, 2009 at 7:41 PM, Jonathan Greenberg
<greenberg at ucdavis.edu> wrote:
#
In thinking about this some more an SQL solution would be a
bit easier.  Try this where species and temp are from
the prior post (i.e. after processing with chron).

The first statement finds the minimum distances for
each species and station_id combination.  The
second finds all combinations of a record in species
with a record in temp, i.e. outer join.  The last sql
statement merges those two over Species, station_id
and dist.  Note that value is a keyword in sql so
its name is automatically transformed to value__1 to
avoid a conflict.  At the end we fix up the dt column
so its chron again.

library(sqldf)

mins <- sqldf("select s.Species, s.dt, t.station_id, min(abs(s.dt - t.dt)) dist
	from species2 s, temp2 t group by species, station_id")

every <- sqldf("select s.Species, s.dt, t.station_id, t.value__1,
	abs(s.dt - t.dt) dist
	from species2 s, temp2 t")

merged <- sqldf("select Species, e.dt, station_id, value__1
	from mins join every e
	using(Species, station_id, dist)")

merged <- transform(merged, dt = chron(dt))


On Tue, Mar 17, 2009 at 8:13 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote: