How to speed up grouping time series, help please
On Mon, Apr 4, 2011 at 8:49 AM, Den Alpin <den.alpin at gmail.com> wrote:
I retrieve for a few hundred times a group of time series (10-15 ts with 10000 values each), on every group I do some calculation, graphs etc. I wonder if there is a faster method than what presented below to get an appropriate timeseries object. Making a query with RODBC for every group I get a data frame like this:
X
? ID ? ? ? ? ? ? ? ?DATE ? ? VALUE
14 ?3 2000-01-01 00:00:03 0.5726334
4 ? 1 2000-01-01 00:00:03 0.8830174
1 ? 1 2000-01-01 00:00:00 0.2875775
15 ?3 2000-01-01 00:00:04 0.1029247
11 ?3 2000-01-01 00:00:00 0.9568333
9 ? 2 2000-01-01 00:00:03 0.5514350
7 ? 2 2000-01-01 00:00:01 0.5281055
6 ? 2 2000-01-01 00:00:00 0.0455565
12 ?3 2000-01-01 00:00:01 0.4533342
8 ? 2 2000-01-01 00:00:02 0.8924190
3 ? 1 2000-01-01 00:00:02 0.4089769
13 ?3 2000-01-01 00:00:02 0.6775706
And I want to get a timeSeries object or xts object like this:
? ? ? ? ? ? ? ? ? ? ? ? ? ?1 ? ? ? ? 2 ? ? ? ? 3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01 ? ? ? ?NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04 ? ? ? ?NA ? ? ? ?NA 0.1029247
Input data can be sorted or unsorted (the most complicated case is in
the example, unsorted and missing data) in the sense that I can sort
in query if I can take an advantage from this.
Some considerations:
- Xts is generally faster than timeSeries
- both accept a matrix so if I can create a matrix like the one
represented above and an array of characters representing dates faster
than what possible with xts:::cbind, for examole,I will have a faster
implementation (package data.table ?).
- create timeseries objects in multithread and then merge (package plyr ?)
- faster merge algorithms?
Below some code to generate the test case above:
set.seed(123)
N <- 5 # number of observations
K <- 3 # number of timeseries ID
X <- data.frame(
?ID = rep(1:K, each = N),
?DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),
?VALUE = runif(N*K), stringsAsFactors = FALSE)
X <- X[sample(1:(N*K), N*K),] # sample observations to get random
order (optional)
X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),] # 20% missing
head(X, 15)
# use explicitly environments to avoid '<<-'
buildTimeSeriesFromDataFrame <- function(x, env)
{
?{
? ?if(exists("xx", envir = env)) # if exist variable xx in env cbind
? ? ?assign("xx",
? ? ? ?cbind(get("xx", env), timeSeries(x$VALUE, x$DATE,
? ? ? ? ?format = '%Y-%m-%d %H:%M:%S',
? ? ? ? ?zone = 'GMT', units = as.character(x$ID[1]))),
? ? ? ?envir = env)
? ?else ?# create xx in env
? ? ?assign("xx",
? ? ? ?timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S',
? ? ? ? ?zone = 'GMT', units = as.character(x$ID[1])),
? ? ? ?envir = env)
? ?return(TRUE)
?}
}
# use package plyr, faster than 'by' function
tsDaply <- function(...)
{
?library(plyr)
?e1 <- new.env(parent = baseenv()) #create a new env
?res <- daply(X, "ID", buildTimeSeriesFromDataFrame,
? ? ?env = e1)
?return(get("xx", e1)) # return xx from env
}
##replicate 100 times
#Time03 <- replicate(100,
# ?system.time(tsDaply(X, X$ID))[[1]])
#median(Time03)
# result
tsDaply(X, X$ID)
Haven't checked how fast it is but using read.zoo its just one line of code to produce the required matrix: # set up input data frame, DF Lines <- "ID,DATE,VALUE 3,2000-01-01 00:00:03,0.5726334 1,2000-01-01 00:00:03,0.8830174 1,2000-01-01 00:00:00,0.2875775 3,2000-01-01 00:00:04,0.1029247 3,2000-01-01 00:00:00,0.9568333 2,2000-01-01 00:00:03,0.5514350 2,2000-01-01 00:00:01,0.5281055 2,2000-01-01 00:00:00,0.0455565 3,2000-01-01 00:00:01,0.4533342 2,2000-01-01 00:00:02,0.8924190 1,2000-01-01 00:00:02,0.4089769 3,2000-01-01 00:00:02,0.6775706" DF <- read.table(textConnection(Lines), header = TRUE, sep = ",") # create zoo matrix library(zoo) z <- read.zoo(DF, split = 1, index = 2, tz = "") The last line gives:
z
1 2 3 2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333 2000-01-01 00:00:01 NA 0.5281055 0.4533342 2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706 2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334 2000-01-01 00:00:04 NA NA 0.1029247
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com