How to speed up grouping time series, help please

Mon, Apr 4, 2011 8:20 AM

I did some tests on Your and Gabor solutions, below my findings:

- Your solution is fast as my solution in xts (below) but MUCH MORE
READABLE, in particular I think your test should take into account xts
creation from the data.frame (see below);
- Gabor's solution with read.zoo is fast as xts but give an xts object
that has some problems with time zones.

Any better idea to speed up grouping time series?

Thanks!

Below few line of codes to test (I suggest to grow X size to get
better comparison results):
xtsSplit <- function(x)
{
  x <- xts(x[,c("ID","VALUE")], as.POSIXct(x[,"DATE"]))
  x <- do.call(merge, split(x$VALUE,x$ID))
  return(x)
}

xtsSplitTime <- replicate(100,
  system.time(xtsSplit(X))[[1]])
median(xtsTime)

zooReadTime <- replicate(100,
 system.time(z <- read.zoo(X, split = 1, index = 2, tz = ""))[[1]])
median(zooReadTime)

And my (unreadable) solution:
library(xts)
buildXtsFromDataFrame <- function(x, env)
{
  {
    if(exists("xx", envir = env))
    {
      VALUE <- as.matrix(x$VALUE)
      colnames(VALUE) <- as.character(x$ID[1])
      assign("xx",
        cbind(get("xx", env), xts(VALUE,
          as.POSIXct(x$DATE, tz = "GMT",
            format = '%Y-%m-%d %H:%M:%S'),
          tzone = 'GMT')),
        envir = env)
    } else
    {
      VALUE <- as.matrix(x$VALUE)
      colnames(VALUE) <- as.character(x$ID[1])
      assign("xx",
        xts(VALUE, as.POSIXct(x$DATE, tz = "GMT",
            format = '%Y-%m-%d %H:%M:%S'),
          tzone = 'GMT'),
        envir = env)
    }
    return(TRUE)
  }
}

xtsDaply <- function(...)
{
  e1 <- new.env(parent = baseenv())
  res <- daply(X, "ID", buildXtsFromDataFrame,
      env = e1)
  return(get("xx", e1))
}

Time04 <- replicate(100,
  system.time(xtsDaply(X, X$ID))[[1]])




2011/4/4 Joshua Ulrich <josh.m.ulrich at gmail.com>:

Hi Dan,

On Mon, Apr 4, 2011 at 7:49 AM, Den Alpin <den.alpin at gmail.com> wrote:

I retrieve for a few hundred times a group of time series (10-15 ts
with 10000 values each), on every group I do some calculation, graphs
etc. I wonder if there is a faster method than what presented below to
get an appropriate timeseries object.

Making a query with RODBC for every group I get a data frame like this:

? ID ? ? ? ? ? ? ? ?DATE ? ? VALUE
14 ?3 2000-01-01 00:00:03 0.5726334
4 ? 1 2000-01-01 00:00:03 0.8830174
1 ? 1 2000-01-01 00:00:00 0.2875775
15 ?3 2000-01-01 00:00:04 0.1029247
11 ?3 2000-01-01 00:00:00 0.9568333
9 ? 2 2000-01-01 00:00:03 0.5514350
7 ? 2 2000-01-01 00:00:01 0.5281055
6 ? 2 2000-01-01 00:00:00 0.0455565
12 ?3 2000-01-01 00:00:01 0.4533342
8 ? 2 2000-01-01 00:00:02 0.8924190
3 ? 1 2000-01-01 00:00:02 0.4089769
13 ?3 2000-01-01 00:00:02 0.6775706

And I want to get a timeSeries object or xts object like this:

? ? ? ? ? ? ? ? ? ? ? ? ? ?1 ? ? ? ? 2 ? ? ? ? 3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01 ? ? ? ?NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04 ? ? ? ?NA ? ? ? ?NA 0.1029247


Input data can be sorted or unsorted (the most complicated case is in
the example, unsorted and missing data) in the sense that I can sort
in query if I can take an advantage from this.

Some considerations:
- Xts is generally faster than timeSeries
- both accept a matrix so if I can create a matrix like the one
represented above and an array of characters representing dates faster
than what possible with xts:::cbind, for examole,I will have a faster
implementation (package data.table ?).
- create timeseries objects in multithread and then merge (package plyr ?)
- faster merge algorithms?

Below some code to generate the test case above:


set.seed(123)
N <- 5 # number of observations
K <- 3 # number of timeseries ID

X <- data.frame(
?ID = rep(1:K, each = N),
?DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),
?VALUE = runif(N*K), stringsAsFactors = FALSE)

X <- X[sample(1:(N*K), N*K),] # sample observations to get random
order (optional)
X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),] # 20% missing

head(X, 15)

# use explicitly environments to avoid '<<-'
buildTimeSeriesFromDataFrame <- function(x, env)
{
?{
? ?if(exists("xx", envir = env)) # if exist variable xx in env cbind
? ? ?assign("xx",
? ? ? ?cbind(get("xx", env), timeSeries(x$VALUE, x$DATE,
? ? ? ? ?format = '%Y-%m-%d %H:%M:%S',
? ? ? ? ?zone = 'GMT', units = as.character(x$ID[1]))),
? ? ? ?envir = env)
? ?else ?# create xx in env
? ? ?assign("xx",
? ? ? ?timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S',
? ? ? ? ?zone = 'GMT', units = as.character(x$ID[1])),
? ? ? ?envir = env)

? ?return(TRUE)
?}
}

# use package plyr, faster than 'by' function
tsDaply <- function(...)
{
?library(plyr)
?e1 <- new.env(parent = baseenv()) #create a new env
?res <- daply(X, "ID", buildTimeSeriesFromDataFrame,
? ? ?env = e1)
?return(get("xx", e1)) # return xx from env
}

##replicate 100 times
#Time03 <- replicate(100,
# ?system.time(tsDaply(X, X$ID))[[1]])
#median(Time03)

# result
tsDaply(X, X$ID)


Thanks in advance for any input, best regards,
Den

Here's how I would do it with xts:

x <- xts(X[,c("ID","VALUE")], as.POSIXct(X[,"DATE"]))
do.call(merge, split(x$VALUE,x$ID))

My xts solution compares favorably to your solution:

Time03 <- replicate(100,

+ ? system.time(tsDaply(X, X$ID))[[1]])

median(Time03)

[1] 0.02

xtsTime <- replicate(100,

+ ? system.time(do.call(merge, split(x$VALUE,x$ID)))[[1]])

median(xtsTime)

[1] 0

Best,
--
Joshua Ulrich ?| ?FOSS Trading: www.fosstrading.com

How to speed up grouping time series, help please

Thread (7 messages)