Skip to content

How to speed up grouping time series, help please

7 messages · Joshua Ulrich, Gabor Grothendieck, Den Alpin

#
I retrieve for a few hundred times a group of time series (10-15 ts
with 10000 values each), on every group I do some calculation, graphs
etc. I wonder if there is a faster method than what presented below to
get an appropriate timeseries object.

Making a query with RODBC for every group I get a data frame like this:
ID                DATE     VALUE
14  3 2000-01-01 00:00:03 0.5726334
4   1 2000-01-01 00:00:03 0.8830174
1   1 2000-01-01 00:00:00 0.2875775
15  3 2000-01-01 00:00:04 0.1029247
11  3 2000-01-01 00:00:00 0.9568333
9   2 2000-01-01 00:00:03 0.5514350
7   2 2000-01-01 00:00:01 0.5281055
6   2 2000-01-01 00:00:00 0.0455565
12  3 2000-01-01 00:00:01 0.4533342
8   2 2000-01-01 00:00:02 0.8924190
3   1 2000-01-01 00:00:02 0.4089769
13  3 2000-01-01 00:00:02 0.6775706

And I want to get a timeSeries object or xts object like this:

                            1         2         3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01        NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04        NA        NA 0.1029247


Input data can be sorted or unsorted (the most complicated case is in
the example, unsorted and missing data) in the sense that I can sort
in query if I can take an advantage from this.

Some considerations:
- Xts is generally faster than timeSeries
- both accept a matrix so if I can create a matrix like the one
represented above and an array of characters representing dates faster
than what possible with xts:::cbind, for examole,I will have a faster
implementation (package data.table ?).
- create timeseries objects in multithread and then merge (package plyr ?)
- faster merge algorithms?

Below some code to generate the test case above:


set.seed(123)
N <- 5 # number of observations
K <- 3 # number of timeseries ID

X <- data.frame(
  ID = rep(1:K, each = N),
  DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),
  VALUE = runif(N*K), stringsAsFactors = FALSE)

X <- X[sample(1:(N*K), N*K),] # sample observations to get random
order (optional)
X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),] # 20% missing

head(X, 15)

# use explicitly environments to avoid '<<-'
buildTimeSeriesFromDataFrame <- function(x, env)
{
  {
    if(exists("xx", envir = env)) # if exist variable xx in env cbind
      assign("xx",
        cbind(get("xx", env), timeSeries(x$VALUE, x$DATE,
          format = '%Y-%m-%d %H:%M:%S',
          zone = 'GMT', units = as.character(x$ID[1]))),
        envir = env)
    else  # create xx in env
      assign("xx",
        timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S',
          zone = 'GMT', units = as.character(x$ID[1])),
        envir = env)

    return(TRUE)
  }
}

# use package plyr, faster than 'by' function
tsDaply <- function(...)
{
  library(plyr)
  e1 <- new.env(parent = baseenv()) #create a new env
  res <- daply(X, "ID", buildTimeSeriesFromDataFrame,
      env = e1)
  return(get("xx", e1)) # return xx from env
}

##replicate 100 times
#Time03 <- replicate(100,
#  system.time(tsDaply(X, X$ID))[[1]])
#median(Time03)

# result
tsDaply(X, X$ID)


Thanks in advance for any input, best regards,
Den
#
Hi Dan,
On Mon, Apr 4, 2011 at 7:49 AM, Den Alpin <den.alpin at gmail.com> wrote:
Here's how I would do it with xts:

x <- xts(X[,c("ID","VALUE")], as.POSIXct(X[,"DATE"]))
do.call(merge, split(x$VALUE,x$ID))

My xts solution compares favorably to your solution:
+   system.time(tsDaply(X, X$ID))[[1]])
[1] 0.02
+   system.time(do.call(merge, split(x$VALUE,x$ID)))[[1]])
[1] 0

Best,
--
Joshua Ulrich  |  FOSS Trading: www.fosstrading.com
#
On Mon, Apr 4, 2011 at 8:49 AM, Den Alpin <den.alpin at gmail.com> wrote:
Haven't checked how fast it is but using read.zoo its just one line of
code to produce the required matrix:

# set up input data frame, DF

Lines <- "ID,DATE,VALUE
3,2000-01-01 00:00:03,0.5726334
1,2000-01-01 00:00:03,0.8830174
1,2000-01-01 00:00:00,0.2875775
3,2000-01-01 00:00:04,0.1029247
3,2000-01-01 00:00:00,0.9568333
2,2000-01-01 00:00:03,0.5514350
2,2000-01-01 00:00:01,0.5281055
2,2000-01-01 00:00:00,0.0455565
3,2000-01-01 00:00:01,0.4533342
2,2000-01-01 00:00:02,0.8924190
1,2000-01-01 00:00:02,0.4089769
3,2000-01-01 00:00:02,0.6775706"
DF <- read.table(textConnection(Lines), header = TRUE, sep = ",")

# create zoo matrix

library(zoo)
z <- read.zoo(DF, split = 1, index = 2, tz = "")

The last line gives:
1         2         3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01        NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04        NA        NA 0.1029247
#
I did some tests on Your and Gabor solutions, below my findings:

- Your solution is fast as my solution in xts (below) but MUCH MORE
READABLE, in particular I think your test should take into account xts
creation from the data.frame (see below);
- Gabor's solution with read.zoo is fast as xts but give an xts object
that has some problems with time zones.

Any better idea to speed up grouping time series?

Thanks!

Below few line of codes to test (I suggest to grow X size to get
better comparison results):
xtsSplit <- function(x)
{
  x <- xts(x[,c("ID","VALUE")], as.POSIXct(x[,"DATE"]))
  x <- do.call(merge, split(x$VALUE,x$ID))
  return(x)
}

xtsSplitTime <- replicate(100,
  system.time(xtsSplit(X))[[1]])
median(xtsTime)

zooReadTime <- replicate(100,
 system.time(z <- read.zoo(X, split = 1, index = 2, tz = ""))[[1]])
median(zooReadTime)

And my (unreadable) solution:
library(xts)
buildXtsFromDataFrame <- function(x, env)
{
  {
    if(exists("xx", envir = env))
    {
      VALUE <- as.matrix(x$VALUE)
      colnames(VALUE) <- as.character(x$ID[1])
      assign("xx",
        cbind(get("xx", env), xts(VALUE,
          as.POSIXct(x$DATE, tz = "GMT",
            format = '%Y-%m-%d %H:%M:%S'),
          tzone = 'GMT')),
        envir = env)
    } else
    {
      VALUE <- as.matrix(x$VALUE)
      colnames(VALUE) <- as.character(x$ID[1])
      assign("xx",
        xts(VALUE, as.POSIXct(x$DATE, tz = "GMT",
            format = '%Y-%m-%d %H:%M:%S'),
          tzone = 'GMT'),
        envir = env)
    }
    return(TRUE)
  }
}

xtsDaply <- function(...)
{
  e1 <- new.env(parent = baseenv())
  res <- daply(X, "ID", buildXtsFromDataFrame,
      env = e1)
  return(get("xx", e1))
}

Time04 <- replicate(100,
  system.time(xtsDaply(X, X$ID))[[1]])




2011/4/4 Joshua Ulrich <josh.m.ulrich at gmail.com>:
#
On Mon, Apr 4, 2011 at 11:20 AM, Den Alpin <den.alpin at gmail.com> wrote:
read.zoo gives a zoo object, not an xts object.  If you want an xts
object  try as.xts(z). If you were expecting an xts object that may be
the source of your other problems.
2 days later
#
I found a faster implementation (by an order of magnitude from my
tests) than the one using xts, split, merge (from Joshua).
I report the two fastest solution below with code to generate a test
case; some work still to be done for columns order and naming,
Test case has grown from my previous post to get a more realistic timing.

Any comment or idea to further speed up multivariate time series
creation with classes xts or timeSeries starting from a data.frame
like the one reported here is welcome.

Best regards,
Den


a data.frame example (code below to generate it)

  ID                DATE     VALUE
14  3 2000-01-01 00:00:03 0.5726334
4   1 2000-01-01 00:00:03 0.8830174
1   1 2000-01-01 00:00:00 0.2875775
15  3 2000-01-01 00:00:04 0.1029247
11  3 2000-01-01 00:00:00 0.9568333
9   2 2000-01-01 00:00:03 0.5514350
7   2 2000-01-01 00:00:01 0.5281055
6   2 2000-01-01 00:00:00 0.0455565
12  3 2000-01-01 00:00:01 0.4533342
8   2 2000-01-01 00:00:02 0.8924190
3   1 2000-01-01 00:00:02 0.4089769
13  3 2000-01-01 00:00:02 0.6775706

And I want to get a timeSeries object or xts object like this:

                           1         2         3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01        NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04        NA        NA 0.1029247

# CODE:

set.seed(123)
# set N to 5 to reproduce above data.frame
N <- 1000
# set K to 3 to reproduce above data.frame
K <- 10
X <- data.frame(
  ID = rep(1:K, each = N),
  DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),
  VALUE = runif(N*K), stringsAsFactors = FALSE)
X <- X[sample(1:(N*K), N*K),]
X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),]
str(X)


xtsSplit <- function(x)
{
  library(xts)
  x <- xts(x[,c("ID","VALUE")], as.POSIXct(x[,"DATE"]))
  return(do.call(merge, split(x$VALUE,x$ID)))
}
xtsSplitTime <- replicate(50,
  system.time(xtsSplit(X))[[1]])
median(xtsSplitTime)

xtsReshape <- function(x)
{
  library(xts)
  x <- reshape(x, idvar = "DATE", timevar = "ID", direction = "wide")
  x <- xts(x[,-1], as.POSIXct(x[,1]))
  return(x)
}
xtsReshapeTime <- replicate(50,
  system.time(xtsReshape(X))[[1]])
median(xtsReshapeTime)
#
I found a faster implementation (by an order of magnitude from my
tests) than the one using xts, split, merge (from Joshua).
I report the two fastest solution below with code to generate a test
case; some work still to be done for columns order and naming,
Test case has grown from my previous post to get a more realistic timing.

Any comment or idea to further speed up multivariate time series
creation with classes xts or timeSeries starting from a data.frame
like the one reported here is welcome.

Best regards,
Den


a data.frame example (code below to generate it)

?ID ? ? ? ? ? ? ? ?DATE ? ? VALUE
14 ?3 2000-01-01 00:00:03 0.5726334
4 ? 1 2000-01-01 00:00:03 0.8830174
1 ? 1 2000-01-01 00:00:00 0.2875775
15 ?3 2000-01-01 00:00:04 0.1029247
11 ?3 2000-01-01 00:00:00 0.9568333
9 ? 2 2000-01-01 00:00:03 0.5514350
7 ? 2 2000-01-01 00:00:01 0.5281055
6 ? 2 2000-01-01 00:00:00 0.0455565
12 ?3 2000-01-01 00:00:01 0.4533342
8 ? 2 2000-01-01 00:00:02 0.8924190
3 ? 1 2000-01-01 00:00:02 0.4089769
13 ?3 2000-01-01 00:00:02 0.6775706

And I want to get a timeSeries object or xts object like this:

? ? ? ? ? ? ? ? ? ? ? ? ? 1 ? ? ? ? 2 ? ? ? ? 3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01 ? ? ? ?NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04 ? ? ? ?NA ? ? ? ?NA 0.1029247

# CODE:

set.seed(123)
# set N to 5 to reproduce above data.frame
N <- 1000
# set K to 3 to reproduce above data.frame
K <- 10
X <- data.frame(
?ID = rep(1:K, each = N),
?DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),
?VALUE = runif(N*K), stringsAsFactors = FALSE)
X <- X[sample(1:(N*K), N*K),]
X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),]
str(X)


xtsSplit <- function(x)
{
?library(xts)
?x <- xts(x[,c("ID","VALUE")], as.POSIXct(x[,"DATE"]))
?return(do.call(merge, split(x$VALUE,x$ID)))
}
xtsSplitTime <- replicate(50,
?system.time(xtsSplit(X))[[1]])
median(xtsSplitTime)

xtsReshape <- function(x)
{
?library(xts)
?x <- reshape(x, idvar = "DATE", timevar = "ID", direction = "wide")
?x <- xts(x[,-1], as.POSIXct(x[,1]))
?return(x)
}
xtsReshapeTime <- replicate(50,
?system.time(xtsReshape(X))[[1]])
median(xtsReshapeTime)