Aggregating Statistics By Time Interval

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-sig-finance/attachments/20070801/39d11a3c/attachment.pl
Something similar was just discussed this morning:
https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html
Hi all

I have a question about aggegating statistics by time intervals. I have a
data set with 3 columns : time, bid, and ask. Time is specified as a
millisecond timestamp since epoch. I would like to compute summary
statistics for the data set on an hourly basis. Here is what I have tried so
far:

# Data is in pricedata

t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
agg <- aggregate(pricedata$spread, list(byhour=format(t, "%Y-%m %H")), mean)

This seems to do what I want - however, what really want to do is more
specific: I would like to be able to extract a subset of the data frame
pricedata, and not just the aggregated entries - for instance, instead of
just extracting pricedata$spread by hour, I would like to extract a slice of
columns, e.g. pricedata$spread and pricedata$time on an hourly basis, and
pass these into a function that can compute a time-weighted average spread,
for instance. Does anyone know an elegant way to do this? I have a feeling
zoo may do what I want, but I'm new to zoo ...

Cheers
Rory

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-sig-finance/attachments/20070803/0c692623/attachment.pl
Can you provide a reproducible example that exhibits the warning.
Redoing it in a more easily reproducible way and using the data
in your post gives me no warning
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 1e-04))
twas <-
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

Here is input:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 1e-04))
twas <-
 function(dat) {
   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
 sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
R.version.string # XP
Hi

I've been wrestling with this a little bit, using the example in the email
that Gabor pointed me to as a reference, and I think I have almost got what
I want...however its still not quite right.

I have a variable, tmp, with two dimensions: time and spread:

head(tmp$time)
[1] 1185882786 1185882790 1185882791 1185882791 1185882792 1185882795

head(tmp$spread)
[1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04

I also have a function that calculates the time-weighted average spread:

twas
function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}

I can combine them using as rbind() and by():

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))

(epoch is just an instance of ISOdatetime)

This gives me a warning:

Warning message:
number of columns of result
       is not a multiple of vector length (arg 3) in: rbind(1, "12" = c(
91.99207541277, 8.00792458723005), "13" = c(90.1884966797708,

The output from the above command is almost exactly what I need, apart from
the recycling:

     1e-04     2e-04      3e-04        4e-04
12 91.99208  8.007925 91.9920754  8.007924587 <== recycled values
13 90.18850  9.337448  0.4218405  0.052214551
14 90.59640  9.171417  0.2321811 90.596401668
15 89.55771 10.194291  0.2343418  0.013661453
...

I can just pass this into a barplot() and get a nice visual breakdown of
hourly weighted spreads, *but* I dont know how to get these results without
the recycling. Looking at rbind(), it seems that this will automatically
recycle. Does anyone know of a function I could use to get these results
without this problem?

Cheers
Rory

On 8/1/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Something similar was just discussed this morning:
https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html

On 8/1/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi all

I have a question about aggegating statistics by time intervals. I have
a
data set with 3 columns : time, bid, and ask. Time is specified as a
millisecond timestamp since epoch. I would like to compute summary
statistics for the data set on an hourly basis. Here is what I have
tried so
far:

# Data is in pricedata

t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
agg <- aggregate(pricedata$spread, list(byhour=format(t, "%Y-%m %H")),
mean)
This seems to do what I want - however, what really want to do is more
specific: I would like to be able to extract a subset of the data frame
pricedata, and not just the aggregated entries - for instance, instead
of
just extracting pricedata$spread by hour, I would like to extract a
slice of
columns, e.g. pricedata$spread and pricedata$time on an hourly basis,
and
pass these into a function that can compute a time-weighted average
spread,
for instance. Does anyone know an elegant way to do this? I have a
feeling
zoo may do what I want, but I'm new to zoo ...

Cheers
Rory

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-sig-finance/attachments/20070803/06744fb6/attachment.pl
I still get no warning.  Please provide complete self contained input
and output.
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
+  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+  2e-04, 3e-04))
twas <-
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"
Hi

I have figured out what causes the warning (and recycling), but I am not
sure how I can fix it. After seeing that it seemed to work for you, I went
back and tried working with different subsets of the data. I eventually
found where it occurs - when we get a third unique spread value. To
reproduce, just change the definition of tmp to be:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
 2e-04, 3e-04)) <== Added 3e-04

i.e. I have just changed one of the spread values to be a third value - this
seems to trigger the warning  "Warning message:number of columns of result
is not a multiple of vector length (arg 3) in: rbind", and the recycling. I
tried this on R 2.5.0 and 2.5.1

Can anyone see what I am doing wrong here?

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
Can you provide a reproducible example that exhibits the warning.
Redoing it in a more easily reproducible way and using the data
in your post gives me no warning

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 1e-04))
twas <-
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

Here is input:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 1e-04))
twas <-
function(dat) {
   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
R.version.string # XP

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi

I've been wrestling with this a little bit, using the example in the
email
that Gabor pointed me to as a reference, and I think I have almost got
what
I want...however its still not quite right.

I have a variable, tmp, with two dimensions: time and spread:

head(tmp$time)
[1] 1185882786 1185882790 1185882791 1185882791 1185882792 1185882795

head(tmp$spread)
[1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04

I also have a function that calculates the time-weighted average spread:

twas
function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}

I can combine them using as rbind() and by():

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))

(epoch is just an instance of ISOdatetime)

This gives me a warning:

Warning message:
number of columns of result
       is not a multiple of vector length (arg 3) in: rbind(1, "12" = c(
91.99207541277, 8.00792458723005), "13" = c(90.1884966797708,

The output from the above command is almost exactly what I need, apart
from
the recycling:

     1e-04     2e-04      3e-04        4e-04
12 91.99208  8.007925 91.9920754  8.007924587 <== recycled values
13 90.18850  9.337448  0.4218405  0.052214551
14 90.59640  9.171417  0.2321811 90.596401668
15 89.55771 10.194291  0.2343418  0.013661453
...

I can just pass this into a barplot() and get a nice visual breakdown of
hourly weighted spreads, *but* I dont know how to get these results
without
the recycling. Looking at rbind(), it seems that this will automatically
recycle. Does anyone know of a function I could use to get these results
without this problem?

Cheers
Rory

On 8/1/07, Gabor Grothendieck <ggrothendieck at gmail.com > wrote:
Something similar was just discussed this morning:

https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html

On 8/1/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi all

I have a question about aggegating statistics by time intervals. I
have
a
data set with 3 columns : time, bid, and ask. Time is specified as a
millisecond timestamp since epoch. I would like to compute summary
statistics for the data set on an hourly basis. Here is what I have
tried so
far:

# Data is in pricedata

t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
agg <- aggregate(pricedata$spread, list(byhour=format(t, "%Y-%m
%H")),
mean)
This seems to do what I want - however, what really want to do is
more
specific: I would like to be able to extract a subset of the data
frame
pricedata, and not just the aggregated entries - for instance,
instead
of
just extracting pricedata$spread by hour, I would like to extract a
slice of
columns, e.g. pricedata$spread and pricedata$time on an hourly
basis,
and
pass these into a function that can compute a time-weighted average
spread,
for instance. Does anyone know an elegant way to do this? I have a
feeling
zoo may do what I want, but I'm new to zoo ...

Cheers
Rory

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-sig-finance/attachments/20070803/d0ecbd0e/attachment.pl
Try producing it in "long" format using aggregate and then reshaping
it into "wide" format using xtabs, reshape or the reshape package:

twas <- function(x) {
	y <- data.frame(timediff = diff(x$time), head(x, -1))
	aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# any one of these three:

xtabs(timediff ~., long)

reshape(long, dir = "wide", timevar = "spread", idvar = "hour")

library(reshape)
cast(melt(long, id = 1:2), hour ~ spread)
Hi

Sorry, I'm not sure what happened with that last one. Here is a fully
contained example (sorry about the line length if this doesnt wrap).

tmp <- data.frame(
time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),
spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
)

twas <- function (dat)
{
    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100)
}

now <- Sys.time()
epoch <- now - as.numeric(now)

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
I still get no warning.  Please provide complete self contained input
and output.

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+  2e-04, 3e-04))
twas <-
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi

I have figured out what causes the warning (and recycling), but I am not
sure how I can fix it. After seeing that it seemed to work for you, I
went
back and tried working with different subsets of the data. I eventually
found where it occurs - when we get a third unique spread value. To
reproduce, just change the definition of tmp to be:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
 2e-04, 3e-04)) <== Added 3e-04

i.e. I have just changed one of the spread values to be a third value -
this
seems to trigger the warning  "Warning message:number of columns of
result
is not a multiple of vector length (arg 3) in: rbind", and the
recycling. I
tried this on R 2.5.0 and 2.5.1

Can anyone see what I am doing wrong here?

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
Can you provide a reproducible example that exhibits the warning.
Redoing it in a more easily reproducible way and using the data
in your post gives me no warning

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 1e-04))
twas <-
+  function(dat) {
+     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

Here is input:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 1e-04))
twas <-
function(dat) {
   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
R.version.string # XP

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi

I've been wrestling with this a little bit, using the example in the
email
that Gabor pointed me to as a reference, and I think I have almost
got
what
I want...however its still not quite right.

I have a variable, tmp, with two dimensions: time and spread:

head(tmp$time)
[1] 1185882786 1185882790 1185882791 1185882791 1185882792
1185882795

head(tmp$spread)
[1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04

I also have a function that calculates the time-weighted average
spread:

twas
function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}

I can combine them using as rbind() and by():

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))

(epoch is just an instance of ISOdatetime)

This gives me a warning:

Warning message:
number of columns of result
       is not a multiple of vector length (arg 3) in: rbind(1, "12"
= c(
91.99207541277 , 8.00792458723005), "13" = c(90.1884966797708,

The output from the above command is almost exactly what I need,
apart
from
the recycling:

     1e-04     2e-04      3e-04        4e-04
12 91.99208  8.007925 91.9920754  8.007924587 <== recycled values
13 90.18850  9.337448  0.4218405  0.052214551
14 90.59640  9.171417  0.2321811 90.596401668
15 89.55771 10.194291  0.2343418  0.013661453
...

I can just pass this into a barplot() and get a nice visual
breakdown of
hourly weighted spreads, *but* I dont know how to get these results
without
the recycling. Looking at rbind(), it seems that this will
automatically
recycle. Does anyone know of a function I could use to get these
results
without this problem?

Cheers
Rory

On 8/1/07, Gabor Grothendieck < ggrothendieck at gmail.com > wrote:
Something similar was just discussed this morning:

https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html

On 8/1/07, Rory Winston <rory.winston at gmail.com > wrote:
Hi all

I have a question about aggegating statistics by time intervals.
I
have
a
data set with 3 columns : time, bid, and ask. Time is specified
as a
millisecond timestamp since epoch. I would like to compute
summary
statistics for the data set on an hourly basis. Here is what I
have
tried so
far:

# Data is in pricedata

t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
agg <- aggregate(pricedata$spread, list(byhour=format(t, "%Y-%m
%H")),
mean)
This seems to do what I want - however, what really want to do
is
more
specific: I would like to be able to extract a subset of the
data
frame
pricedata, and not just the aggregated entries - for instance,
instead
of
just extracting pricedata$spread by hour, I would like to
extract a
slice of
columns, e.g. pricedata$spread and pricedata$time on an hourly
basis,
and
pass these into a function that can compute a time-weighted
average
spread,
for instance. Does anyone know an elegant way to do this? I have
a
feeling
zoo may do what I want, but I'm new to zoo ...

Cheers
Rory

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

I had omitted fmt and epoch.  tmp is as in your post.

twas <- function(x) {
	y <- data.frame(timediff = diff(x$time), head(x, -1))
	aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
fmt <- function(x) format(epoch + x, "%H")
tmp2 <- cbind(tmp, hour = fmt(tmp$time))

z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# three alternatives

# 1
xtabs(timediff ~., z)

# 2
reshape(z, dir = "wide", timevar = "spread", idvar = "hour")

# 3
library(reshape)
cast(melt(z, id = 1:2), hour ~ spread)
Try producing it in "long" format using aggregate and then reshaping
it into "wide" format using xtabs, reshape or the reshape package:

twas <- function(x) {
       y <- data.frame(timediff = diff(x$time), head(x, -1))
       aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# any one of these three:

xtabs(timediff ~., long)

reshape(long, dir = "wide", timevar = "spread", idvar = "hour")

library(reshape)
cast(melt(long, id = 1:2), hour ~ spread)

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi

Sorry, I'm not sure what happened with that last one. Here is a fully
contained example (sorry about the line length if this doesnt wrap).

tmp <- data.frame(
time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),
spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
)

twas <- function (dat)
{
    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100)
}

now <- Sys.time()
epoch <- now - as.numeric(now)

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
I still get no warning.  Please provide complete self contained input
and output.

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+  2e-04, 3e-04))
twas <-
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi

I have figured out what causes the warning (and recycling), but I am not
sure how I can fix it. After seeing that it seemed to work for you, I
went
back and tried working with different subsets of the data. I eventually
found where it occurs - when we get a third unique spread value. To
reproduce, just change the definition of tmp to be:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
 2e-04, 3e-04)) <== Added 3e-04

i.e. I have just changed one of the spread values to be a third value -
this
seems to trigger the warning  "Warning message:number of columns of
result
is not a multiple of vector length (arg 3) in: rbind", and the
recycling. I
tried this on R 2.5.0 and 2.5.1

Can anyone see what I am doing wrong here?

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
Can you provide a reproducible example that exhibits the warning.
Redoing it in a more easily reproducible way and using the data
in your post gives me no warning

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 1e-04))
twas <-
+  function(dat) {
+     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

Here is input:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 1e-04))
twas <-
function(dat) {
   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
R.version.string # XP

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi

I've been wrestling with this a little bit, using the example in the
email
that Gabor pointed me to as a reference, and I think I have almost
got
what
I want...however its still not quite right.

I have a variable, tmp, with two dimensions: time and spread:

head(tmp$time)
[1] 1185882786 1185882790 1185882791 1185882791 1185882792
1185882795

head(tmp$spread)
[1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04

I also have a function that calculates the time-weighted average
spread:

twas
function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}

I can combine them using as rbind() and by():

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))

(epoch is just an instance of ISOdatetime)

This gives me a warning:

Warning message:
number of columns of result
       is not a multiple of vector length (arg 3) in: rbind(1, "12"
= c(
91.99207541277 , 8.00792458723005), "13" = c(90.1884966797708,

The output from the above command is almost exactly what I need,
apart
from
the recycling:

     1e-04     2e-04      3e-04        4e-04
12 91.99208  8.007925 91.9920754  8.007924587 <== recycled values
13 90.18850  9.337448  0.4218405  0.052214551
14 90.59640  9.171417  0.2321811 90.596401668
15 89.55771 10.194291  0.2343418  0.013661453
...

I can just pass this into a barplot() and get a nice visual
breakdown of
hourly weighted spreads, *but* I dont know how to get these results
without
the recycling. Looking at rbind(), it seems that this will
automatically
recycle. Does anyone know of a function I could use to get these
results
without this problem?

Cheers
Rory

On 8/1/07, Gabor Grothendieck < ggrothendieck at gmail.com > wrote:
Something similar was just discussed this morning:

https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html

On 8/1/07, Rory Winston <rory.winston at gmail.com > wrote:
Hi all

I have a question about aggegating statistics by time intervals.
I
have
a
data set with 3 columns : time, bid, and ask. Time is specified
as a
millisecond timestamp since epoch. I would like to compute
summary
statistics for the data set on an hourly basis. Here is what I
have
tried so
far:

# Data is in pricedata

t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
agg <- aggregate(pricedata$spread, list(byhour=format(t, "%Y-%m
%H")),
mean)
This seems to do what I want - however, what really want to do
is
more
specific: I would like to be able to extract a subset of the
data
frame
pricedata, and not just the aggregated entries - for instance,
instead
of
just extracting pricedata$spread by hour, I would like to
extract a
slice of
columns, e.g. pricedata$spread and pricedata$time on an hourly
basis,
and
pass these into a function that can compute a time-weighted
average
spread,
for instance. Does anyone know an elegant way to do this? I have
a
feeling
zoo may do what I want, but I'm new to zoo ...

Cheers
Rory

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-sig-finance/attachments/20070803/16411395/attachment.pl
The different invocations of twas were creating data frames of different
numbers of columns because different hours had different numbers
of spreads.  The warning came when it tried to rbind together
data.frames with different numbers of columns.
Wow....thats great. Thank you very much! I appreciate the help greatly. I
dont quite understand what the issue was though....was it that the data
frame returned from my initial twas() function was of the wrong order?

Cheers
Rory

On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
I had omitted fmt and epoch.  tmp is as in your post.

twas <- function(x) {
        y <- data.frame(timediff = diff(x$time), head(x, -1))
        aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
fmt <- function(x) format(epoch + x, "%H")
tmp2 <- cbind(tmp, hour = fmt(tmp$time))

z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# three alternatives

# 1
xtabs(timediff ~., z)

# 2
reshape(z, dir = "wide", timevar = "spread", idvar = "hour")

# 3
library(reshape)
cast(melt(z, id = 1:2), hour ~ spread)

On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Try producing it in "long" format using aggregate and then reshaping
it into "wide" format using xtabs, reshape or the reshape package:

twas <- function(x) {
       y <- data.frame(timediff = diff(x$time), head(x, -1))
       aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# any one of these three:

xtabs(timediff ~., long)

reshape(long, dir = "wide", timevar = "spread", idvar = "hour")

library(reshape)
cast(melt(long, id = 1:2), hour ~ spread)

On 8/3/07, Rory Winston < rory.winston at gmail.com> wrote:
Hi

Sorry, I'm not sure what happened with that last one. Here is a fully
contained example (sorry about the line length if this doesnt wrap).

tmp <- data.frame(

time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),

spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
)

twas <- function (dat)
{
    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100)
}

now <- Sys.time()
epoch <- now - as.numeric(now)

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
I still get no warning.  Please provide complete self contained
input
and output.

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+  2e-04, 3e-04))
twas <-
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi

I have figured out what causes the warning (and recycling), but I
am not
sure how I can fix it. After seeing that it seemed to work for
you, I
went
back and tried working with different subsets of the data. I
eventually
found where it occurs - when we get a third unique spread value.
To
reproduce, just change the definition of tmp to be:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
 2e-04, 3e-04)) <== Added 3e-04

i.e. I have just changed one of the spread values to be a third
value -
this
seems to trigger the warning  "Warning message:number of columns
of
result
is not a multiple of vector length (arg 3) in: rbind", and the
recycling. I
tried this on R 2.5.0 and 2.5.1

Can anyone see what I am doing wrong here?

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
Can you provide a reproducible example that exhibits the
warning.
Redoing it in a more easily reproducible way and using the data
in your post gives me no warning

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04,
1e-04,
+ 2e-04, 1e-04))
twas <-
+  function(dat) {
+     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

Here is input:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 1e-04))
twas <-
function(dat) {
   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time ()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
twas))
z
R.version.string # XP

On 8/3/07, Rory Winston <rory.winston at gmail.com > wrote:
Hi

I've been wrestling with this a little bit, using the example
in the
email
that Gabor pointed me to as a reference, and I think I have
almost
got
what
I want...however its still not quite right.

I have a variable, tmp, with two dimensions: time and spread:

head(tmp$time)
[1] 1185882786 1185882790 1185882791 1185882791 1185882792
1185882795

head(tmp$spread)
[1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04

I also have a function that calculates the time-weighted
average
spread:

twas
function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}

I can combine them using as rbind() and by():

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
twas))
(epoch is just an instance of ISOdatetime)

This gives me a warning:

Warning message:
number of columns of result
       is not a multiple of vector length (arg 3) in: rbind(1,
"12"
= c(
91.99207541277 , 8.00792458723005), "13" = c(90.1884966797708,

The output from the above command is almost exactly what I
need,
apart
from
the recycling:

     1e-04     2e-04      3e-04        4e-04
12 91.99208  8.007925 91.9920754  8.007924587 <== recycled
values
13 90.18850  9.337448  0.4218405  0.052214551
14 90.59640  9.171417  0.2321811 90.596401668
15 89.55771 10.194291  0.2343418  0.013661453
...

I can just pass this into a barplot() and get a nice visual
breakdown of
hourly weighted spreads, *but* I dont know how to get these
results
without
the recycling. Looking at rbind(), it seems that this will
automatically
recycle. Does anyone know of a function I could use to get
these
results
without this problem?

Cheers
Rory

On 8/1/07, Gabor Grothendieck < ggrothendieck at gmail.com >
wrote:
Something similar was just discussed this morning:

https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html

On 8/1/07, Rory Winston <rory.winston at gmail.com > wrote:
Hi all

I have a question about aggegating statistics by time
intervals.
I
have
a
data set with 3 columns : time, bid, and ask. Time is
specified
as a
millisecond timestamp since epoch. I would like to compute
summary
statistics for the data set on an hourly basis. Here is
what I
have
tried so
far:

# Data is in pricedata

t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
agg <- aggregate(pricedata$spread, list(byhour=format(t,
"%Y-%m
%H")),
mean)
This seems to do what I want - however, what really want
to do
is
more
specific: I would like to be able to extract a subset of
the
data
frame
pricedata, and not just the aggregated entries - for
instance,
instead
of
just extracting pricedata$spread by hour, I would like to
extract a
slice of
columns, e.g. pricedata$spread and pricedata$time on an
hourly
basis,
and
pass these into a function that can compute a
time-weighted
average
spread,
for instance. Does anyone know an elegant way to do this?
I have
a
feeling
zoo may do what I want, but I'm new to zoo ...

Cheers
Rory

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

Here is one more solution.  Using the tmp from your post, this one uses
SQLite via the sqldf package.  It produces a similar output as z from
our prior solution and then
we can use xtabs, reshape or the reshape package as before to get
the final layout.  The first subselect within the main select sums within
hour and spread and the second sums within hour.  We join the subselects
and take the ratio of the two the sums to get the answer.

In the solutions before we used hour relative to GMT rather than local time.
library(sqldf)
sqldf("select ahour, spread,
+   100 * aa.timediff / bb.timediff timediff from
+   (select
+     strftime('%H',a.time__1,'unixepoch') ahour,
+     strftime('%H',b.time__1,'unixepoch') bhour,
+     a.spread spread,
+     sum(b.time__1 - a.time__1) timediff
+    from tmp a, tmp b
+    where a.row_names = b.row_names-1 and ahour = bhour
+    group by ahour, a.spread) aa join
+    (select strftime('%H',c.time__1,'unixepoch') chour,
+       strftime('%H',d.time__1,'unixepoch') dhour,
+       sum(d.time__1 - c.time__1) timediff
+       from tmp c, tmp d
+       where c.row_names = d.row_names-1 and chour = dhour
+       group by chour) bb
+     where ahour = chour
+     group by spread, ahour",
+  row.names = TRUE)
  ahour spread  timediff
1    11  1e-04 91.358025
2    12  1e-04 92.613636
3    11  2e-04  8.641975
4    12  2e-04  5.681818
5    12  3e-04  1.704545
# old solution for comparison
twas <- function(x) {
+ y <- data.frame(timediff = diff(x$time), head(x, -1))
+ aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
fmt <- function(x) format(epoch + x, "%H", tz = "GMT")
tmp2 <- cbind(tmp, hour = fmt(tmp$time))

z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
z
hour spread  timediff
11.1   11  1e-04 91.358025
11.2   11  2e-04  8.641975
12.1   12  1e-04 92.613636
12.2   12  2e-04  5.681818
12.3   12  3e-04  1.704545

Here is input:

library(sqldf)
sqldf("select ahour, spread,
	100 * aa.timediff / bb.timediff timediff from
	(select
	strftime('%H',a.time__1,'unixepoch') ahour,
	strftime('%H',b.time__1,'unixepoch') bhour,
	a.spread spread,
	sum(b.time__1 - a.time__1) timediff
	from tmp a, tmp b
	where a.row_names = b.row_names-1 and ahour = bhour
	group by ahour, a.spread) aa join
	(select strftime('%H',c.time__1,'unixepoch') chour,
		strftime('%H',d.time__1,'unixepoch') dhour,
		sum(d.time__1 - c.time__1) timediff
		from tmp c, tmp d
		where c.row_names = d.row_names-1 and chour = dhour
		group by chour) bb
	where ahour = chour
	group by spread, ahour",
	row.names = TRUE)
The different invocations of twas were creating data frames of different
numbers of columns because different hours had different numbers
of spreads.  The warning came when it tried to rbind together
data.frames with different numbers of columns.

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Wow....thats great. Thank you very much! I appreciate the help greatly. I
dont quite understand what the issue was though....was it that the data
frame returned from my initial twas() function was of the wrong order?

Cheers
Rory

On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
I had omitted fmt and epoch.  tmp is as in your post.

twas <- function(x) {
        y <- data.frame(timediff = diff(x$time), head(x, -1))
        aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
fmt <- function(x) format(epoch + x, "%H")
tmp2 <- cbind(tmp, hour = fmt(tmp$time))

z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# three alternatives

# 1
xtabs(timediff ~., z)

# 2
reshape(z, dir = "wide", timevar = "spread", idvar = "hour")

# 3
library(reshape)
cast(melt(z, id = 1:2), hour ~ spread)

On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Try producing it in "long" format using aggregate and then reshaping
it into "wide" format using xtabs, reshape or the reshape package:

twas <- function(x) {
       y <- data.frame(timediff = diff(x$time), head(x, -1))
       aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# any one of these three:

xtabs(timediff ~., long)

reshape(long, dir = "wide", timevar = "spread", idvar = "hour")

library(reshape)
cast(melt(long, id = 1:2), hour ~ spread)

On 8/3/07, Rory Winston < rory.winston at gmail.com> wrote:
Hi

Sorry, I'm not sure what happened with that last one. Here is a fully
contained example (sorry about the line length if this doesnt wrap).

tmp <- data.frame(

time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),

spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
)

twas <- function (dat)
{
    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100)
}

now <- Sys.time()
epoch <- now - as.numeric(now)

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
I still get no warning.  Please provide complete self contained
input
and output.

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+  2e-04, 3e-04))
twas <-
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi

I have figured out what causes the warning (and recycling), but I
am not
sure how I can fix it. After seeing that it seemed to work for
you, I
went
back and tried working with different subsets of the data. I
eventually
found where it occurs - when we get a third unique spread value.
To
reproduce, just change the definition of tmp to be:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
 2e-04, 3e-04)) <== Added 3e-04

i.e. I have just changed one of the spread values to be a third
value -
this
seems to trigger the warning  "Warning message:number of columns
of
result
is not a multiple of vector length (arg 3) in: rbind", and the
recycling. I
tried this on R 2.5.0 and 2.5.1

Can anyone see what I am doing wrong here?

Cheers
Rory

On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
Can you provide a reproducible example that exhibits the
warning.
Redoing it in a more easily reproducible way and using the data
in your post gives me no warning

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04,
1e-04,
+ 2e-04, 1e-04))
twas <-
+  function(dat) {
+     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"

Here is input:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 1e-04))
twas <-
function(dat) {
   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time ()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
twas))
z
R.version.string # XP

On 8/3/07, Rory Winston <rory.winston at gmail.com > wrote:
Hi

I've been wrestling with this a little bit, using the example
in the
email
that Gabor pointed me to as a reference, and I think I have
almost
got
what
I want...however its still not quite right.

I have a variable, tmp, with two dimensions: time and spread:

head(tmp$time)
[1] 1185882786 1185882790 1185882791 1185882791 1185882792
1185882795

head(tmp$spread)
[1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04

I also have a function that calculates the time-weighted
average
spread:

twas
function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}

I can combine them using as rbind() and by():

z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
twas))
(epoch is just an instance of ISOdatetime)

This gives me a warning:

Warning message:
number of columns of result
       is not a multiple of vector length (arg 3) in: rbind(1,
"12"
= c(
91.99207541277 , 8.00792458723005), "13" = c(90.1884966797708,

The output from the above command is almost exactly what I
need,
apart
from
the recycling:

     1e-04     2e-04      3e-04        4e-04
12 91.99208  8.007925 91.9920754  8.007924587 <== recycled
values
13 90.18850  9.337448  0.4218405  0.052214551
14 90.59640  9.171417  0.2321811 90.596401668
15 89.55771 10.194291  0.2343418  0.013661453
...

I can just pass this into a barplot() and get a nice visual
breakdown of
hourly weighted spreads, *but* I dont know how to get these
results
without
the recycling. Looking at rbind(), it seems that this will
automatically
recycle. Does anyone know of a function I could use to get
these
results
without this problem?

Cheers
Rory

On 8/1/07, Gabor Grothendieck < ggrothendieck at gmail.com >
wrote:
Something similar was just discussed this morning:

https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html

On 8/1/07, Rory Winston <rory.winston at gmail.com > wrote:
Hi all

I have a question about aggegating statistics by time
intervals.
I
have
a
data set with 3 columns : time, bid, and ask. Time is
specified
as a
millisecond timestamp since epoch. I would like to compute
summary
statistics for the data set on an hourly basis. Here is
what I
have
tried so
far:

# Data is in pricedata

t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
agg <- aggregate(pricedata$spread, list(byhour=format(t,
"%Y-%m
%H")),
mean)
This seems to do what I want - however, what really want
to do
is
more
specific: I would like to be able to extract a subset of
the
data
frame
pricedata, and not just the aggregated entries - for
instance,
instead
of
just extracting pricedata$spread by hour, I would like to
extract a
slice of
columns, e.g. pricedata$spread and pricedata$time on an
hourly
basis,
and
pass these into a function that can compute a
time-weighted
average
spread,
for instance. Does anyone know an elegant way to do this?
I have
a
feeling
zoo may do what I want, but I'm new to zoo ...

Cheers
Rory

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

       [[alternative HTML version deleted]]

_______________________________________________
R-SIG-Finance at stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only.
-- If you want to post, subscribe first.

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-sig-finance/attachments/20070803/c26e31ac/attachment.pl