Aggregating Statistics By Time Interval
13 messages · Gabor Grothendieck, Rory Winston
Something similar was just discussed this morning: https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html
On 8/1/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi all
I have a question about aggregating statistics by time intervals. I have a
data set with 3 columns: time, bid, and ask. Time is specified as a
millisecond timestamp since epoch. I would like to compute summary
statistics for the data set on an hourly basis. Here is what I have tried so
far:
# Data is in pricedata
t <- ISOdatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
agg <- aggregate(pricedata$spread, list(byhour=format(t, "%Y-%m %H")), mean)
This seems to do what I want. However, what I really want to do is more
specific: I would like to be able to extract a subset of the data frame
pricedata, and not just the aggregated entries. For instance, instead of
just extracting pricedata$spread by hour, I would like to extract a slice of
columns, e.g. pricedata$spread and pricedata$time on an hourly basis, and
pass these into a function that can compute a time-weighted average spread.
Does anyone know an elegant way to do this? I have a feeling
zoo may do what I want, but I'm new to zoo ...
Cheers
Rory
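A minimal sketch of the per-hour slicing asked about above, not taken from the thread: split() cuts the data frame into one piece per hour, and a function then sees both columns of each piece. The sample values, the name twa_spread, and the use of second (rather than millisecond) timestamps are assumptions for illustration; millisecond timestamps would need dividing by 1000 first.

```r
# Hypothetical sample data: epoch seconds and a precomputed spread column
pricedata <- data.frame(
  time   = c(0, 10, 30, 3600, 3620, 3660),
  spread = c(1e-04, 2e-04, 1e-04, 3e-04, 2e-04, 1e-04)
)

# Label every row with its calendar hour (UTC assumed)
stamp  <- as.POSIXct(pricedata$time, origin = "1970-01-01", tz = "UTC")
byhour <- format(stamp, "%Y-%m-%d %H")

# Time-weighted average: each spread is weighted by the time until the next tick;
# the last row of each hour has no following tick in its slice, so it gets no weight
twa_spread <- function(dat) {
  dt <- diff(dat$time)
  sum(head(dat$spread, -1) * dt) / sum(dt)
}

# One slice of the data frame per hour, one number back per slice
res <- sapply(split(pricedata, byhour), twa_spread)
print(res)
```

Here res is a named numeric vector with one time-weighted average per hour. Because each hour is handled as a separate slice, the interval that straddles an hour boundary is not counted.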
_______________________________________________ R-SIG-Finance at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-finance -- Subscriber-posting only. -- If you want to post, subscribe first.
1 day later
Can you provide a reproducible example that exhibits the warning. Redoing it in a more easily reproducible way and using the data in your post gives me no warning
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 1e-04))
twas <-
+ function(dat) {
+ data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+ sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"
Here is input:
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 1e-04))
twas <-
function(dat) {
data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
R.version.string # XP
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi
I've been wrestling with this a little bit, using the example in the email that Gabor pointed me to as a reference, and I think I have almost got what I want. However, it's still not quite right. I have a variable, tmp, with two columns: time and spread:
head(tmp$time)
[1] 1185882786 1185882790 1185882791 1185882791 1185882792 1185882795
head(tmp$spread)
[1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04
I also have a function that calculates the time-weighted average spread:
twas
function(dat) {
data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100.0)
}
I can combine them using rbind() and by():
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
(epoch is just an instance of ISOdatetime)
This gives me a warning:
Warning message:
number of columns of result
is not a multiple of vector length (arg 3) in: rbind(1, "12" = c(
91.99207541277, 8.00792458723005), "13" = c(90.1884966797708,
The output from the above command is almost exactly what I need, apart from
the recycling:
      1e-04     2e-04      3e-04        4e-04
12 91.99208  8.007925 91.9920754  8.007924587   <== recycled values
13 90.18850  9.337448  0.4218405  0.052214551
14 90.59640  9.171417  0.2321811 90.596401668
15 89.55771 10.194291  0.2343418  0.013661453
...
I can just pass this into a barplot() and get a nice visual breakdown of
hourly weighted spreads, *but* I don't know how to get these results without
the recycling. Looking at rbind(), it seems that this will automatically
recycle. Does anyone know of a function I could use to get these results
without this problem?
Cheers
Rory
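The recycling can be reproduced in isolation. This is a small sketch with made-up names a and b, using rounded values from the output above: when rbind() is given vectors of different lengths, it repeats the shorter one to fill the row.

```r
a <- c(91.99, 8.01)              # an hour with two distinct spreads
b <- c(90.19, 9.34, 0.42, 0.05)  # an hour with four distinct spreads

# rbind() sizes the result to the longest vector and recycles the shorter one
m <- rbind("12" = a, "13" = b)
print(m)
```

Row "12" comes back as 91.99, 8.01, 91.99, 8.01. R only warns when the longer length is not a whole multiple of the shorter one, so this particular case recycles silently.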
I still get no warning. Please provide complete self contained input and output.
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 3e-04))
twas <-
+ function(dat) {
+ data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+ sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi
I have figured out what causes the warning (and recycling), but I am not sure how I can fix it. After seeing that it seemed to work for you, I went back and tried working with different subsets of the data. I eventually found where it occurs - when we get a third unique spread value. To reproduce, just change the definition of tmp to be:
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 3e-04)) # <== Added 3e-04
i.e. I have just changed one of the spread values to be a third value. This seems to trigger the warning "Warning message: number of columns of result is not a multiple of vector length (arg 3) in: rbind", and the recycling. I tried this on R 2.5.0 and 2.5.1.
Can anyone see what I am doing wrong here?
Cheers
Rory
On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Try producing it in "long" format using aggregate and then reshaping
it into "wide" format using xtabs, reshape or the reshape package:
twas <- function(x) {
y <- data.frame(timediff = diff(x$time), head(x, -1))
aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
# any one of these three:
xtabs(timediff ~., long)
reshape(long, dir = "wide", timevar = "spread", idvar = "hour")
library(reshape)
cast(melt(long, id = 1:2), hour ~ spread)
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi
Sorry, I'm not sure what happened with that last one. Here is a fully
contained example (sorry about the line length if this doesn't wrap).
tmp <- data.frame(
time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),
spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
)
twas <- function (dat)
{
data.frame(tapply(diff(dat$time), head(dat$spread, -1),
sum)/sum(diff(dat$time)) * 100)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
Cheers
Rory
I had omitted fmt and epoch. tmp is as in your post.
twas <- function(x) {
y <- data.frame(timediff = diff(x$time), head(x, -1))
aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
fmt <- function(x) format(epoch + x, "%H")
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
# three alternatives
# 1
xtabs(timediff ~., z)
# 2
reshape(z, dir = "wide", timevar = "spread", idvar = "hour")
# 3
library(reshape)
cast(melt(z, id = 1:2), hour ~ spread)
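To see what the reshaping step does, here is a small sketch on a hand-made long table (the numbers are invented for illustration, not output from the thread's data). xtabs() sums timediff within each hour/spread cell and fills combinations that never occur with 0, so every hour gets the same set of columns.

```r
# Hypothetical long-format result: one row per (hour, spread) pair that occurred
long <- data.frame(
  hour     = c("12", "12", "13", "13", "13"),
  spread   = c(1e-04, 2e-04, 1e-04, 2e-04, 3e-04),
  timediff = c(92, 8, 90, 9.5, 0.5)
)

# Wide format: rows are hours, columns are spread values
wide <- xtabs(timediff ~ hour + spread, data = long)
print(wide)
```

Hour 12 never saw a spread of 3e-04, so that cell is 0 rather than a recycled value.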
The different invocations of twas were creating data frames with different numbers of columns, because different hours had different numbers of distinct spread values. The warning came when rbind tried to bind together data frames with different numbers of columns.
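One way to make every hour return the same columns while keeping the original wide approach, sketched here under assumptions: the spread is turned into a factor with a shared set of levels, so tapply() yields one cell per level in every hour and empty cells can be set to 0. The data, lev, and twas_fixed are illustrative names, not from the thread.

```r
# Made-up data spanning two hours (epoch seconds); only the second hour
# contributes a third spread value
tmp <- data.frame(
  time   = c(0, 4, 5, 5, 6, 9, 3600, 3602, 3605, 3610),
  spread = c(1e-04, 1e-04, 2e-04, 1e-04, 2e-04, 1e-04,
             1e-04, 3e-04, 2e-04, 1e-04)
)
hour <- format(as.POSIXct(tmp$time, origin = "1970-01-01", tz = "UTC"), "%H")

# Shared set of spread levels across all hours
lev <- sort(unique(tmp$spread))

twas_fixed <- function(dat) {
  f <- factor(head(dat$spread, -1), levels = lev)
  w <- tapply(diff(dat$time), f, sum)   # NA for levels absent in this hour
  out <- as.vector(w)
  names(out) <- levels(f)
  out[is.na(out)] <- 0
  100 * out / sum(out)                  # percentage of time at each spread
}

# Every piece now has length(lev) entries, so rbind() has nothing to recycle
z <- do.call(rbind, lapply(split(tmp, hour), twas_fixed))
print(z)
```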
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Wow... that's great. Thank you very much! I appreciate the help greatly. I don't quite understand what the issue was though... was it that the data frame returned from my initial twas() function was of the wrong order?
Cheers
Rory
I had omitted fmt and epoch. tmp is as in your post.
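# per-hour helper: percentage of elapsed time spent at each spread
twas <- function(x) {
  # pair each quote with the seconds until the next one; the last row has no successor
  y <- data.frame(timediff = diff(x$time), head(x, -1))
  aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
now <- Sys.time()
epoch <- now - as.numeric(now)  # POSIXct for the Unix epoch
fmt <- function(x) format(epoch + x, "%H")  # hour of day in the local time zone
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))  # long format: hour, spread, timediff
# three alternatives for reshaping z into wide format
# 1
xtabs(timediff ~., z)
# 2
reshape(z, dir = "wide", timevar = "spread", idvar = "hour")
# 3
library(reshape)
cast(melt(z, id = 1:2), hour ~ spread)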
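For readers who want to try the long-then-wide idea without the reshape package, here is a self-contained sketch in base R. It is not Gabor's code: the seven-row sample and the name `twas_long` are made up for illustration, `split()` with `lapply()` stands in for `by()`, and the hour is taken in GMT.

```r
# Hypothetical sample: seven ticks straddling 12:00:00 GMT.
tmp <- data.frame(
  time   = c(1185883190, 1185883194, 1185883196, 1185883200,
             1185883203, 1185883204, 1185883210),
  spread = c(1e-04, 2e-04, 1e-04, 1e-04, 2e-04, 3e-04, 1e-04)
)
tmp$hour <- format(as.POSIXct(tmp$time, origin = "1970-01-01", tz = "GMT"),
                   "%H", tz = "GMT")

# One row per (hour, spread): percentage of the hour's elapsed time at that spread.
twas_long <- function(x) {
  dt <- diff(x$time)                           # seconds each quote stayed in force
  secs <- tapply(dt, head(x$spread, -1), sum)  # total seconds per spread level
  data.frame(hour = x$hour[1],
             spread = names(secs),
             timediff = 100 * as.numeric(secs) / sum(dt))
}

# Every piece has the same three columns, so rbind() has nothing to recycle.
long <- do.call(rbind, lapply(split(tmp, tmp$hour), twas_long))
wide <- xtabs(timediff ~ hour + spread, long)  # absent combinations become 0
print(wide)
```

Because each per-hour piece is in long format, hours with two spread levels and hours with three can be stacked safely; the wide layout is only built at the very end.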
On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Try producing it in "long" format using aggregate and then reshaping
it into "wide" format using xtabs, reshape or the reshape package:
twas <- function(x) {
  y <- data.frame(timediff = diff(x$time), head(x, -1))
  aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
# any one of these three:
xtabs(timediff ~., long)
reshape(long, dir = "wide", timevar = "spread", idvar = "hour")
library(reshape)
cast(melt(long, id = 1:2), hour ~ spread)
On 8/3/07, Rory Winston < rory.winston at gmail.com> wrote:
Hi
Sorry, I'm not sure what happened with that last one. Here is a fully contained example (sorry about the line length if this doesn't wrap).
tmp <- data.frame(
time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),
spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
)
twas <- function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1), sum) /
             sum(diff(dat$time)) * 100)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
Cheers
Rory
On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
I still get no warning. Please provide complete self contained input and output.
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 3e-04))
twas <-
+ function(dat) {
+ data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+ sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi
I have figured out what causes the warning (and recycling), but I am not sure how I can fix it. After seeing that it seemed to work for you, I went back and tried working with different subsets of the data. I eventually found where it occurs - when we get a third unique spread value. To reproduce, just change the definition of tmp to be:
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04, 2e-04, 3e-04))  # <== Added 3e-04
i.e. I have just changed one of the spread values to be a third value - this seems to trigger the warning "Warning message: number of columns of result is not a multiple of vector length (arg 3) in: rbind", and the recycling. I tried this on R 2.5.0 and 2.5.1.
Can anyone see what I am doing wrong here?
Cheers
Rory
On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Can you provide a reproducible example that exhibits the warning? Redoing it in a more easily reproducible way and using the data in your post gives me no warning:
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 1e-04))
twas <-
+ function(dat) {
+ data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+ sum)/sum(diff(dat$time)) * 100.0)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
      1e-04    2e-04
07 66.66667 33.33333
R.version.string # XP
[1] "R version 2.5.1 (2007-06-27)"
Here is input:
tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
  2e-04, 1e-04))
twas <-
function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1),
    sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
R.version.string # XP
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
Hi
I've been wrestling with this a little bit, using the example that Gabor pointed me to as a reference, and I think I have almost got what I want... however, it's still not quite right. I have a variable, tmp, with two dimensions: time and spread:
head(tmp$time)
[1] 1185882786 1185882790 1185882791 1185882791 1185882792 1185882795
head(tmp$spread)
[1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04
I also have a function that calculates the time-weighted average spread:
twas
function(dat) {
  data.frame(tapply(diff(dat$time), head(dat$spread, -1),
    sum)/sum(diff(dat$time)) * 100.0)
}
I can combine them using rbind() and by():
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
(epoch is just an instance of ISOdatetime)
This gives me a warning:
Warning message:
number of columns of result is not a multiple of vector length (arg 3) in: rbind(1, "12" = c(91.99207541277, 8.00792458723005), "13" = c(90.1884966797708,
The output from the above command is almost exactly what I need, apart from the recycling:
      1e-04     2e-04      3e-04        4e-04
12 91.99208  8.007925 91.9920754  8.007924587   <== recycled values
13 90.18850  9.337448  0.4218405  0.052214551
14 90.59640  9.171417  0.2321811 90.596401668
15 89.55771 10.194291  0.2343418  0.013661453
...
I can just pass this into a barplot() and get a nice visual breakdown of hourly weighted spreads, *but* I don't know how to get these results without the recycling. Looking at rbind(), it seems that this will automatically recycle. Does anyone know of a function I could use to get these results without this problem?
Cheers
Rory
On 8/1/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Something similar was just discussed this morning:
Here is one more solution. Using the tmp from your post, this one uses SQLite via the sqldf package. It produces output similar to z from our prior solution, and then we can use xtabs, reshape or the reshape package as before to get the final layout. The first subselect within the main select sums within hour and spread, and the second sums within hour. We join the subselects and take the ratio of the two sums to get the answer. Note that this solution takes the hour relative to GMT, whereas the earlier solutions called format() without tz and so used local time; the old solution repeated below for comparison sets tz = "GMT" so that the two agree.
library(sqldf)
sqldf("select ahour, spread,
+ 100 * aa.timediff / bb.timediff timediff from
+ (select
+ strftime('%H',a.time__1,'unixepoch') ahour,
+ strftime('%H',b.time__1,'unixepoch') bhour,
+ a.spread spread,
+ sum(b.time__1 - a.time__1) timediff
+ from tmp a, tmp b
+ where a.row_names = b.row_names-1 and ahour = bhour
+ group by ahour, a.spread) aa join
+ (select strftime('%H',c.time__1,'unixepoch') chour,
+ strftime('%H',d.time__1,'unixepoch') dhour,
+ sum(d.time__1 - c.time__1) timediff
+ from tmp c, tmp d
+ where c.row_names = d.row_names-1 and chour = dhour
+ group by chour) bb
+ where ahour = chour
+ group by spread, ahour",
+ row.names = TRUE)
  ahour spread  timediff
1    11  1e-04 91.358025
2    12  1e-04 92.613636
3    11  2e-04  8.641975
4    12  2e-04  5.681818
5    12  3e-04  1.704545
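The logic of the query (pair each row with its successor, sum the time differences within hour and spread, sum them again within hour, then take the ratio) can also be sketched in base R. This is an illustration under assumptions, not a translation checked against sqldf: the seven-row sample and the names `pairs`, `num`, `den` and `res` are hypothetical.

```r
# Hypothetical sample: seven ticks straddling 12:00:00 GMT.
time   <- c(1185883190, 1185883194, 1185883196, 1185883200,
            1185883203, 1185883204, 1185883210)
spread <- c(1e-04, 2e-04, 1e-04, 1e-04, 2e-04, 3e-04, 1e-04)
hour   <- format(as.POSIXct(time, origin = "1970-01-01", tz = "GMT"),
                 "%H", tz = "GMT")

# Pair each row with its successor (the self-join on row_names in the query).
pairs <- data.frame(hour = head(hour, -1), next_hour = tail(hour, -1),
                    spread = head(spread, -1), timediff = diff(time))
# Keep only intervals that start and end in the same hour (ahour = bhour).
pairs <- pairs[pairs$hour == pairs$next_hour, ]

num <- aggregate(timediff ~ hour + spread, data = pairs, FUN = sum)  # first subselect
den <- aggregate(timediff ~ hour, data = pairs, FUN = sum)           # second subselect
names(den)[2] <- "total"

# Join on hour and take the ratio of the two sums.
res <- merge(num, den, by = "hour")
res$timediff <- 100 * res$timediff / res$total
res$total <- NULL
print(res)
```

As in the query, an interval that crosses an hour boundary is dropped rather than split between the two hours.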
# old solution for comparison
twas <- function(x) {
+ y <- data.frame(timediff = diff(x$time), head(x, -1))
+ aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
+ }
now <- Sys.time()
epoch <- now - as.numeric(now)
fmt <- function(x) format(epoch + x, "%H", tz = "GMT")
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
z
     hour spread  timediff
11.1   11  1e-04 91.358025
11.2   11  2e-04  8.641975
12.1   12  1e-04 92.613636
12.2   12  2e-04  5.681818
12.3   12  3e-04  1.704545
Here is input:
library(sqldf)
sqldf("select ahour, spread,
100 * aa.timediff / bb.timediff timediff from
(select
strftime('%H',a.time__1,'unixepoch') ahour,
strftime('%H',b.time__1,'unixepoch') bhour,
a.spread spread,
sum(b.time__1 - a.time__1) timediff
from tmp a, tmp b
where a.row_names = b.row_names-1 and ahour = bhour
group by ahour, a.spread) aa join
(select strftime('%H',c.time__1,'unixepoch') chour,
strftime('%H',d.time__1,'unixepoch') dhour,
sum(d.time__1 - c.time__1) timediff
from tmp c, tmp d
where c.row_names = d.row_names-1 and chour = dhour
group by chour) bb
where ahour = chour
group by spread, ahour",
row.names = TRUE)
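The hour label depends on the time zone used when formatting, which is why the same data showed up under hour 07 earlier in the thread and under hour 11 here. A small sketch of the effect follows; the zone "America/New_York" is only an assumed example of a local zone, and `t0` is a hypothetical name.

```r
# First timestamp from the thread's tmp data, in seconds since the Unix epoch.
t0 <- as.POSIXct(1185882786, origin = "1970-01-01", tz = "GMT")

format(t0, "%H", tz = "GMT")               # "11"
format(t0, "%H", tz = "America/New_York")  # hour in an assumed local zone

# The thread's epoch trick gives the same GMT hour once tz is stated explicitly.
now <- Sys.time()
epoch <- now - as.numeric(now)             # POSIXct for 1970-01-01 00:00:00 UTC
format(epoch + 1185882786, "%H", tz = "GMT")
```

Passing tz explicitly to format() keeps the hourly buckets the same regardless of the machine the code runs on.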