Skip to content

Aggregating Statistics By Time Interval

13 messages · Gabor Grothendieck, Rory Winston

#
Something similar was just discussed this morning:
https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html
On 8/1/07, Rory Winston <rory.winston at gmail.com> wrote:
1 day later
#
Can you provide a reproducible example that exhibits the warning.
Redoing it in a more easily reproducible way and using the data
in your post gives me no warning
+ 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+ 2e-04, 1e-04))
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
1e-04    2e-04
07 66.66667 33.33333
[1] "R version 2.5.1 (2007-06-27)"


Here is input:

tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791, 1185882791,
1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
2e-04, 1e-04))
twas <-
 function(dat) {
   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
 sum)/sum(diff(dat$time)) * 100.0)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
z
R.version.string # XP
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
#
I still get no warning.  Please provide complete self contained input
and output.
+  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
+  2e-04, 3e-04))
+  function(dat) {
+    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
+  sum)/sum(diff(dat$time)) * 100.0)
+ }
1e-04    2e-04
07 66.66667 33.33333
[1] "R version 2.5.1 (2007-06-27)"
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
#
Try producing it in "long" format using aggregate and then reshaping
it into "wide" format using xtabs, reshape or the reshape package:

twas <- function(x) {
	y <- data.frame(timediff = diff(x$time), head(x, -1))
	aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# any one of these three:

xtabs(timediff ~., long)

reshape(long, dir = "wide", timevar = "spread", idvar = "hour")

library(reshape)
cast(melt(long, id = 1:2), hour ~ spread)
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
#
I had omitted fmt and epoch.  tmp is as in your post.

twas <- function(x) {
	y <- data.frame(timediff = diff(x$time), head(x, -1))
	aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
fmt <- function(x) format(epoch + x, "%H")
tmp2 <- cbind(tmp, hour = fmt(tmp$time))

z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# three alternatives

# 1
xtabs(timediff ~., z)

# 2
reshape(z, dir = "wide", timevar = "spread", idvar = "hour")

# 3
library(reshape)
cast(melt(z, id = 1:2), hour ~ spread)
On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
#
The different invocations of twas were creating data frames of different
numbers of columns because different hours had different numbers
of spreads.  The warning came when it tried to rbind together
data.frames with different numbers of columns.
On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
#
Here is one more solution.  Using the tmp from your post, this one uses
SQLite via the sqldf package.  It produces a similar output as z from
our prior solution and then
we can use xtabs, reshape or the reshape package as before to get
the final layout.  The first subselect within the main select sums within
hour and spread and the second sums within hour.  We join the subselects
and take the ratio of the two the sums to get the answer.

In the solutions before we used hour relative to GMT rather than local time.
+   100 * aa.timediff / bb.timediff timediff from
+   (select
+     strftime('%H',a.time__1,'unixepoch') ahour,
+     strftime('%H',b.time__1,'unixepoch') bhour,
+     a.spread spread,
+     sum(b.time__1 - a.time__1) timediff
+    from tmp a, tmp b
+    where a.row_names = b.row_names-1 and ahour = bhour
+    group by ahour, a.spread) aa join
+    (select strftime('%H',c.time__1,'unixepoch') chour,
+       strftime('%H',d.time__1,'unixepoch') dhour,
+       sum(d.time__1 - c.time__1) timediff
+       from tmp c, tmp d
+       where c.row_names = d.row_names-1 and chour = dhour
+       group by chour) bb
+     where ahour = chour
+     group by spread, ahour",
+  row.names = TRUE)
  ahour spread  timediff
1    11  1e-04 91.358025
2    12  1e-04 92.613636
3    11  2e-04  8.641975
4    12  2e-04  5.681818
5    12  3e-04  1.704545
+ y <- data.frame(timediff = diff(x$time), head(x, -1))
+ aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
+ }
hour spread  timediff
11.1   11  1e-04 91.358025
11.2   11  2e-04  8.641975
12.1   12  1e-04 92.613636
12.2   12  2e-04  5.681818
12.3   12  3e-04  1.704545

Here is input:

library(sqldf)
sqldf("select ahour, spread,
	100 * aa.timediff / bb.timediff timediff from
	(select
	strftime('%H',a.time__1,'unixepoch') ahour,
	strftime('%H',b.time__1,'unixepoch') bhour,
	a.spread spread,
	sum(b.time__1 - a.time__1) timediff
	from tmp a, tmp b
	where a.row_names = b.row_names-1 and ahour = bhour
	group by ahour, a.spread) aa join
	(select strftime('%H',c.time__1,'unixepoch') chour,
		strftime('%H',d.time__1,'unixepoch') dhour,
		sum(d.time__1 - c.time__1) timediff
		from tmp c, tmp d
		where c.row_names = d.row_names-1 and chour = dhour
		group by chour) bb
	where ahour = chour
	group by spread, ahour",
	row.names = TRUE)
On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote: