An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120305/3be743c9/attachment.pl>
index instead of loop?
11 messages · Rui Barradas, Ben quant
Hello,
Mar 05, 2012; 8:53pm ? by Ben quant Ben quant Hello, Does anyone know of a way I can speed this up?
Maybe, let's see.
################################# change anything below.
# Yes.
# First, start by using dates, not characters
fdate <- function(x, format="%Y%m%d"){
DF <- data.frame(x)
for(i in colnames(DF)){
DF[, i] <- as.Date(DF[, i], format=format)
class(DF[, i]) <- "Date"
}
DF
}
rd1 <- fdate(rd1)
# This is yours, use it.
dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
"day")
# Set up the result, no time expensive 'cbind' inside a loop
fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
fin1[, 1] <- dt1
nr <- nrow(rd1)
# And vectorize
for(tkr in 1:ncol(ua)){
x <- c(rd1[, tkr], as.Date("9999-12-31"))
inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
sapply(1:length(inxlist), function(i) if(length(ix[[i]])) fin1[ix[[i]], tkr
+ 1] <<- ua[i, tkr])
}
colnames(fin1) <- c("daily_dates", colnames(ua))
# Check results
str(fin)
str(fin1)
head(fin)
head(fin1)
tail(fin)
tail(fin1)
Note that 'fin' has facotrs, 'fin1' numerics.
I haven't timed it but I believe it should be faster.
Hope this helps,
Rui Barradas
--
View this message in context: http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4448567.html
Sent from the R help mailing list archive at Nabble.com.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120306/25209b80/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120306/cae330be/attachment.pl>
Hello,
Just looking at this, but it looks like ix doesn't exist:
sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
Trying to sort it out now.
Right, sorry.
I've changed the name from 'ix' to 'inxlist' to make it more readable just
before posting.
And since the object 'ix' still existed in the R global environment it
didn't throw an error...
Your correction in the post that followed is what I meant.
Correction (full loop, tested):
for(tkr in 1:ncol(ua)){
x <- c(rd1[, tkr], as.Date("9999-12-31"))
ix <- lapply(1:nr, function(i)
which(x[i] <= dt1 & dt1 < x[i + 1]))
sapply(1:length(ix), function(i)
if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
}
Rui Barradas
--
View this message in context: http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4450186.html
Sent from the R help mailing list archive at Nabble.com.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120306/5e8797c0/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120306/39d91c9c/attachment.pl>
1 day later
Hello again. Ben quant wrote
Hello,
In case anyone is interested in a faster solution for lots of columns.
This
solution is slower if you only have a few columns. If anyone has anything
faster, I would be interested in seeing it.
### some mockup data
z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
nms = c("A","B","C","D") # add more columns to see how the code below is
fsater
# these are the report dates that are the real days the data was
available,
so show the data the day after this date ('after' is a design decision)
rd1 = matrix(c("20070514","20070814","20071115", "20080213",
"20080514", "20080814", "20081114", "20090217",
"20070410","20070709","20071009", "20080109",
"20080407", "20080708", "20081007", "20090112",
"20070426","--","--","--","--","--","--","20090319",
"--","--","--","--","--","--","--","--"),
nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)
# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)
################################ the fastest code I have found:
start_t_all = Sys.time()
fix = function(x)
{
year = substring(x, 1, 4)
mo = substring(x, 5, 6)
day = substring(x, 7, 8)
ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}
rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)
wd1 <- seq(from =as.Date(min(z.dates)), to = Sys.Date(), by = "day")
#wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
wd = sapply(wd1, as.character)
mat = matrix(NA,nrow=length(wd),ncol=ncol(ua))
rownames(mat) = wd
nms = as.Date(rownames(ua))
for(i in 1:length(wd)){
d = as.Date(wd[i])
diff = abs(nms - d)
rd_row_idx = max(which(diff == min(diff)))
rd_col_idx = which(rd[rd_row_idx,] < d)
if((rd_row_idx - 1) > 0){
mat[i,] = ua[rd_row_idx - 1,]
}
if( length(rd_col_idx)){
mat[i,rd_col_idx] = ua[rd_row_idx,rd_col_idx]
}
}
colnames(mat)=colnames(ua)
print(Sys.time()-start_t_all)
Regards,
Ben
On Tue, Mar 6, 2012 at 8:22 AM, Rui Barradas <rui1174@> wrote:
Hello,
Just looking at this, but it looks like ix doesn't exist:
sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
Trying to sort it out now.
Right, sorry.
I've changed the name from 'ix' to 'inxlist' to make it more readable
just
before posting.
And since the object 'ix' still existed in the R global environment it
didn't throw an error...
Your correction in the post that followed is what I meant.
Correction (full loop, tested):
for(tkr in 1:ncol(ua)){
x <- c(rd1[, tkr], as.Date("9999-12-31"))
ix <- lapply(1:nr, function(i)
which(x[i] <= dt1 & dt1 < x[i + 1]))
sapply(1:length(ix), function(i)
if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i,
tkr])
}
Rui Barradas
--
View this message in context:
http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4450186.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help@ mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
______________________________________________ R-help@ mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Maybe I'm not understanding the problem very well, but let me describe what I'm thinking it is. You have two tables, 'rd1' and 'ua' and a vector of dates, 'z.dates'. The result is a table such that: 1. From 'z.dates' make a vector of daily dates. 2. Each column is filled with numbers from 'ua' based on dates in 'rd1', starting at the day given in step 1. My doubt is that your last posted code seems to give a special role to column 'A'.
mat[225:232, ]
A B C D 2007-11-10 636.16 2645 NaN NaN 2007-11-11 636.16 2645 NaN NaN 2007-11-12 636.16 2645 NaN NaN 2007-11-13 636.16 2645 NaN NaN 2007-11-14 636.16 2645 NaN NaN 2007-11-15 655.91 2645 -98.426 NaN 2007-11-16 655.91 2645 -98.426 NaN 2007-11-17 655.91 2645 -98.426 NaN The values in column 'C' change following the date in column 'A'. That is the third date in 'rd1', more exactly, rd1[3, 1] == "20071115". Shouldn't the values in mat[, "C"] start at 2009-03-20? The corresponding value in 'ua' would then be 144.138. (I still believe this can be made much faster.) Rui Barradas -- View this message in context: http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4455223.html Sent from the R help mailing list archive at Nabble.com.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120308/a8483e19/attachment.pl>
Hello,
Humm.... If I understand what you are saying, you are correct. I get 144.138 for 2009-03-20 for column C. Maybe I posted the wrong code? If so, sorry.
I think I have the fastest so far solution, and it checks with your
corrected,last one.
I've made just a change: to transform it into a function I renamed the
parameters
(only for use inside the function) 'zdates', without the period, 'rddata'
and 'uadata'.
'fun1' is yours, 'fun2', mine. Here it goes.
fun1 <- function(zdates, rddata, uadata){
fix = function(x)
{
year = substring(x, 1, 4)
mo = substring(x, 5, 6)
day = substring(x, 7, 8)
ifelse(year=="--", "--", paste(year, mo, day, sep = "-"))
}
rd = apply(rddata, 2, fix)
dimnames(rd) = dimnames(rd)
wd1 <- seq(from =as.Date(min(zdates)), to = Sys.Date(), by = "day")
#wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
wd = sapply(wd1, as.character)
mat = matrix(NA,nrow=length(wd),ncol=ncol(uadata))
rownames(mat) = wd
nms = as.Date(rownames(uadata))
for(i in 1:length(wd)){
d = as.Date(wd[i])
diff = abs(nms - d)
rd_row_idx = max(which(diff == min(diff)))
rd_col_idx = which(as.Date(rd[rd_row_idx,], format="%Y-%m-%d") < d)
rd_col_idx_lag = which(as.Date(rd[rd_row_idx - 1,], format="%Y-%m-%d")
< d)
rd_col_idx_lag2 = which(as.Date(rd[rd_row_idx - 2,],
format="%Y-%m-%d") < d)
if(length(rd_col_idx_lag2) && (rd_row_idx - 2) > 0){
mat[i,rd_col_idx_lag2] = uadata[rd_row_idx - 2,rd_col_idx_lag2]
}
if(length(rd_col_idx_lag)){
mat[i,rd_col_idx_lag] = uadata[rd_row_idx - 1,rd_col_idx_lag]
}
if( length(rd_col_idx)){
mat[i,rd_col_idx] = uadata[rd_row_idx,rd_col_idx]
}
}
colnames(mat)=colnames(uadata)
mat
}
fun2 <- function(zdates, rddata, uadata){
fdate <- function(x, format="%Y%m%d"){
DF <- data.frame(x)
for(i in colnames(DF)){
DF[, i] <- as.Date(DF[, i], format=format)
class(DF[, i]) <- "Date"
}
DF
}
rddata <- fdate(rddata)
wd1 <- seq(from = as.Date(zdates[1]), to = Sys.Date(), by = "day")
nwd1 <- length(wd1)
fin1 <- matrix(NA, nrow=length(wd1), ncol=ncol(uadata))
nr <- nrow(rddata)
xstart <- c(integer(nr), nwd1)
for(j in 1:ncol(uadata)){
x <- xstart
for(i in 1:nr)
x[i] <- if(!is.na(rddata[i, j]) & !is.nan(rddata[i, j]))
which(wd1 == rddata[i, j])
else NA
ix <- which(!is.na(x))
for(i in seq_len(length(ix) - 1)){
from <- x[ ix[i] ] + 1
to <- x[ ix[i + 1] ]
fin1[ from:to, j ] <- uadata[ ix[i], j ]
}
}
colnames(fin1) <- colnames(uadata)
rownames(fin1) <- as.character(wd1)
fin1
}
t1 <- system.time(m1 <- fun1(z.dates, rd1, ua))
t2 <- system.time(m2 <- fun2(z.dates, rd1, ua))
all.equal(m1, m2)
[1] TRUE
rbind(t1, t2)
user.self sys.self elapsed user.child sys.child
t1 1.50 0 1.50 NA NA
t2 0.02 0 0.01 NA NA
And the better news is that I believe it scales up without degrading
performance,
like my first did.
See if it works.
Rui Barradas
--
View this message in context: http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4457290.html
Sent from the R help mailing list archive at Nabble.com.
1 day later
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120309/a6669914/attachment.pl>