index instead of loop?

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120305/3be743c9/attachment.pl>
Hello,
Mar 05, 2012; 8:53pm ? by Ben quant Ben quant
Hello,

Does anyone know of a way I can speed this up? 

Maybe, let's see.
################################# change anything below.

# Yes.
# First, start by using dates, not characters

fdate <- function(x, format="%Y%m%d"){
	DF <- data.frame(x)
	for(i in colnames(DF)){
		DF[, i] <- as.Date(DF[, i], format=format)
		class(DF[, i]) <- "Date"
	}
	DF
}

rd1 <- fdate(rd1)
# This is yours, use it.
dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
"day")
# Set up the result, no time expensive 'cbind' inside a loop
fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
fin1[, 1] <- dt1
nr <- nrow(rd1)

# And vectorize
for(tkr in 1:ncol(ua)){
	x  <- c(rd1[, tkr], as.Date("9999-12-31"))
	inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
	sapply(1:length(inxlist), function(i) if(length(ix[[i]])) fin1[ix[[i]], tkr
+ 1] <<- ua[i, tkr])
}
colnames(fin1) <- c("daily_dates", colnames(ua))

# Check results
str(fin)
str(fin1)
head(fin)
head(fin1)
tail(fin)
tail(fin1)

Note that 'fin' has facotrs, 'fin1' numerics.
I haven't timed it but I believe it should be faster.

Hope this helps,

Rui Barradas

--
View this message in context: http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4448567.html
Sent from the R help mailing list archive at Nabble.com.
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120306/25209b80/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120306/cae330be/attachment.pl>
Hello,
Just looking at this, but it looks like ix doesn't exist:
       sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])

 Trying to sort it out now.
Right, sorry.
I've changed the name from 'ix' to 'inxlist' to make it more readable just
before posting.
And since the object 'ix' still existed in the R global environment it
didn't throw an error...

Your correction in the post that followed is what I meant.

Correction (full loop, tested):

for(tkr in 1:ncol(ua)){
	x  <- c(rd1[, tkr], as.Date("9999-12-31"))
	ix <- lapply(1:nr, function(i)
			which(x[i] <= dt1 & dt1 < x[i + 1]))
	sapply(1:length(ix), function(i)
		if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
}

Rui Barradas

--
View this message in context: http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4450186.html
Sent from the R help mailing list archive at Nabble.com.
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120306/5e8797c0/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120306/39d91c9c/attachment.pl>
Hello again.

Ben quant wrote
Hello,

In case anyone is interested in a faster solution for lots of columns.
This
solution is slower if you only have a few columns.  If anyone has anything
faster, I would be interested in seeing it.

### some mockup data
z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D") # add more columns to see how the code below is
fsater
# these are the report dates that are the real days the data was
available,
so show the data the day after this date ('after' is a design decision)
rd1 = matrix(c("20070514","20070814","20071115",   "20080213",
"20080514",  "20080814",  "20081114",  "20090217",
               "20070410","20070709","20071009",   "20080109",
"20080407",  "20080708",  "20081007",  "20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

################################ the fastest code I have found:

start_t_all = Sys.time()
fix = function(x)
{
  year = substring(x, 1, 4)
  mo = substring(x, 5, 6)
  day = substring(x, 7, 8)
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}

rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)

wd1 <- seq(from =as.Date(min(z.dates)), to = Sys.Date(), by = "day")
#wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
wd = sapply(wd1, as.character)

mat = matrix(NA,nrow=length(wd),ncol=ncol(ua))
rownames(mat) = wd
nms = as.Date(rownames(ua))

for(i in 1:length(wd)){
  d = as.Date(wd[i])
  diff = abs(nms - d)
  rd_row_idx = max(which(diff == min(diff)))
  rd_col_idx = which(rd[rd_row_idx,] < d)

  if((rd_row_idx - 1) > 0){
    mat[i,] = ua[rd_row_idx - 1,]
  }
  if( length(rd_col_idx)){
    mat[i,rd_col_idx] = ua[rd_row_idx,rd_col_idx]
  }
}
colnames(mat)=colnames(ua)
print(Sys.time()-start_t_all)

Regards,

Ben

On Tue, Mar 6, 2012 at 8:22 AM, Rui Barradas &lt;rui1174@&gt; wrote:

Hello,

Just looking at this, but it looks like ix doesn't exist:
       sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])

 Trying to sort it out now.
Right, sorry.
I've changed the name from 'ix' to 'inxlist' to make it more readable
just
before posting.
And since the object 'ix' still existed in the R global environment it
didn't throw an error...

Your correction in the post that followed is what I meant.

Correction (full loop, tested):

for(tkr in 1:ncol(ua)){
       x  <- c(rd1[, tkr], as.Date("9999-12-31"))
        ix <- lapply(1:nr, function(i)
                        which(x[i] <= dt1 & dt1 < x[i + 1]))
        sapply(1:length(ix), function(i)
                if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i,
tkr])
}

Rui Barradas

--
View this message in context:
http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4450186.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@ mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

	[[alternative HTML version deleted]]

______________________________________________
R-help@ mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Maybe I'm not understanding the problem very well, but let me describe what
I'm thinking it is.

You have two tables, 'rd1' and 'ua' and a vector of dates, 'z.dates'.  The
result is a table such that:
1. From 'z.dates' make a vector of daily dates.
2. Each column is filled with numbers from 'ua' based on dates in 'rd1',
starting at the day given in step 1.
My doubt is that your last posted code seems to give a special role to
column 'A'.
mat[225:232, ]
A       B              C      D
2007-11-10 636.16 2645     NaN NaN
2007-11-11 636.16 2645     NaN NaN
2007-11-12 636.16 2645     NaN NaN
2007-11-13 636.16 2645     NaN NaN
2007-11-14 636.16 2645     NaN NaN
2007-11-15 655.91 2645 -98.426 NaN
2007-11-16 655.91 2645 -98.426 NaN
2007-11-17 655.91 2645 -98.426 NaN

The values in column 'C' change following the date in column 'A'. That is
the third date in 'rd1',
more exactly, rd1[3, 1] == "20071115".

Shouldn't the values in mat[, "C"] start at 2009-03-20? The corresponding
value in 'ua' would then be 144.138.

(I still believe this can be made much faster.)

Rui Barradas

--
View this message in context: http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4455223.html
Sent from the R help mailing list archive at Nabble.com.
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120308/a8483e19/attachment.pl>
Hello,
Humm.... If I understand what you are saying, you are correct. I get
144.138 for 2009-03-20 for column C. Maybe I posted the wrong code?  If
so,
sorry. 
I think I have the fastest so far solution, and it checks with your
corrected,last one.

I've made just a change: to transform it into a function I renamed the
parameters
(only for use inside the function) 'zdates', without the period, 'rddata'
and 'uadata'.

'fun1' is yours, 'fun2', mine. Here it goes.

fun1 <- function(zdates, rddata, uadata){
    fix = function(x)
    {
      year = substring(x, 1, 4)
      mo = substring(x, 5, 6)
      day = substring(x, 7, 8)
      ifelse(year=="--", "--", paste(year, mo, day, sep = "-"))

    }
    rd = apply(rddata, 2, fix)
    dimnames(rd) = dimnames(rd)

    wd1 <- seq(from =as.Date(min(zdates)), to = Sys.Date(), by = "day")
    #wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
    wd = sapply(wd1, as.character)
    mat = matrix(NA,nrow=length(wd),ncol=ncol(uadata))
    rownames(mat) = wd
    nms = as.Date(rownames(uadata))

    for(i in 1:length(wd)){
      d = as.Date(wd[i])
      diff = abs(nms - d)
      rd_row_idx = max(which(diff == min(diff)))
      rd_col_idx = which(as.Date(rd[rd_row_idx,], format="%Y-%m-%d")  < d)
      rd_col_idx_lag = which(as.Date(rd[rd_row_idx - 1,], format="%Y-%m-%d") 
< d)
      rd_col_idx_lag2 = which(as.Date(rd[rd_row_idx - 2,],
format="%Y-%m-%d")  < d)

      if(length(rd_col_idx_lag2) && (rd_row_idx - 2) > 0){
        mat[i,rd_col_idx_lag2] = uadata[rd_row_idx - 2,rd_col_idx_lag2]
      }
      if(length(rd_col_idx_lag)){
        mat[i,rd_col_idx_lag] = uadata[rd_row_idx - 1,rd_col_idx_lag]
      }
      if( length(rd_col_idx)){
        mat[i,rd_col_idx] = uadata[rd_row_idx,rd_col_idx]
      }
    }
    colnames(mat)=colnames(uadata)
    mat
}

fun2 <- function(zdates, rddata, uadata){

	fdate <- function(x, format="%Y%m%d"){
		DF <- data.frame(x)
		for(i in colnames(DF)){
			DF[, i] <- as.Date(DF[, i], format=format)
			class(DF[, i]) <- "Date"
		}
		DF
	}

	rddata <- fdate(rddata)
	wd1 <- seq(from = as.Date(zdates[1]), to = Sys.Date(), by = "day")
	nwd1 <- length(wd1)

	fin1 <- matrix(NA, nrow=length(wd1), ncol=ncol(uadata))
	nr <- nrow(rddata)
	xstart <- c(integer(nr), nwd1)
	for(j in 1:ncol(uadata)){
		x <- xstart
		for(i in 1:nr)
			x[i] <- if(!is.na(rddata[i, j]) & !is.nan(rddata[i, j]))
					which(wd1 == rddata[i, j])
					else NA
		ix <- which(!is.na(x))
		for(i in seq_len(length(ix) - 1)){
			from <- x[ ix[i] ] + 1
			to   <- x[ ix[i + 1] ]
			fin1[ from:to, j ] <- uadata[ ix[i], j ]
		}
	}
	colnames(fin1) <- colnames(uadata)
	rownames(fin1) <- as.character(wd1)
	fin1
}

t1 <- system.time(m1 <- fun1(z.dates, rd1, ua))
t2 <- system.time(m2 <- fun2(z.dates, rd1, ua))

all.equal(m1, m2)
[1] TRUE

rbind(t1, t2)
   user.self sys.self elapsed user.child sys.child
t1      1.50        0    1.50         NA        NA
t2      0.02        0    0.01         NA        NA

And the better news is that I believe it scales up without degrading
performance,
like my first did.

See if it works.

Rui Barradas

--
View this message in context: http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4457290.html
Sent from the R help mailing list archive at Nabble.com.
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120309/a6669914/attachment.pl>