Back to formatted view
Raw Message

Message-ID: <012821F286ED1D4ABDC72F9E1DD63D0C041EA974@NLDNC003PEX1.ubsgs.ubsgroup.net>
Date: 2004-09-13T17:58:17Z
From: john.gavin@ubs.com
Subject: How can I do this better? (Filling in last traded price for NA)

Hi Ajay,

You will probably get other suggestions
along the following lines,
which use 'rle' and 'rep' to speed things up.

fillIn2 <- function(x)
{ bef <- x # keep a copy for display purposes only.
  xRle <- rle(is.na(x))
  # get indices where each NA seq starts (low) and stops (upp)
  upp <- (sumX <- cumsum(xRle$lengths))[xRle$values]
  low <- sumX[which(xRle$values)-1]+1
  # special case: NA at start _only_ i.e. c(NA, ..., NA, notNa, ..., notNA)
  if (length(low) == 0) return(cbind(before = x , after = x))
  # special case: NA at start and else where 
  if (length(upp) == length(low)+1) upp <- upp[-1]
  # Critical bit is 'rep' on RHS. 
  # On LHS, dont replace NAs at the start, if any.
  ind <- low[1]-1
  x[ind + which(is.na(x[-seq(ind)]))] <- x[rep(low-1, upp-low+1)]
  cbind(before = bef , after = x) # show off before and after effect
}
set.seed(123)
x <- 1:10
x[sample(length(x), floor(length(x)/2))] <- NA
fillIn2(x)

should produce

> fillIn2(x)
      before after
 [1,]      1     1
 [2,]      2     2
 [3,]     NA     2
 [4,]     NA     2
 [5,]      5     5
 [6,]     NA     5
 [7,]     NA     5
 [8,]     NA     5
 [9,]      9     9
[10,]     10    10

The code seems clunky and has special cases
so it is probably not optimal.

However, it is faster than, say, using 'mapply'

fillIn <- function(x)
{ bef <- x
  xRle <- rle(is.na(x))
  upp <- cumsum(xRle$lengths)[xRle$values]
  low <- cumsum(xRle$lengths)[which(xRle$values)-1]+1
  if (length(upp) == length(low)+1) upp <- upp[-1]
  mapply(function(l, u) x[l:u] <<- x[l-1], low, upp)
  cbind(before = bef , after = x) # show off before and after effect
}
fillIn(x)

Some simulations to compare times,
based on vectors of varying lengths with 50% of elements set to NA

simFillIn <- function(n, method = c("rep", "mapply"))
{ aa <- rpois(n, 5)
  aa[sample(seq(n), floor(n * .5))] <- NA
  method = match.arg(method)
  ansTime <- system.time(ans <- 
    switch(method,
      mapply = fillIn(aa),
      rep = fillIn2(aa), 
      stop("wrong method")
  )) # switch system.time
  list(time = ansTime) # ans = ans, 
}
ans <- lapply(c(2e4, 1e4, 1e3, 1e2, 1e1), simFillIn, method = "mapply")
lapply(ans, "[[", "time")
ans <- lapply(c(2e4, 1e4, 1e3, 1e2, 1e1), simFillIn, method = "rep")
lapply(ans, "[[", "time")

simFillIn (with 'mapply') seems at least 10 times slower
than simFillIn2 (with 'rep').

Regards,

John.

John Gavin <john.gavin at ubs.com>,
Quantitative Risk Models and Statistics,
UBS Investment Bank, 6th floor, 
100 Liverpool St., London EC2M 2RH, UK.
Phone +44 (0) 207 567 4289
Fax   +44 (0) 207 568 5352


Ajay Shah wrote:

>I have 3 different daily time-series. Using union() in the "its"
>package, I can make a long matrix, where rows are created when even
>one of the three time-series is observed:
>
>massive <- union(nifty.its, union(inrusd.its, infosys.its))
>
>Now in this, I want to replace NA values for prices by the
>most-recently observed price. I can do this painfully --
>
>for (i in 2:nrow(massive)) {
>  for (j in 1:3) {
>    if (is.na(massive[i,j])) {
>      massive[i,j] = massive[i-1,j]
>    }
>  }
>}
>
>But this is horribly slow. Is there a more clever way?

Visit our website at http://www.ubs.com

This message contains confidential information and is intend...{{dropped}}