My comments have nothing to do with speed of your code,
but with the correctness.
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
...
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
Up to here is looks like X and Y are used to indicate which columns
of dat are the predictor and response, respectively.
..
X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
Now X is used to mean the slope of the regression.
# Ch 9.4: Intercept est. for Thiel statistic
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
Now the predictor column must be named "X" and the response "Y",
no matter what your input X and Y were. Hence the function will fail
to give the correct answer in most cases.
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From:
Of Berend Hasselman
Sent: Sunday, October 21, 2012 11:59 AM
To: Brad Schneid
Cc:
Subject: Re: [R] help speeding up simple Theil regression function
On 21-10-2012, at 20:06, Brad Schneid wrote:
Hello,
I am working on a simple non-parametric (Theil) regression function and
am following Hollander and Wolfe 1999 text. I would like some help
my function faster. I have compared with pre-packaged version from
which isnt very fast either, but it appears mine is faster with N =
(see results below). I plan on running this function repeatedly, and I
generally have data lengths of ~ N = 6000 or more.
# My function following Hollander and Wolfe text, Chapter 9
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
num <- vector("list", length=length(i.s))
dom <- vector("list", length=length(i.s))
for(i in 1:length(i.s)){
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
}
X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
# Ch 9.4: Intercept est. for Thiel statistic
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
################################################################
library("mblm") # I will compare to mblm() function
X <- rnorm(1000)
Y <- rnorm(1000)
dat <- data.frame(X, Y)
system.time(np.lm(dat, X=1, Y=2) )
user system elapsed
118.610 0.130 119.144
109.000 0.040 109.416 # ran it twice
86.190 0.100 86.589 # 3rd time
Alternative function without your i loop (it isn't needed and can be
vectorized):
np.lm.alt <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic ==> (Pedantic comment: it
is Theil (swap
the i and e)
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
Y.num <- dat[j.s,Y] - dat[i.s,Y]
X.dom <- dat[j.s,X] - dat[i.s,X]
X <- median( Y.num / X.dom)
# Ch 9.4: Intercept est. for Thiel statistic ==> (Pedantic comment: it
is Theil (swap
the i and e)
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
Try the compiler package on you original function:
library(compiler)
np.lm.c <- cmpfun(np.lm)
Test speed and correct results:
X <- rnorm(500)
Y <- rnorm(500)
dat <- data.frame(X, Y)
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
identical(npout.1,npout.c)
identical(npout.1,npout.a)
Results:
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
user system elapsed
21.442 0.066 21.517
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
user system elapsed
21.068 0.073 21.161
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
user system elapsed
0.303 0.010 0.313
identical(npout.1,npout.c)
identical(npout.1,npout.a)
[1] TRUE
You may try and test this with larger data lengths.
Berend