Skip to content

help speeding up simple Theil regression function

6 messages · Brad Patrick Schneid, Berend Hasselman, William Dunlap

#
Hello,

I am working on a simple non-parametric (Theil) regression function and and
am following Hollander and Wolfe 1999 text.  I would like some help making
my function faster.  I have compared with pre-packaged version from "MBLM",
which isnt very fast either, but it appears mine is faster with N = 1000
(see results below).  I plan on running this function repeatedly, and I
generally have data lengths of ~ N = 6000 or more.  

# My function following Hollander and Wolfe text, Chapter 9
np.lm <-function(dat, X, Y, ...){
	# Ch 9.2: Slope est. (X) for Thiel statistic
	combos <- combn(nrow(dat), 2)
	i.s <- combos[1,] 
	j.s <- combos[2,] 
	num <- vector("list", length=length(i.s))
	dom <- vector("list", length=length(i.s))

		for(i in 1:length(i.s)){
			num[[i]]  <- dat[j.s[i],Y] - dat[i.s[i],Y]    
			dom[[i]]  <- dat[j.s[i],X] - dat[i.s[i],X]    
	        	 	}	
	
	X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
	# Ch 9.4: Intercept est. for Thiel statistic
	Intercept <- median(dat[,"Y"] - X*dat[,"X"])
	out <- data.frame(Intercept, X)
	return(out)
		}   # usage: np.lm(dat, X=1, Y=2)
################################################################

library("mblm") # I will compare to mblm() function

X <- rnorm(1000)
Y <- rnorm(1000)
dat <- data.frame(X, Y)

system.time(np.lm(dat, X=1, Y=2) )
   user  system elapsed 
118.610   0.130 119.144 
109.000   0.040 109.416 # ran it twice
 86.190   0.100  86.589 # 3rd time

system.time( mblm(Y~X, dat, repeated=F) )
    user   system  elapsed 
1509.200   87.670 1602.987 
# not waiting on that to run again

OK, mine appears to be way faster... oddly, mblm() seemed faster with
smaller (n=100) datasets

Can someone please tell me how they would improve the np.lm() function to
make it quicker, or faster with some other tricks (parallel?.. Ive never
done that.).

Thank you ahead of time for any help. 
Brad

Ubuntu 10.04
Intel i5 CPU 4*(650 @ 3.20GHz)
11.6 GiB memory
R version 2.15.0 (2012-03-30)




--
View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923.html
Sent from the R help mailing list archive at Nabble.com.
#
Obviously sort() is not needed in the following line:

 X <- median( sort( do.call(c, num) / do.call(c, dom) ) ) 







Brad Schneid wrote
--
View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923p4646926.html
Sent from the R help mailing list archive at Nabble.com.
#
On 21-10-2012, at 20:06, Brad Schneid wrote:

            
Alternative function without your i loop (it isn't needed and can be vectorized):

np.lm.alt <-function(dat, X, Y, ...){
	# Ch 9.2: Slope est. (X) for Thiel statistic ==> (Pedantic comment: it is Theil (swap the i and e)
	combos <- combn(nrow(dat), 2)
	i.s <- combos[1,] 
	j.s <- combos[2,] 
	
	Y.num <- dat[j.s,Y] - dat[i.s,Y]
	X.dom <- dat[j.s,X] - dat[i.s,X]
	X <- median( Y.num / X.dom)
	# Ch 9.4: Intercept est. for Thiel statistic ==> (Pedantic comment: it is Theil (swap the i and e)
	Intercept <- median(dat[,"Y"] - X*dat[,"X"])
	out <- data.frame(Intercept, X)
	return(out)
}   # usage: np.lm(dat, X=1, Y=2)


Try the compiler package on you original function:

library(compiler)
np.lm.c <- cmpfun(np.lm)

Test speed and correct results:

X <- rnorm(500)
Y <- rnorm(500)
dat <- data.frame(X, Y)

system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
identical(npout.1,npout.c)      
identical(npout.1,npout.a)

Results:
user  system elapsed 
 21.442   0.066  21.517
user  system elapsed 
 21.068   0.073  21.161
user  system elapsed 
  0.303   0.010   0.313
[1] TRUE
[1] TRUE

You may try and test this with larger data lengths.


Berend
#
Wow.  Thank you greatly, that is amazing.  

 Thiel statistic ==> (Pedantic comment: it is Theil (swap the i and e)
Yes sir; I do that every time. 
Dyslexia perhaps?

Thanks again. 



Berend Hasselman wrote

            
--
View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923p4646933.html
Sent from the R help mailing list archive at Nabble.com.
#
My comments have nothing to do with speed of your code,
but with the correctness.
Up to here is looks like X and Y are used to indicate which columns
of dat are the predictor and response, respectively.
Now X is used to mean the slope of the regression.
Now the predictor column must be named "X" and the response "Y",
no matter what your input X and Y were.  Hence the function will fail
to give the correct answer in most cases.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
I understand, thank you.   I think I wanted the output to look similar to
that from mblm for quick comparison; that is obviously a problem.  


William Dunlap wrote

            

            
--
View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923p4646935.html
Sent from the R help mailing list archive at Nabble.com.