Hello,
I am working on a simple non-parametric (Theil) regression function and and
am following Hollander and Wolfe 1999 text. I would like some help making
my function faster. I have compared with pre-packaged version from "MBLM",
which isnt very fast either, but it appears mine is faster with N = 1000
(see results below). I plan on running this function repeatedly, and I
generally have data lengths of ~ N = 6000 or more.
# My function following Hollander and Wolfe text, Chapter 9
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
num <- vector("list", length=length(i.s))
dom <- vector("list", length=length(i.s))
for(i in 1:length(i.s)){
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
}
X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
# Ch 9.4: Intercept est. for Thiel statistic
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
################################################################
library("mblm") # I will compare to mblm() function
X <- rnorm(1000)
Y <- rnorm(1000)
dat <- data.frame(X, Y)
system.time(np.lm(dat, X=1, Y=2) )
user system elapsed
118.610 0.130 119.144
109.000 0.040 109.416 # ran it twice
86.190 0.100 86.589 # 3rd time
system.time( mblm(Y~X, dat, repeated=F) )
user system elapsed
1509.200 87.670 1602.987
# not waiting on that to run again
OK, mine appears to be way faster... oddly, mblm() seemed faster with
smaller (n=100) datasets
Can someone please tell me how they would improve the np.lm() function to
make it quicker, or faster with some other tricks (parallel?.. Ive never
done that.).
Thank you ahead of time for any help.
Brad
Ubuntu 10.04
Intel i5 CPU 4*(650 @ 3.20GHz)
11.6 GiB memory
R version 2.15.0 (2012-03-30)
--
View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923.html
Sent from the R help mailing list archive at Nabble.com.
help speeding up simple Theil regression function
6 messages · Brad Patrick Schneid, Berend Hasselman, William Dunlap
Obviously sort() is not needed in the following line: X <- median( sort( do.call(c, num) / do.call(c, dom) ) ) Brad Schneid wrote
Hello,
I am working on a simple non-parametric (Theil) regression function and
and am following Hollander and Wolfe 1999 text. I would like some help
making my function faster. I have compared with pre-packaged version from
"MBLM", which isnt very fast either, but it appears mine is faster with N
= 1000 (see results below). I plan on running this function repeatedly,
and I generally have data lengths of ~ N = 6000 or more.
# My function following Hollander and Wolfe text, Chapter 9
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
num <- vector("list", length=length(i.s))
dom <- vector("list", length=length(i.s))
for(i in 1:length(i.s)){
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
}
X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
# Ch 9.4: Intercept est. for Thiel statistic
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
################################################################
library("mblm") # I will compare to mblm() function
X <- rnorm(1000)
Y <- rnorm(1000)
dat <- data.frame(X, Y)
system.time(np.lm(dat, X=1, Y=2) )
user system elapsed
118.610 0.130 119.144
109.000 0.040 109.416 # ran it twice
86.190 0.100 86.589 # 3rd time
system.time( mblm(Y~X, dat, repeated=F) )
user system elapsed
1509.200 87.670 1602.987
# not waiting on that to run again
OK, mine appears to be way faster... oddly, mblm() seemed faster with
smaller (n=100) datasets
Can someone please tell me how they would improve the np.lm() function to
make it quicker, or faster with some other tricks (parallel?.. Ive never
done that.).
Thank you ahead of time for any help.
Brad
Ubuntu 10.04
Intel i5 CPU 4*(650 @ 3.20GHz)
11.6 GiB memory
R version 2.15.0 (2012-03-30)
-- View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923p4646926.html Sent from the R help mailing list archive at Nabble.com.
On 21-10-2012, at 20:06, Brad Schneid wrote:
Hello,
I am working on a simple non-parametric (Theil) regression function and and
am following Hollander and Wolfe 1999 text. I would like some help making
my function faster. I have compared with pre-packaged version from "MBLM",
which isnt very fast either, but it appears mine is faster with N = 1000
(see results below). I plan on running this function repeatedly, and I
generally have data lengths of ~ N = 6000 or more.
# My function following Hollander and Wolfe text, Chapter 9
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
num <- vector("list", length=length(i.s))
dom <- vector("list", length=length(i.s))
for(i in 1:length(i.s)){
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
}
X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
# Ch 9.4: Intercept est. for Thiel statistic
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
################################################################
library("mblm") # I will compare to mblm() function
X <- rnorm(1000)
Y <- rnorm(1000)
dat <- data.frame(X, Y)
system.time(np.lm(dat, X=1, Y=2) )
user system elapsed
118.610 0.130 119.144
109.000 0.040 109.416 # ran it twice
86.190 0.100 86.589 # 3rd time
Alternative function without your i loop (it isn't needed and can be vectorized):
np.lm.alt <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic ==> (Pedantic comment: it is Theil (swap the i and e)
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
Y.num <- dat[j.s,Y] - dat[i.s,Y]
X.dom <- dat[j.s,X] - dat[i.s,X]
X <- median( Y.num / X.dom)
# Ch 9.4: Intercept est. for Thiel statistic ==> (Pedantic comment: it is Theil (swap the i and e)
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
Try the compiler package on you original function:
library(compiler)
np.lm.c <- cmpfun(np.lm)
Test speed and correct results:
X <- rnorm(500)
Y <- rnorm(500)
dat <- data.frame(X, Y)
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
identical(npout.1,npout.c)
identical(npout.1,npout.a)
Results:
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
user system elapsed 21.442 0.066 21.517
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
user system elapsed 21.068 0.073 21.161
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
user system elapsed 0.303 0.010 0.313
identical(npout.1,npout.c)
[1] TRUE
identical(npout.1,npout.a)
[1] TRUE You may try and test this with larger data lengths. Berend
Wow. Thank you greatly, that is amazing. Thiel statistic ==> (Pedantic comment: it is Theil (swap the i and e) Yes sir; I do that every time. Dyslexia perhaps? Thanks again. Berend Hasselman wrote
On 21-10-2012, at 20:06, Brad Schneid wrote:
Hello,
I am working on a simple non-parametric (Theil) regression function and
and
am following Hollander and Wolfe 1999 text. I would like some help
making
my function faster. I have compared with pre-packaged version from
"MBLM",
which isnt very fast either, but it appears mine is faster with N = 1000
(see results below). I plan on running this function repeatedly, and I
generally have data lengths of ~ N = 6000 or more.
# My function following Hollander and Wolfe text, Chapter 9
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
num <- vector("list", length=length(i.s))
dom <- vector("list", length=length(i.s))
for(i in 1:length(i.s)){
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
}
X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
# Ch 9.4: Intercept est. for Thiel statistic
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
################################################################
library("mblm") # I will compare to mblm() function
X <- rnorm(1000)
Y <- rnorm(1000)
dat <- data.frame(X, Y)
system.time(np.lm(dat, X=1, Y=2) )
user system elapsed
118.610 0.130 119.144
109.000 0.040 109.416 # ran it twice
86.190 0.100 86.589 # 3rd time
Alternative function without your i loop (it isn't needed and can be
vectorized):
np.lm.alt <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic ==> (Pedantic comment: it is
Theil (swap the i and e)
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
Y.num <- dat[j.s,Y] - dat[i.s,Y]
X.dom <- dat[j.s,X] - dat[i.s,X]
X <- median( Y.num / X.dom)
# Ch 9.4: Intercept est. for Thiel statistic ==> (Pedantic comment: it is
Theil (swap the i and e)
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
Try the compiler package on you original function:
library(compiler)
np.lm.c <- cmpfun(np.lm)
Test speed and correct results:
X <- rnorm(500)
Y <- rnorm(500)
dat <- data.frame(X, Y)
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
identical(npout.1,npout.c)
identical(npout.1,npout.a)
Results:
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
user system elapsed 21.442 0.066 21.517
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
user system elapsed 21.068 0.073 21.161
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
user system elapsed 0.303 0.010 0.313
identical(npout.1,npout.c)
[1] TRUE
identical(npout.1,npout.a)
[1] TRUE You may try and test this with larger data lengths. Berend
______________________________________________
R-help@
mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923p4646933.html Sent from the R help mailing list archive at Nabble.com.
My comments have nothing to do with speed of your code, but with the correctness.
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
...
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
Up to here is looks like X and Y are used to indicate which columns of dat are the predictor and response, respectively.
.. X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
Now X is used to mean the slope of the regression.
# Ch 9.4: Intercept est. for Thiel statistic Intercept <- median(dat[,"Y"] - X*dat[,"X"])
Now the predictor column must be named "X" and the response "Y", no matter what your input X and Y were. Hence the function will fail to give the correct answer in most cases. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Berend Hasselman Sent: Sunday, October 21, 2012 11:59 AM To: Brad Schneid Cc: r-help at r-project.org Subject: Re: [R] help speeding up simple Theil regression function On 21-10-2012, at 20:06, Brad Schneid wrote:
Hello,
I am working on a simple non-parametric (Theil) regression function and and
am following Hollander and Wolfe 1999 text. I would like some help making
my function faster. I have compared with pre-packaged version from "MBLM",
which isnt very fast either, but it appears mine is faster with N = 1000
(see results below). I plan on running this function repeatedly, and I
generally have data lengths of ~ N = 6000 or more.
# My function following Hollander and Wolfe text, Chapter 9
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
num <- vector("list", length=length(i.s))
dom <- vector("list", length=length(i.s))
for(i in 1:length(i.s)){
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
}
X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
# Ch 9.4: Intercept est. for Thiel statistic
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
################################################################
library("mblm") # I will compare to mblm() function
X <- rnorm(1000)
Y <- rnorm(1000)
dat <- data.frame(X, Y)
system.time(np.lm(dat, X=1, Y=2) )
user system elapsed
118.610 0.130 119.144
109.000 0.040 109.416 # ran it twice
86.190 0.100 86.589 # 3rd time
Alternative function without your i loop (it isn't needed and can be vectorized):
np.lm.alt <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic ==> (Pedantic comment: it is Theil (swap
the i and e)
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
Y.num <- dat[j.s,Y] - dat[i.s,Y]
X.dom <- dat[j.s,X] - dat[i.s,X]
X <- median( Y.num / X.dom)
# Ch 9.4: Intercept est. for Thiel statistic ==> (Pedantic comment: it is Theil (swap
the i and e)
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
Try the compiler package on you original function:
library(compiler)
np.lm.c <- cmpfun(np.lm)
Test speed and correct results:
X <- rnorm(500)
Y <- rnorm(500)
dat <- data.frame(X, Y)
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
identical(npout.1,npout.c)
identical(npout.1,npout.a)
Results:
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
user system elapsed 21.442 0.066 21.517
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
user system elapsed 21.068 0.073 21.161
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
user system elapsed 0.303 0.010 0.313
identical(npout.1,npout.c)
[1] TRUE
identical(npout.1,npout.a)
[1] TRUE You may try and test this with larger data lengths. Berend
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
I understand, thank you. I think I wanted the output to look similar to that from mblm for quick comparison; that is obviously a problem. William Dunlap wrote
My comments have nothing to do with speed of your code, but with the correctness.
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
...
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
Up to here is looks like X and Y are used to indicate which columns of dat are the predictor and response, respectively.
.. X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
Now X is used to mean the slope of the regression.
# Ch 9.4: Intercept est. for Thiel statistic Intercept <- median(dat[,"Y"] - X*dat[,"X"])
Now the predictor column must be named "X" and the response "Y", no matter what your input X and Y were. Hence the function will fail to give the correct answer in most cases. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
-----Original Message----- From:
r-help-bounces@
[mailto:
r-help-bounces@
] On Behalf
Of Berend Hasselman Sent: Sunday, October 21, 2012 11:59 AM To: Brad Schneid Cc:
r-help@
Subject: Re: [R] help speeding up simple Theil regression function On 21-10-2012, at 20:06, Brad Schneid wrote:
Hello, I am working on a simple non-parametric (Theil) regression function and
and
am following Hollander and Wolfe 1999 text. I would like some help
making
my function faster. I have compared with pre-packaged version from
"MBLM",
which isnt very fast either, but it appears mine is faster with N =
1000
(see results below). I plan on running this function repeatedly, and I
generally have data lengths of ~ N = 6000 or more.
# My function following Hollander and Wolfe text, Chapter 9
np.lm <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
num <- vector("list", length=length(i.s))
dom <- vector("list", length=length(i.s))
for(i in 1:length(i.s)){
num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y]
dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X]
}
X <- median( sort( do.call(c, num) / do.call(c, dom) ) )
# Ch 9.4: Intercept est. for Thiel statistic
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
################################################################
library("mblm") # I will compare to mblm() function
X <- rnorm(1000)
Y <- rnorm(1000)
dat <- data.frame(X, Y)
system.time(np.lm(dat, X=1, Y=2) )
user system elapsed
118.610 0.130 119.144
109.000 0.040 109.416 # ran it twice
86.190 0.100 86.589 # 3rd time
Alternative function without your i loop (it isn't needed and can be
vectorized):
np.lm.alt <-function(dat, X, Y, ...){
# Ch 9.2: Slope est. (X) for Thiel statistic ==> (Pedantic comment: it
is Theil (swap
the i and e)
combos <- combn(nrow(dat), 2)
i.s <- combos[1,]
j.s <- combos[2,]
Y.num <- dat[j.s,Y] - dat[i.s,Y]
X.dom <- dat[j.s,X] - dat[i.s,X]
X <- median( Y.num / X.dom)
# Ch 9.4: Intercept est. for Thiel statistic ==> (Pedantic comment: it
is Theil (swap
the i and e)
Intercept <- median(dat[,"Y"] - X*dat[,"X"])
out <- data.frame(Intercept, X)
return(out)
} # usage: np.lm(dat, X=1, Y=2)
Try the compiler package on you original function:
library(compiler)
np.lm.c <- cmpfun(np.lm)
Test speed and correct results:
X <- rnorm(500)
Y <- rnorm(500)
dat <- data.frame(X, Y)
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
identical(npout.1,npout.c)
identical(npout.1,npout.a)
Results:
system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )
user system elapsed 21.442 0.066 21.517
system.time(npout.1 <- np.lm(dat, X=1, Y=2) )
user system elapsed 21.068 0.073 21.161
system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )
user system elapsed 0.303 0.010 0.313
identical(npout.1,npout.c)
[1] TRUE
identical(npout.1,npout.a)
[1] TRUE You may try and test this with larger data lengths. Berend
______________________________________________
R-help@
mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help@
mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923p4646935.html Sent from the R help mailing list archive at Nabble.com.