quantreg speed

You could try method = "pfn".
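
For reference, the quantreg documentation spells the large-sample option method = "pfn" (Frisch-Newton interior point fitting after preprocessing), which is presumably the method meant here. A minimal sketch on simulated data follows; the data frame sim, its columns, and the sample size are made up for illustration and are not the poster's data.

```r
library(quantreg)

set.seed(1)
n <- 1e5  # hypothetical size, far smaller than the 18 million rows discussed
sim <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))
# Skewed noise so that the upper quantiles differ from the centre.
sim$output <- with(sim, 80 - f1 - 25 * f2 - 5 * f3 + rexp(n, rate = 0.1))

# Same model and tau, fitted with plain Frisch-Newton and with the
# preprocessing variant aimed at large n.
fit_fn  <- rq(output ~ f1 * f2 * f3, tau = 0.9, data = sim, method = "fn")
fit_pfn <- rq(output ~ f1 * f2 * f3, tau = 0.9, data = sim, method = "pfn")

# Side-by-side coefficients from the two fitting methods.
cbind(fn = coef(fit_fn), pfn = coef(fit_pfn))
```

Both calls solve the same problem, so the coefficients should agree closely; any speed advantage of "pfn" is only expected to show up at much larger n than this.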

Sent from my iPhone
On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:

Hi William,

Thank you very much for your reply.

I subsampled the data to reduce the number of records to ~1.8 million. It
seems to work fine except for the 99th percentile (the p-values for all the
features are 1.0). Does this mean I'm subsampling too much? How should I
interpret the result?

tau: [1] 0.25

Coefficients:
            Value     Std. Error t value    Pr(>|t|)
(Intercept)  72.15700    0.03651 1976.10513  0.00000
f1           -0.51000    0.04906  -10.39508  0.00000
f2          -20.44200    0.03933 -519.78766  0.00000
f3           -2.37000    0.04871  -48.65117  0.00000
f1:f2        -2.52500    0.05315  -47.50361  0.00000
f1:f3         1.03600    0.06573   15.76193  0.00000
f2:f3         3.41300    0.05247   65.05075  0.00000
f1:f2:f3     -0.83800    0.07120  -11.77002  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.5

Coefficients:
            Value     Std. Error t value    Pr(>|t|)
(Intercept)  83.80900    0.05626 1489.61222  0.00000
f1           -0.92200    0.07528  -12.24692  0.00000
f2          -27.90700    0.05937 -470.07189  0.00000
f3           -6.45000    0.07204  -89.53909  0.00000
f1:f2        -2.66500    0.07933  -33.59275  0.00000
f1:f3         1.99000    0.09869   20.16440  0.00000
f2:f3         7.09600    0.07611   93.23813  0.00000
f1:f2:f3     -1.71200    0.10390  -16.47660  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.75

Coefficients:
            Value     Std. Error t value    Pr(>|t|)
(Intercept) 102.71700    0.10175 1009.45946  0.00000
f1           -1.59300    0.13241  -12.03125  0.00000
f2          -40.64200    0.10623 -382.58456  0.00000
f3          -14.40900    0.12096 -119.11988  0.00000
f1:f2        -2.97600    0.13867  -21.46071  0.00000
f1:f3         3.74600    0.16335   22.93165  0.00000
f2:f3        14.14800    0.12692  111.47217  0.00000
f1:f2:f3     -3.16400    0.17159  -18.43899  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.9

Coefficients:
            Value     Std. Error t value    Pr(>|t|)
(Intercept) 130.89400    0.20609  635.12464  0.00000
f1           -2.55500    0.28139   -9.07995  0.00000
f2          -60.90500    0.21322 -285.64558  0.00000
f3          -29.42300    0.23409 -125.69092  0.00000
f1:f2        -2.77700    0.29052   -9.55870  0.00000
f1:f3         7.89700    0.33308   23.70870  0.00000
f2:f3        27.78100    0.24338  114.14722  0.00000
f1:f2:f3     -6.95800    0.34491  -20.17327  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.95

Coefficients:
            Value     Std. Error t value    Pr(>|t|)
(Intercept) 157.45900    0.42733  368.47413  0.00000
f1           -4.10200    0.55834   -7.34678  0.00000
f2          -81.24000    0.44012 -184.58697  0.00000
f3          -46.17500    0.46235  -99.87033  0.00000
f1:f2        -2.01700    0.57651   -3.49866  0.00047
f1:f3        15.67000    0.67409   23.24600  0.00000
f2:f3        43.00100    0.47973   89.63500  0.00000
f1:f2:f3    -14.05100    0.69737  -20.14843  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.99

Coefficients:
            Value         Std. Error   t value      Pr(>|t|)
(Intercept)  2.544860e+02 3.878303e+07 1.000000e-05 9.999900e-01
f1          -1.420000e+01 5.917548e+11 0.000000e+00 1.000000e+00
f2          -1.582920e+02 3.450261e+07 0.000000e+00 1.000000e+00
f3          -1.139210e+02 4.763057e+07 0.000000e+00 1.000000e+00
f1:f2        5.725000e+00 1.324283e+12 0.000000e+00 1.000000e+00
f1:f3        6.811780e+02 1.153645e+13 0.000000e+00 1.000000e+00
f2:f3        1.042510e+02 2.299953e+24 0.000000e+00 1.000000e+00
f1:f2:f3    -6.763210e+02 2.299953e+24 0.000000e+00 1.000000e+00

Warning message:
In summary.rq(xi, ...) : 288000 non-positive fis

On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:

You can time it yourself on increasingly large subsets of your data. E.g.,

library(quantreg)
dat <- data.frame(x1 = rnorm(1e6), x2 = rnorm(1e6),
                  x3 = sample(c("A", "B", "C"), size = 1e6, replace = TRUE))
dat$y <- with(dat, x1 + 2 * (x3 == "B") * x2 + rnorm(1e6))
# Time rq() on the first n rows for n = 4^3, ..., 4^10.
# Each column of t holds one system.time() result.
t <- vapply(n <- 4^(3:10), FUN = function(n) {
    d <- dat[seq_len(n), ]
    print(system.time(rq(data = d, y ~ x1 + x2 * x3, tau = 0.9)))
}, FUN.VALUE = numeric(5))
  user  system elapsed
     0       0       0
  user  system elapsed
     0       0       0
  user  system elapsed
  0.02    0.00    0.01
  user  system elapsed
  0.01    0.00    0.02
  user  system elapsed
  0.10    0.00    0.11
  user  system elapsed
  1.09    0.00    1.10
  user  system elapsed
 13.05    0.02   13.07
  user  system elapsed
273.30    0.11  273.74
t
          [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
user.child   NA   NA   NA   NA   NA   NA    NA     NA
sys.child    NA   NA   NA   NA   NA   NA    NA     NA

Do some regressions on t["elapsed",] as a function of n and predict up to
n=10^7.  E.g.,
summary(lm(t["elapsed",] ~ poly(n,4)))
Call:
lm(formula = t["elapsed", ] ~ poly(n, 4))

Residuals:
         1          2          3          4          5          6          7          8
-2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)
(Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.003565 on 3 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:      1
F-statistic: 1.273e+09 on 4 and 3 DF,  p-value: 3.575e-14

It does not look good for n=10^7.
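
The prediction step can be sketched as follows. This is a rough illustration, not the fit shown above: it swaps the degree-4 polynomial for a power-law (log-log) model, which tends to extrapolate more sensibly, and it drops the runs too short to time reliably. The timings are copied from the table above; the 0.1-second cutoff and the object names are assumptions made for the example.

```r
# Elapsed seconds reported above for n = 4^3, ..., 4^10.
n <- 4^(3:10)
elapsed <- c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)

# Keep only runs long enough to be measured reliably (cutoff is a judgment call).
timing <- data.frame(n = n, elapsed = elapsed)[elapsed >= 0.1, ]

# Power-law model: elapsed ~ c * n^k, so log(elapsed) is linear in log(n).
fit <- lm(log(elapsed) ~ log(n), data = timing)
k <- unname(coef(fit)[2])  # estimated growth exponent

# Extrapolated elapsed time, in seconds, at the sizes discussed in the thread.
sizes <- c(1.8e6, 1e7, 1.8e7)
pred <- exp(predict(fit, newdata = data.frame(n = sizes)))
data.frame(n = sizes, hours = pred / 3600)
```

An exponent k above 1 means the run time grows faster than linearly in n. The extrapolation rests on only four timings from one machine with the default fitting method, so treat it as an order-of-magnitude guide.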

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:

Hi all,

I'm using quantreg rq() to perform quantile regression on a large data
set. Each record has 4 fields and there are about 18 million records in
total. I wonder if anyone has tried rq() on a large data set and how long
I should expect it to take to finish, or whether it is simply too large
and I should subsample the data. I would like to have an idea before I
start the run and wait forever.

In addition, I would appreciate it if anyone could give me an idea of
approximately how long rq() takes to run for a given data set size.

Yunqi

       [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi Roger,

Thank you for your reply. To my understanding, changing the regression method only speeds up the computation; it does not necessarily solve the problem at the 99th percentile, where the p-values for all the factors are 1.0. I wonder how I should interpret the result for the 99th percentile, given that the results for the other percentiles look fine.

Correct me if I'm wrong.

Thank you!

Yunqi
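
One way to probe the 99th-percentile result is to ask for a different standard error estimator. As far as the quantreg documentation describes it, summary.rq() defaults to se = "nid" for samples this large, and that estimator takes difference quotients of fits at nearby quantiles; the "non-positive fis" warning appears to come from that step, so the huge standard errors may say more about the estimator at such an extreme tau than about the coefficients. A minimal sketch using bootstrap standard errors instead, on simulated data (the data frame sim, its columns, and the sizes are hypothetical):

```r
library(quantreg)

set.seed(1)
n <- 5000  # hypothetical size, kept small so the bootstrap runs quickly
sim <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
# Skewed noise gives a long, sparse upper tail.
sim$output <- with(sim, 80 - f1 - 25 * f2 + rexp(n, rate = 0.1))

fit99 <- rq(output ~ f1 * f2, tau = 0.99, data = sim)

# se = "boot" resamples the data rather than estimating a local density;
# R is the number of bootstrap replications passed on to boot.rq().
boot99 <- summary(fit99, se = "boot", R = 200)
boot99$coefficients
```

If the bootstrap standard errors come out finite and of plausible size, the p-values of 1.0 are more likely an artifact of the default estimator than a sign that the subsample is too small. Bootstrapping 1.8 million rows would be far slower than this toy run.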
