You could try method = "pfn".
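
A minimal sketch of choosing the fitting algorithm through the method argument of rq(), assuming the preprocessing Frisch-Newton variant ("pfn") in quantreg is the one intended here. The simulated data and the names d, fit_pfn and fit_fn are made up for illustration; they are not from the thread.

```r
library(quantreg)  # provides rq()

set.seed(1)
n <- 20000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 + 2 * d$x2 + rnorm(n)

# "pfn" is Frisch-Newton with preprocessing, aimed at large n.
fit_pfn <- rq(y ~ x1 + x2, tau = 0.9, data = d, method = "pfn")

# Plain Frisch-Newton fit on the same data, for comparison.
fit_fn <- rq(y ~ x1 + x2, tau = 0.9, data = d, method = "fn")

# The two sets of coefficients should agree closely.
print(cbind(pfn = coef(fit_pfn), fn = coef(fit_fn)))
```

Timing both calls with system.time() on a subset of the real data would show whether the preprocessing pays off for this problem.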
On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
Hi William,
Thank you very much for your reply.
I subsampled the data down to ~1.8 million records. It seems to work fine
except at the 99th percentile, where the p-values for all the features are
1.0. Does this mean I'm subsampling too much? How should I interpret the
result?
tau: [1] 0.25
Coefficients:
                Value Std. Error    t value Pr(>|t|)
(Intercept)  72.15700    0.03651 1976.10513  0.00000
f1           -0.51000    0.04906  -10.39508  0.00000
f2          -20.44200    0.03933 -519.78766  0.00000
f3           -2.37000    0.04871  -48.65117  0.00000
f1:f2        -2.52500    0.05315  -47.50361  0.00000
f1:f3         1.03600    0.06573   15.76193  0.00000
f2:f3         3.41300    0.05247   65.05075  0.00000
f1:f2:f3     -0.83800    0.07120  -11.77002  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.5
Coefficients:
                Value Std. Error    t value Pr(>|t|)
(Intercept)  83.80900    0.05626 1489.61222  0.00000
f1           -0.92200    0.07528  -12.24692  0.00000
f2          -27.90700    0.05937 -470.07189  0.00000
f3           -6.45000    0.07204  -89.53909  0.00000
f1:f2        -2.66500    0.07933  -33.59275  0.00000
f1:f3         1.99000    0.09869   20.16440  0.00000
f2:f3         7.09600    0.07611   93.23813  0.00000
f1:f2:f3     -1.71200    0.10390  -16.47660  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.75
Coefficients:
                Value Std. Error    t value Pr(>|t|)
(Intercept) 102.71700    0.10175 1009.45946  0.00000
f1           -1.59300    0.13241  -12.03125  0.00000
f2          -40.64200    0.10623 -382.58456  0.00000
f3          -14.40900    0.12096 -119.11988  0.00000
f1:f2        -2.97600    0.13867  -21.46071  0.00000
f1:f3         3.74600    0.16335   22.93165  0.00000
f2:f3        14.14800    0.12692  111.47217  0.00000
f1:f2:f3     -3.16400    0.17159  -18.43899  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.9
Coefficients:
                Value Std. Error    t value Pr(>|t|)
(Intercept) 130.89400    0.20609  635.12464  0.00000
f1           -2.55500    0.28139   -9.07995  0.00000
f2          -60.90500    0.21322 -285.64558  0.00000
f3          -29.42300    0.23409 -125.69092  0.00000
f1:f2        -2.77700    0.29052   -9.55870  0.00000
f1:f3         7.89700    0.33308   23.70870  0.00000
f2:f3        27.78100    0.24338  114.14722  0.00000
f1:f2:f3     -6.95800    0.34491  -20.17327  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.95
Coefficients:
                Value Std. Error    t value Pr(>|t|)
(Intercept) 157.45900    0.42733  368.47413  0.00000
f1           -4.10200    0.55834   -7.34678  0.00000
f2          -81.24000    0.44012 -184.58697  0.00000
f3          -46.17500    0.46235  -99.87033  0.00000
f1:f2        -2.01700    0.57651   -3.49866  0.00047
f1:f3        15.67000    0.67409   23.24600  0.00000
f2:f3        43.00100    0.47973   89.63500  0.00000
f1:f2:f3    -14.05100    0.69737  -20.14843  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.99
Coefficients:
                    Value   Std. Error      t value     Pr(>|t|)
(Intercept)  2.544860e+02 3.878303e+07 1.000000e-05 9.999900e-01
f1          -1.420000e+01 5.917548e+11 0.000000e+00 1.000000e+00
f2          -1.582920e+02 3.450261e+07 0.000000e+00 1.000000e+00
f3          -1.139210e+02 4.763057e+07 0.000000e+00 1.000000e+00
f1:f2        5.725000e+00 1.324283e+12 0.000000e+00 1.000000e+00
f1:f3        6.811780e+02 1.153645e+13 0.000000e+00 1.000000e+00
f2:f3        1.042510e+02 2.299953e+24 0.000000e+00 1.000000e+00
f1:f2:f3    -6.763210e+02 2.299953e+24 0.000000e+00 1.000000e+00
Warning message:
In summary.rq(xi, ...) : 288000 non-positive fis
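
A hedged sketch of one way to look at the tau = 0.99 table. The warning appears to come from the standard-error calculation in summary.rq() rather than from the fit itself, so the coefficient values may still be usable even though the Std. Error column (and therefore the p-values) is not. Assuming that is the case, bootstrap standard errors via se = "boot" are one alternative to try. The data below are simulated stand-ins with binary f1 and f2, which is an assumption about the real features, and R = 200 bootstrap replications is an arbitrary choice.

```r
library(quantreg)  # provides rq() and summary.rq()

set.seed(42)
n <- 5000
d <- data.frame(f1 = rbinom(n, 1, 0.5), f2 = rbinom(n, 1, 0.5))

# Skewed response so that the upper tail is sparse, loosely like latency data.
d$output <- 100 - 20 * d$f2 + rexp(n, rate = 1 / 10)

fit <- rq(output ~ f1 * f2, tau = 0.99, data = d)

# Bootstrap standard errors avoid the local density estimates that
# trigger the "non-positive fis" warning.
s_boot <- summary(fit, se = "boot", R = 200)
print(s_boot)
```

Bootstrapping refits the model R times, so on 1.8 million rows it would be worth trying on a smaller subsample first.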
On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
You can time it yourself on increasingly large subsets of your data. E.g.,
library(quantreg)  # provides rq()
dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
                  x3=sample(c("A","B","C"), size=1e6, replace=TRUE))
dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
# Time rq() on the first n rows for n = 4^3, ..., 4^10; one column per n.
t <- vapply(n <- 4^(3:10), FUN=function(n) {
    d <- dat[seq_len(n), ]
    print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))
  }, FUN.VALUE=numeric(5))
user system elapsed
0 0 0
user system elapsed
0 0 0
user system elapsed
0.02 0.00 0.01
user system elapsed
0.01 0.00 0.02
user system elapsed
0.10 0.00 0.11
user system elapsed
1.09 0.00 1.10
user system elapsed
13.05 0.02 13.07
user system elapsed
273.30 0.11 273.74
           [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
user.child   NA   NA   NA   NA   NA   NA    NA     NA
sys.child    NA   NA   NA   NA   NA   NA    NA     NA
Do some regressions on t["elapsed",] as a function of n and predict up to
n=10^7. E.g.,
summary(lm(t["elapsed",] ~ poly(n,4)))
Call:
lm(formula = t["elapsed", ] ~ poly(n, 4))
Residuals:
         1          2          3          4          5          6          7          8
-2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.003565 on 3 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.273e+09 on 4 and 3 DF, p-value: 3.575e-14
It does not look good for n=10^7.
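
As a rough illustration of the extrapolation, here is a sketch that swaps the poly(n, 4) fit for a straight line on the log-log scale. That power-law form (elapsed roughly proportional to n^k) is an assumption, not something established in this thread, and the timings are simply copied from the output above, so the prediction is only an order-of-magnitude guide.

```r
# Timings copied from the output above (elapsed seconds for n = 4^3, ..., 4^10).
n       <- 4^(3:10)
elapsed <- c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)

# Zero timings are below the clock resolution and cannot be logged, so drop them.
ok <- elapsed > 0

# Assumption: elapsed ~ c * n^k, i.e. a straight line on the log-log scale.
fit   <- lm(log(elapsed[ok]) ~ log(n[ok]))
slope <- unname(coef(fit)[2])

# Extrapolate to n = 10^7 and convert back to seconds.
pred_1e7 <- unname(exp(coef(fit)[1] + slope * log(1e7)))

cat("estimated exponent k:", round(slope, 2), "\n")
cat("predicted elapsed seconds at n = 1e7:", round(pred_1e7), "\n")
```

The last two timings grow faster than the fitted line suggests, so the true cost at n = 10^7 could be well above this estimate.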
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
Hi all,
I'm using quantreg rq() to perform quantile regression on a large data
set. Each record has 4 fields and there are about 18 million records in
total. Has anyone tried rq() on a data set this large, and how long should
I expect it to take? Or is it simply too large, so that I should subsample
the data? I would like to have an idea before I start a run and wait
forever. In addition, I would appreciate a rough idea of how long rq()
takes to run for a given data set size.
Yunqi
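
For the subsampling option, a minimal sketch in base R. The data frame data_full is a hypothetical stand-in for the real data (its size, column values and the 10% fraction are all assumptions made for illustration); only the column names follow the model formula shown in this thread.

```r
set.seed(2014)  # make the subsample reproducible

# Hypothetical stand-in for the full data set (the real one has ~18 million rows).
n_full <- 1e5
data_full <- data.frame(
  output = rexp(n_full, rate = 1 / 80),
  f1 = rbinom(n_full, 1, 0.5),
  f2 = rbinom(n_full, 1, 0.5),
  f3 = rbinom(n_full, 1, 0.5)
)

# Keep a simple random 10% of the rows, sampled without replacement.
keep <- sample.int(nrow(data_full), size = nrow(data_full) %/% 10)
data_stats <- data_full[keep, ]

nrow(data_stats)
```

Repeating the fit on a few independent subsamples gives a quick check of how stable the estimates are, especially at extreme quantiles.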