I did this:
nb <- naiveBayes(users, platform)
pl <- predict(nb,users)
nrow(users) ==> 314781
ncol(users) ==> 109
1. naiveBayes() was quite fast (~20 seconds), while predict() was slow
(tens of minutes). Why?
2. The predict() results were completely off the mark (quite the opposite
of the overfitting I expected). The two tables suffice to show it:
pl:
android blackberry ipad iphone     lg linux mac
      3          5   11     14 312723     5  11
mobile nokia samsung symbian unknown windows
  1864    17      16     112       0       0
platform:
android blackberry ipad iphone lg linux   mac
  18013       1221 2647   1328  4  2936 34336
mobile nokia samsung symbian unknown windows
    18    88      39     103    2660  251388
i.e., nb classified nearly everything as "lg" while in the actual data
"lg" is virtually nonexistent.
3. When I print "nb", I see "A-priori probabilities" (which are what I
expected) and "Conditional probabilities", which are confusing because
each table has only two columns, e.g.:
android    0.048464998 0.43946764
blackberry 0.001638002 0.04045564
ipad       0.322251606 1.84940588
iphone     0.030873494 0.23250250
lg         0.000000000 0.00000000
linux      0.023501362 0.34698919
mac        0.082653774 1.22535027
mobile     0.000000000 0.00000000
nokia      0.000000000 0.00000000
samsung    0.000000000 0.00000000
symbian    0.000000000 0.00000000
unknown    0.003759398 0.08219078
windows    0.021158528 0.32916970
The predictors are integers.
Is the first column for predictors equal to 0 and the second for all non-zero values?
Is there a way to ask naiveBayes to differentiate between non-zero values?
Thanks!
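One possible reading of the printout above, offered as a hedged sketch rather than a diagnosis: the e1071 documentation says that for a numeric predictor naiveBayes() stores the per-class mean and standard deviation, so the two columns would be Gaussian parameters, not probabilities for 0 and non-0. Under that reading the "lg" row (mean 0, sd 0) is a degenerate Gaussian. The snippet assumes such a zero sd gets replaced by a small positive value (sd_floor = 0.001 is an illustrative assumption, not something taken from this thread) and compares the density at x = 0 with that of "windows", using the numbers from the tables above.

```r
# Per-class Gaussian parameters for one predictor, copied from the printout.
mean_win <- 0.021158528; sd_win <- 0.32916970   # "windows" row
mean_lg  <- 0;           sd_lg  <- 0            # "lg" row (degenerate)

# Assumption: a zero sd is floored to a small positive value before use.
sd_floor <- 0.001
sd_lg_used <- max(sd_lg, sd_floor)

# Log-density of observing x = 0 under each class.
x <- 0
logd_win <- dnorm(x, mean_win, sd_win, log = TRUE)
logd_lg  <- dnorm(x, mean_lg, sd_lg_used, log = TRUE)
log_gain <- logd_lg - logd_win   # advantage of "lg" per such predictor

# Log-prior handicap of "lg" against "windows", from the platform table.
prior_gap <- log(251388 / 4)

# Number of predictors like this one needed for "lg" to overtake "windows".
n_needed <- ceiling(prior_gap / log_gain)
print(c(log_gain = log_gain, prior_gap = prior_gap, n_needed = n_needed))
```

If many of the 109 predictors have zero variance within "lg" and are 0 for most users, a handful of them would be enough to outweigh the prior, which would be consistent with nearly everything being labelled "lg".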
When I tried to run svm on the same data frame, memory usage as reported
by top(1) doubled to 4GB almost right away and the function never
returned (has been running for ~15 hours now). ^C does not stop it.
This is most unusual; libsvm has always seemed very fast.
This is R version 2.13.1 (2011-07-08) (as distributed with Ubuntu).
* Sam Steingold <fqf at tah.bet> [2012-02-09 21:43:30 -0500]:
* Sam Steingold <fqf at tah.bet> [2012-02-10 10:01:54 -0500]:
Looks like it _is_ libsvm. Two backtraces from the running process:
#0 0x00007ffff2aedc64 in Solver::select_working_set (this=0x7fffffff97f0, out_i=@0x7fffffff95a0, out_j=@0x7fffffff95b0) at svm.cpp:852
#1 0x00007ffff2aef91d in Solver::Solve (this=0x7fffffff97f0, l=285724, Q=..., p_=<optimized out>, y_=<optimized out>, alpha_=0x6023fb60, Cp=1,
Cn=1, eps=<optimized out>, si=0x7fffffff9980, shrinking=1) at svm.cpp:573
#2 0x00007ffff2af1747 in solve_c_svc (Cn=1, Cp=1, si=0x7fffffff9980, alpha=0x6023fb60, param=<optimized out>, prob=0x7fffffff9c30) at svm.cpp:1444
#3 svm_train_one (prob=0x7fffffff9c30, param=<optimized out>, Cp=1, Cn=1) at svm.cpp:1641
#4 0x00007ffff2af4a8e in svm_train (prob=<optimized out>, param=0x7fffffff9d40) at svm.cpp:2179
#5 0x00007ffff2aea281 in svmtrain (x=0x7fff7e698038, r=0x11c9b1e0, c=<optimized out>, y=<optimized out>, rowindex=<optimized out>,
colindex=<optimized out>, svm_type=0x11c9b2a0, kernel_type=0x11c9b2d0, degree=0x11c9b300, gamma=0x356e3a28, coef0=0x356e3a60, cost=0x356e3ad0,
nu=0x103589a8, weightlabels=0x0, weights=0x0, nweights=0x11c9b330, cache=0x103589e0, tolerance=0x10358a18, epsilon=0x10358a50,
shrinking=0x11c9b360, cross=0x11c9b390, sparse=0x11c9b3c0, probability=0x1524dbb0, seed=0x1524dbe0, nclasses=0x1524dc10, nr=0x1524dc40,
index=0x148a0fa8, labels=0xa3303b8, nSV=0xa330420, rho=0x170083e8, coefs=0x391dbb48, sigma=0x10358a88, probA=0xdf94678, probB=0xcbb7eb8,
cresults=0x0, ctotal1=0x10358ac0, ctotal2=0x10358af8, error=0x10358b30) at Rsvm.c:275
#6 0x00007ffff792cefc in ?? () from /usr/lib/R/lib/libR.so
#7 0x00007ffff795da1d in Rf_eval () from /usr/lib/R/lib/libR.so
#8 0x00007ffff795f540 in ?? () from /usr/lib/R/lib/libR.so
#9 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#10 0x00007ffff795f6c9 in ?? () from /usr/lib/R/lib/libR.so
#11 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#12 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#13 0x00007ffff79ad784 in Rf_usemethod () from /usr/lib/R/lib/libR.so
#14 0x00007ffff79ada47 in ?? () from /usr/lib/R/lib/libR.so
#15 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#16 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#17 0x00007ffff795d6e0 in Rf_eval () from /usr/lib/R/lib/libR.so
#18 0x00007ffff795f540 in ?? () from /usr/lib/R/lib/libR.so
#19 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#20 0x00007ffff795db9b in ?? () from /usr/lib/R/lib/libR.so
#21 0x00007ffff795dad9 in Rf_eval () from /usr/lib/R/lib/libR.so
#22 0x00007ffff795f6c9 in ?? () from /usr/lib/R/lib/libR.so
#23 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#24 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#25 0x00007ffff795d6e0 in Rf_eval () from /usr/lib/R/lib/libR.so
#26 0x00007ffff7998055 in Rf_ReplIteration () from /usr/lib/R/lib/libR.so
#27 0x00007ffff79982e0 in ?? () from /usr/lib/R/lib/libR.so
#28 0x00007ffff7998370 in run_Rmainloop () from /usr/lib/R/lib/libR.so
#29 0x000000000040078b in main ()
#30 0x00007ffff72d930d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#31 0x00000000004007bd in _start ()
#0 0x00007ffff2aeeb67 in Kernel::dot (px=0x48eeb220, py=0x4b21890) at svm.cpp:295
#1 0x00007ffff2af7a25 in Kernel::kernel_rbf (this=<optimized out>, i=<optimized out>, j=<optimized out>) at svm.cpp:239
#2 0x00007ffff2af782c in SVC_Q::get_Q (this=0x7fffffff9870, i=187701, len=208039) at svm.cpp:1271
#3 0x00007ffff2aef9ab in Solver::Solve (this=0x7fffffff97f0, l=285724, Q=..., p_=<optimized out>, y_=<optimized out>, alpha_=0x6023fb60, Cp=1,
Cn=1, eps=<optimized out>, si=0x7fffffff9980, shrinking=1) at svm.cpp:591
#4 0x00007ffff2af1747 in solve_c_svc (Cn=1, Cp=1, si=0x7fffffff9980, alpha=0x6023fb60, param=<optimized out>, prob=0x7fffffff9c30) at svm.cpp:1444
#5 svm_train_one (prob=0x7fffffff9c30, param=<optimized out>, Cp=1, Cn=1) at svm.cpp:1641
#6 0x00007ffff2af4a8e in svm_train (prob=<optimized out>, param=0x7fffffff9d40) at svm.cpp:2179
#7 0x00007ffff2aea281 in svmtrain (x=0x7fff7e698038, r=0x11c9b1e0, c=<optimized out>, y=<optimized out>, rowindex=<optimized out>,
colindex=<optimized out>, svm_type=0x11c9b2a0, kernel_type=0x11c9b2d0, degree=0x11c9b300, gamma=0x356e3a28, coef0=0x356e3a60, cost=0x356e3ad0,
nu=0x103589a8, weightlabels=0x0, weights=0x0, nweights=0x11c9b330, cache=0x103589e0, tolerance=0x10358a18, epsilon=0x10358a50,
shrinking=0x11c9b360, cross=0x11c9b390, sparse=0x11c9b3c0, probability=0x1524dbb0, seed=0x1524dbe0, nclasses=0x1524dc10, nr=0x1524dc40,
index=0x148a0fa8, labels=0xa3303b8, nSV=0xa330420, rho=0x170083e8, coefs=0x391dbb48, sigma=0x10358a88, probA=0xdf94678, probB=0xcbb7eb8,
cresults=0x0, ctotal1=0x10358ac0, ctotal2=0x10358af8, error=0x10358b30) at Rsvm.c:275
#8 0x00007ffff792cefc in ?? () from /usr/lib/R/lib/libR.so
#9 0x00007ffff795da1d in Rf_eval () from /usr/lib/R/lib/libR.so
#10 0x00007ffff795f540 in ?? () from /usr/lib/R/lib/libR.so
#11 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#12 0x00007ffff795f6c9 in ?? () from /usr/lib/R/lib/libR.so
#13 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#14 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#15 0x00007ffff79ad784 in Rf_usemethod () from /usr/lib/R/lib/libR.so
#16 0x00007ffff79ada47 in ?? () from /usr/lib/R/lib/libR.so
#17 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#18 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#19 0x00007ffff795d6e0 in Rf_eval () from /usr/lib/R/lib/libR.so
#20 0x00007ffff795f540 in ?? () from /usr/lib/R/lib/libR.so
#21 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#22 0x00007ffff795db9b in ?? () from /usr/lib/R/lib/libR.so
#23 0x00007ffff795dad9 in Rf_eval () from /usr/lib/R/lib/libR.so
#24 0x00007ffff795f6c9 in ?? () from /usr/lib/R/lib/libR.so
#25 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#26 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#27 0x00007ffff795d6e0 in Rf_eval () from /usr/lib/R/lib/libR.so
#28 0x00007ffff7998055 in Rf_ReplIteration () from /usr/lib/R/lib/libR.so
#29 0x00007ffff79982e0 in ?? () from /usr/lib/R/lib/libR.so
#30 0x00007ffff7998370 in run_Rmainloop () from /usr/lib/R/lib/libR.so
#31 0x000000000040078b in main ()
#32 0x00007ffff72d930d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#33 0x00000000004007bd in _start ()
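A rough sketch of why this run may take so long, under stated assumptions. The backtraces show Kernel::kernel_rbf and l=285724, and 285724 is exactly the "mac" plus "windows" count from the platform table, which is consistent with libsvm training one binary classifier per pair of classes. A kernel SVM of that size cannot hold its full kernel matrix in memory, so kernel values have to be recomputed through a cache. The class counts below are copied from the original post; the simulated `labels` vector and the per-class `cap` are hypothetical stand-ins used only to illustrate subsampling before training.

```r
# Class counts copied from the platform table in the original post.
counts <- c(android = 18013, blackberry = 1221, ipad = 2647, iphone = 1328,
            lg = 4, linux = 2936, mac = 34336, mobile = 18, nokia = 88,
            samsung = 39, symbian = 103, unknown = 2660, windows = 251388)

# Size of the largest pairwise subproblem (the two biggest classes).
l <- sum(sort(counts, decreasing = TRUE)[1:2])
print(l)

# Memory a full l x l kernel matrix of doubles would need, in GiB.
kernel_gib <- l^2 * 8 / 2^30
print(round(kernel_gib))

# Hypothetical workaround: cap each class at `cap` rows before training.
# `labels` is a small simulated stand-in for the poster's `platform` vector.
set.seed(1)
labels <- factor(sample(names(counts), 5000, replace = TRUE,
                        prob = counts / sum(counts)))
cap <- 200
keep <- unlist(lapply(
  split(seq_along(labels), labels),
  function(idx) idx[sample.int(length(idx), min(cap, length(idx)))]
))
print(table(labels[keep]))
```

The row indices in `keep` could then be used to subset both the predictors and the labels before calling svm(); whether a capped sample or a linear kernel is acceptable depends on the data, which is not available here.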
We don't have the data, but my guess is that some of the variables that
were integers when you ran the code below should really be factors.
Uwe Ligges
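To make the suggestion about factors concrete, here is a minimal sketch on made-up data (the data frame `toy` and its columns are hypothetical, not the poster's `users`). According to the e1071 documentation, naiveBayes() summarises a numeric predictor by its per-class mean and standard deviation, while a factor predictor gets one conditional probability per level. The base-R part reproduces both summaries by hand; the e1071 calls are guarded so the snippet also runs where the package is not installed.

```r
# Toy data: hypothetical stand-in for one integer predictor and the class label.
toy <- data.frame(
  clicks   = c(0L, 0L, 1L, 3L, 0L, 2L, 2L, 5L),
  platform = factor(c("mac", "mac", "mac", "mac",
                      "windows", "windows", "windows", "windows"))
)

# Numeric predictor: per-class mean and standard deviation (two columns).
gauss_tab <- cbind(
  mean = tapply(toy$clicks, toy$platform, mean),
  sd   = tapply(toy$clicks, toy$platform, sd)
)
print(gauss_tab)

# Same predictor as a factor: one conditional probability per observed level.
clicks_f <- factor(toy$clicks)
level_tab <- prop.table(table(toy$platform, clicks_f), margin = 1)
print(level_tab)

# With e1071 available, the fitted tables should have the same shapes.
if (requireNamespace("e1071", quietly = TRUE)) {
  nb_num <- e1071::naiveBayes(toy["clicks"], toy$platform)
  print(nb_num$tables$clicks)   # classes x 2 (mean, sd)
  nb_fac <- e1071::naiveBayes(data.frame(clicks = clicks_f), toy$platform)
  print(nb_fac$tables$clicks)   # classes x levels
}
```

For a whole data frame, something like `users[] <- lapply(users, factor)` would convert every column, though whether each of the 109 columns is really categorical is something only the poster can judge.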
On 10.02.2012 03:43, Sam Steingold wrote:
I did this:
nb<- naiveBayes(users, platform)
pl<- predict(nb,users)