naiveBayes: slow predict, weird results - R-help

Sam Steingold

Thu, Feb 9, 2012 6:43 PM

I did this:
library(e1071)  # provides naiveBayes() and svm()
nb <- naiveBayes(users, platform)
pl <- predict(nb, users)
nrow(users)  # 314781
ncol(users)  # 109

1. naiveBayes() was quite fast (~20 seconds), while predict() was slow
(tens of minutes). Why?

2. The predict() results were completely off the mark (quite the opposite
of the overfitting I expected). The tables show it:

pl:

   android blackberry       ipad     iphone         lg      linux        mac 
         3          5         11         14     312723          5         11 
    mobile      nokia    samsung    symbian    unknown    windows 
      1864         17         16        112          0          0 

platform:
   android blackberry       ipad     iphone         lg      linux        mac 
     18013       1221       2647       1328          4       2936      34336 
    mobile      nokia    samsung    symbian    unknown    windows 
        18         88         39        103       2660     251388 

i.e., nb classified nearly everything as "lg" while in the actual data
"lg" is virtually nonexistent.

3. when I print "nb", I see "A-priori probabilities" (which are what I
expected) and "Conditional probabilities" which are confusing because
there are only two of them, e.g.:

             android    0.048464998 0.43946764
             blackberry 0.001638002 0.04045564
             ipad       0.322251606 1.84940588
             iphone     0.030873494 0.23250250
             lg         0.000000000 0.00000000
             linux      0.023501362 0.34698919
             mac        0.082653774 1.22535027
             mobile     0.000000000 0.00000000
             nokia      0.000000000 0.00000000
             samsung    0.000000000 0.00000000
             symbian    0.000000000 0.00000000
             unknown    0.003759398 0.08219078
             windows    0.021158528 0.32916970

The predictors are integers.
Is the first column for predictors equal to 0 and the second for all non-0 values?
Is there a way to ask naiveBayes to differentiate between non-0 values?

thanks!

Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://ffii.org http://www.PetitionOnline.com/tap12009/
http://mideasttruth.com http://iris.org.il http://openvotingconsortium.org
The program isn't debugged until the last user is dead.
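
A note on question 3. For numeric predictors, e1071's naiveBayes() fits a Gaussian per class, and the two columns printed for each predictor are the class-wise mean and standard deviation, not probabilities for 0 and non-0 values. The sketch below uses base R only and made-up numbers (the vectors `x_ipad` and `x_lg` are hypothetical) to show what such a table holds and why a class whose standard deviation is 0 can behave oddly: `dnorm()` with `sd = 0` is a point mass. Whether this is what produced the "lg" predictions in the thread is an assumption; the thread does not confirm it.

```r
# Hypothetical integer predictor values for two classes (made-up data).
x_ipad <- c(0L, 0L, 1L, 0L, 3L, 0L)   # varies within the class
x_lg   <- c(0L, 0L, 0L, 0L)           # constant within the class

# Class-wise mean and standard deviation: the two columns a Gaussian
# naive Bayes table holds for a numeric predictor.
gauss_table <- rbind(
  ipad = c(mean = mean(x_ipad), sd = sd(x_ipad)),
  lg   = c(mean = mean(x_lg),   sd = sd(x_lg))
)
print(gauss_table)

# Gaussian density of a new observation x = 0 under each class.
dens_at_0 <- dnorm(0, mean = gauss_table[, "mean"], sd = gauss_table[, "sd"])
# Gaussian density of a new observation x = 1 under each class.
dens_at_1 <- dnorm(1, mean = gauss_table[, "mean"], sd = gauss_table[, "sd"])

print(dens_at_0)  # the sd = 0 class gets an infinite density at its mean
print(dens_at_1)  # and a zero density anywhere else
```

In this toy case a single predictor equal to 0 is enough to make the constant class win, whatever its prior.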

Sam Steingold

Fri, Feb 10, 2012 7:01 AM

When I tried to run svm() on the same data frame, memory usage as reported
by top(1) doubled to 4 GB almost right away and the function never
returned (it has been running for ~15 hours now). ^C does not stop it.
This is most unusual; libsvm has always seemed very fast.

This is R version 2.13.1 (2011-07-08) (as distributed with Ubuntu).

Sam Steingold

Fri, Feb 10, 2012 8:00 AM

Looks like it _is_ libsvm. Two backtraces from the running process:

#0  0x00007ffff2aedc64 in Solver::select_working_set (this=0x7fffffff97f0, out_i=@0x7fffffff95a0, out_j=@0x7fffffff95b0) at svm.cpp:852
#1  0x00007ffff2aef91d in Solver::Solve (this=0x7fffffff97f0, l=285724, Q=..., p_=<optimized out>, y_=<optimized out>, alpha_=0x6023fb60, Cp=1, 
    Cn=1, eps=<optimized out>, si=0x7fffffff9980, shrinking=1) at svm.cpp:573
#2  0x00007ffff2af1747 in solve_c_svc (Cn=1, Cp=1, si=0x7fffffff9980, alpha=0x6023fb60, param=<optimized out>, prob=0x7fffffff9c30) at svm.cpp:1444
#3  svm_train_one (prob=0x7fffffff9c30, param=<optimized out>, Cp=1, Cn=1) at svm.cpp:1641
#4  0x00007ffff2af4a8e in svm_train (prob=<optimized out>, param=0x7fffffff9d40) at svm.cpp:2179
#5  0x00007ffff2aea281 in svmtrain (x=0x7fff7e698038, r=0x11c9b1e0, c=<optimized out>, y=<optimized out>, rowindex=<optimized out>, 
    colindex=<optimized out>, svm_type=0x11c9b2a0, kernel_type=0x11c9b2d0, degree=0x11c9b300, gamma=0x356e3a28, coef0=0x356e3a60, cost=0x356e3ad0, 
    nu=0x103589a8, weightlabels=0x0, weights=0x0, nweights=0x11c9b330, cache=0x103589e0, tolerance=0x10358a18, epsilon=0x10358a50, 
    shrinking=0x11c9b360, cross=0x11c9b390, sparse=0x11c9b3c0, probability=0x1524dbb0, seed=0x1524dbe0, nclasses=0x1524dc10, nr=0x1524dc40, 
    index=0x148a0fa8, labels=0xa3303b8, nSV=0xa330420, rho=0x170083e8, coefs=0x391dbb48, sigma=0x10358a88, probA=0xdf94678, probB=0xcbb7eb8, 
    cresults=0x0, ctotal1=0x10358ac0, ctotal2=0x10358af8, error=0x10358b30) at Rsvm.c:275
#6  0x00007ffff792cefc in ?? () from /usr/lib/R/lib/libR.so
#7  0x00007ffff795da1d in Rf_eval () from /usr/lib/R/lib/libR.so
#8  0x00007ffff795f540 in ?? () from /usr/lib/R/lib/libR.so
#9  0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#10 0x00007ffff795f6c9 in ?? () from /usr/lib/R/lib/libR.so
#11 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#12 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#13 0x00007ffff79ad784 in Rf_usemethod () from /usr/lib/R/lib/libR.so
#14 0x00007ffff79ada47 in ?? () from /usr/lib/R/lib/libR.so
#15 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#16 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#17 0x00007ffff795d6e0 in Rf_eval () from /usr/lib/R/lib/libR.so
#18 0x00007ffff795f540 in ?? () from /usr/lib/R/lib/libR.so
#19 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#20 0x00007ffff795db9b in ?? () from /usr/lib/R/lib/libR.so
#21 0x00007ffff795dad9 in Rf_eval () from /usr/lib/R/lib/libR.so
#22 0x00007ffff795f6c9 in ?? () from /usr/lib/R/lib/libR.so
#23 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#24 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#25 0x00007ffff795d6e0 in Rf_eval () from /usr/lib/R/lib/libR.so
#26 0x00007ffff7998055 in Rf_ReplIteration () from /usr/lib/R/lib/libR.so
#27 0x00007ffff79982e0 in ?? () from /usr/lib/R/lib/libR.so
#28 0x00007ffff7998370 in run_Rmainloop () from /usr/lib/R/lib/libR.so
#29 0x000000000040078b in main ()
#30 0x00007ffff72d930d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#31 0x00000000004007bd in _start ()


#0  0x00007ffff2aeeb67 in Kernel::dot (px=0x48eeb220, py=0x4b21890) at svm.cpp:295
#1  0x00007ffff2af7a25 in Kernel::kernel_rbf (this=<optimized out>, i=<optimized out>, j=<optimized out>) at svm.cpp:239
#2  0x00007ffff2af782c in SVC_Q::get_Q (this=0x7fffffff9870, i=187701, len=208039) at svm.cpp:1271
#3  0x00007ffff2aef9ab in Solver::Solve (this=0x7fffffff97f0, l=285724, Q=..., p_=<optimized out>, y_=<optimized out>, alpha_=0x6023fb60, Cp=1,
    Cn=1, eps=<optimized out>, si=0x7fffffff9980, shrinking=1) at svm.cpp:591
#4  0x00007ffff2af1747 in solve_c_svc (Cn=1, Cp=1, si=0x7fffffff9980, alpha=0x6023fb60, param=<optimized out>, prob=0x7fffffff9c30) at svm.cpp:1444
#5  svm_train_one (prob=0x7fffffff9c30, param=<optimized out>, Cp=1, Cn=1) at svm.cpp:1641
#6  0x00007ffff2af4a8e in svm_train (prob=<optimized out>, param=0x7fffffff9d40) at svm.cpp:2179
#7  0x00007ffff2aea281 in svmtrain (x=0x7fff7e698038, r=0x11c9b1e0, c=<optimized out>, y=<optimized out>, rowindex=<optimized out>,
    colindex=<optimized out>, svm_type=0x11c9b2a0, kernel_type=0x11c9b2d0, degree=0x11c9b300, gamma=0x356e3a28, coef0=0x356e3a60, cost=0x356e3ad0,
    nu=0x103589a8, weightlabels=0x0, weights=0x0, nweights=0x11c9b330, cache=0x103589e0, tolerance=0x10358a18, epsilon=0x10358a50,
    shrinking=0x11c9b360, cross=0x11c9b390, sparse=0x11c9b3c0, probability=0x1524dbb0, seed=0x1524dbe0, nclasses=0x1524dc10, nr=0x1524dc40,
    index=0x148a0fa8, labels=0xa3303b8, nSV=0xa330420, rho=0x170083e8, coefs=0x391dbb48, sigma=0x10358a88, probA=0xdf94678, probB=0xcbb7eb8,
    cresults=0x0, ctotal1=0x10358ac0, ctotal2=0x10358af8, error=0x10358b30) at Rsvm.c:275
#8  0x00007ffff792cefc in ?? () from /usr/lib/R/lib/libR.so
#9  0x00007ffff795da1d in Rf_eval () from /usr/lib/R/lib/libR.so
#10 0x00007ffff795f540 in ?? () from /usr/lib/R/lib/libR.so
#11 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#12 0x00007ffff795f6c9 in ?? () from /usr/lib/R/lib/libR.so
#13 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#14 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#15 0x00007ffff79ad784 in Rf_usemethod () from /usr/lib/R/lib/libR.so
#16 0x00007ffff79ada47 in ?? () from /usr/lib/R/lib/libR.so
#17 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#18 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#19 0x00007ffff795d6e0 in Rf_eval () from /usr/lib/R/lib/libR.so
#20 0x00007ffff795f540 in ?? () from /usr/lib/R/lib/libR.so
#21 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#22 0x00007ffff795db9b in ?? () from /usr/lib/R/lib/libR.so
#23 0x00007ffff795dad9 in Rf_eval () from /usr/lib/R/lib/libR.so
#24 0x00007ffff795f6c9 in ?? () from /usr/lib/R/lib/libR.so
#25 0x00007ffff795d7ff in Rf_eval () from /usr/lib/R/lib/libR.so
#26 0x00007ffff7960a7f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#27 0x00007ffff795d6e0 in Rf_eval () from /usr/lib/R/lib/libR.so
#28 0x00007ffff7998055 in Rf_ReplIteration () from /usr/lib/R/lib/libR.so
#29 0x00007ffff79982e0 in ?? () from /usr/lib/R/lib/libR.so
#30 0x00007ffff7998370 in run_Rmainloop () from /usr/lib/R/lib/libR.so
#31 0x000000000040078b in main ()
#32 0x00007ffff72d930d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#33 0x00000000004007bd in _start ()
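
The frames above show libsvm's solver working on `l=285724` training rows with an RBF kernel (`Kernel::kernel_rbf`). A rough back-of-the-envelope sketch of why that can take so long: the full kernel matrix has l^2 entries, far more than fits in memory, so kernel rows must be recomputed whenever they fall out of the cache. The 4-byte entry size (libsvm keeps kernel values as single-precision floats) and the 40 MB cache (the default `cachesize` of e1071's svm()) are assumptions about the setup, not something stated in the thread.

```r
# Number of training rows, taken from `l=285724` in the backtraces.
l <- 285724

# Assumptions (not stated in the thread):
bytes_per_entry <- 4    # single-precision kernel values
cache_mb        <- 40   # kernel cache size in MB

# Size of the full l x l kernel matrix.
n_entries <- l^2
full_gib  <- n_entries * bytes_per_entry / 2^30

# How many complete kernel rows fit in the cache at once.
row_bytes     <- l * bytes_per_entry
rows_in_cache <- floor(cache_mb * 2^20 / row_bytes)

cat(sprintf("kernel entries:     %.3g\n", n_entries))
cat(sprintf("full matrix size:   %.0f GiB\n", full_gib))
cat(sprintf("rows held in cache: %.0f of %.0f\n", rows_in_cache, l))
```

With only a few dozen rows cached, most kernel rows are recomputed on demand. Trying svm() on a random subsample first, or with a larger `cachesize`, would be a cheap way to check whether this is the bottleneck.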

Uwe Ligges

Sat, Feb 11, 2012 7:51 AM

We don't have the data, but my guess is that some of the predictors
should have been factors and were still integers when you ran the code
you posted.

Uwe Ligges
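
Following that suggestion, a minimal sketch of converting integer columns to factors before fitting. The data frame `users_demo` and its column names are made up for illustration, and the rule "treat every integer column as categorical" is an assumption; for columns with many distinct values it may not be what you want.

```r
# Made-up stand-in for the `users` data frame from the thread.
users_demo <- data.frame(
  visits  = c(0L, 2L, 0L, 1L, 2L),
  clicks  = c(1L, 0L, 0L, 1L, 0L),
  seconds = c(3.5, 10.2, 0.0, 7.1, 4.4)  # genuinely numeric, left alone
)

# Find the integer columns and turn each into a factor, so that a
# classifier sees them as categorical levels rather than numbers.
int_cols <- vapply(users_demo, is.integer, logical(1))
users_demo[int_cols] <- lapply(users_demo[int_cols], factor)

str(users_demo)
```

With factor predictors, naiveBayes() tabulates one conditional probability per level, so the new data passed to predict() should carry the same factor levels.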
