caretNWS and training data set sizes

4 messages · Max Kuhn, Tait, Peter

#
Hi,

I am using the caretNWS package to train some supervised regression models (gbm, lasso, random forest, and MARS). The problem started when my training data set grew in both the number of predictors and the number of observations.

The training data set has 347 numeric columns. The problem appears once there are more than 2500 observations: the 5 sleigh objects start but do not use any CPU and do not process any data.

N = 100                    CPU (%)    Memory (K)
Rgui.exe                     0          91737
5x sleighs (RTerm.exe)      15-25      ~27000

N = 2500                   CPU (%)    Memory (K)
Rgui.exe                     0         160000
5x sleighs (RTerm.exe)      15-25      ~74000

N = 5000                   CPU (%)    Memory (K)
Rgui.exe                    50         193000
5x sleighs (RTerm.exe)       0         ~19000


A 10% sample of my overall data is ~22000 observations.

Can someone give me an idea of the limits of the nws and caretNWS packages in terms of the number of rows and columns of the training matrices? Are there other tuning/training functions that work faster on large data sets?

Thanks for your help.
Peter
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          6.2
year           2008
month          02
day            08
svn rev        44383
language       R
version.string R version 2.6.2 (2008-02-08)
[1] 2047
#
What versions of caret and caretNWS are you using? Also, what versions
of the nws server and Twisted are you using? What kind of machine
(number of processors, how much physical memory, etc.)?

I haven't seen any real limitations with one exception: if you are
running P jobs on the same machine, you are replicating the memory
needs P times.

I've been running jobs with 4K to 90K samples and 1200 predictors
without issues, so I'll need a lot more information to help you.

Max
On Mon, Mar 10, 2008 at 12:04 PM, Tait, Peter <ptait at skura.com> wrote:

#
Hi Max,
Thank you for the fast response.

Here are the versions of the R packages I am using:

caret 3.13
caretNWS 0.16
nws 1.62

Here are the python versions

Active Python 2.5.1.1
nws server 1.5.2 for py2.5
twisted 2.5.9 py2.5

The computer I am using has one dual-core Xeon CPU at 1.86 GHz and 4 GB of RAM. R is currently set up to use 2 GB of it (it starts with "C:\Program Files\R\R-2.6.2\bin\Rgui.exe" --max-mem-size=2047M). The OS is Windows Server 2003 R2 with SP2.

I am running one R job/process (Rgui.exe) and almost nothing else on the computer while R is running (no databases, web servers, office apps, etc.).

I really appreciate your help.
Cheers
Peter
#
Peter,

You are certainly up to date. Can you try replicating this using only
two nodes (since you only have two cores)? I'm not sure that
specifying 5 workers really helps. Using 2 nodes on my Mac usually gets me
about a 30-40% decrease in time.

Also, are the processes just hanging or is there an error? These
models may take a while. Perhaps testing with pls, lm or some other
fast model might help troubleshoot.
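
For troubleshooting, a quick run with a cheap model might look like the sketch below. This is only an illustration: the argument names are assumed to mirror caret's train() interface, and trainX/trainY stand in for your actual predictor matrix and outcome vector, so adjust to whatever your version of caretNWS exports.

    library(caretNWS)
    ## trainX: the 347-column numeric predictor matrix; trainY: numeric outcome.
    ## "pls" is cheap to fit, so a hang here points at the workers, not the model.
    fit <- trainNWS(trainX, trainY, method = "pls", tuneLength = 3)

If even that small job stalls once N > 2500, the bottleneck is likely in getting the data out to the sleighs rather than in the model fitting itself.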

If you are not passing a sleigh object into the trainNWS call, you can
do this by using

trainNWSControl(
    start = makeSleighStarter(workerCount = 2))
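
Put together, the call might look like the following sketch. How the control object is actually passed to trainNWS depends on your caretNWS version, so treat the argument name (trControl, by analogy with caret's train()) and the trainX/trainY objects as assumptions:

    ## Start two workers to match the two cores on the box.
    ctrl <- trainNWSControl(start = makeSleighStarter(workerCount = 2))
    fit  <- trainNWS(trainX, trainY, method = "gbm", trControl = ctrl)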

The only other thing I can suggest is to send me the data (or an
anonymized knock-off) so that I can test. You certainly should be able
to do this, but you may be limited by your machine.

Max
On Mon, Mar 10, 2008 at 1:18 PM, Tait, Peter <ptait at skura.com> wrote: