I've been trying to get some linear classifiers (LiblineaR, kernlab,
e1071) to work with a sparse matrix of feature data. In the case of
LiblineaR and kernlab, it seems I have to coerce my data into a dense
matrix in order to train a model. I've done a number of searches and
read through the manuals and vignettes, but I can't see how to use
either of these packages with sparse matrices. I've tried using
both csr from SparseM and sparseMatrix from the Matrix library. You
can see a simple example recreating my results below.
Does anybody know if there's a trick to get this to work without
coercing the data into a dense matrix?
I'm currently playing with the KDDCUP 2010 datasets. I've written a
simple script to create hash kernel feature vectors for each of the
rows of training data. Right now I haven't added many features into
the hash vectors. For simplicity, I'm just creating a string token
for each feature, then hashing it and taking that hash mod 10007 and
10009 (so two buckets for each feature with a low likelihood of two
features colliding on both buckets). 10009 columns may seem like
overkill, but I figured if it was a sparse matrix the number of
columns really wouldn't matter that much. Right now I'm also only
playing with 99999 rows of input. Whenever I make the mistake of
doing something that unintentionally coerces the sparse matrix into a
dense one, I end up eating all my RAM, going to swap, and spending
the next 5 minutes trying to kill my session... So I'm looking for
something that scales relatively well without taking up too large a
memory footprint to run.
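For concreteness, here's a rough sketch of the two-bucket hashing scheme I described, using Matrix::sparseMatrix and digest::digest2int for the string hashing. The feature strings are made up for illustration; my real script builds them from the KDDCUP rows.

```r
# Sketch of the hashing trick: each feature string gets two buckets,
# hash mod 10007 and hash mod 10009 (toy feature names, not real data).
library(Matrix)
library(digest)

feats <- list(c("student=A", "step=add"),
              c("student=B", "step=sub"),
              c("student=A", "step=mul"))
n_rows <- length(feats)

i <- integer(0)
j <- integer(0)
for (r in seq_len(n_rows)) {
  for (f in feats[[r]]) {
    h <- digest2int(f)              # hash the string token to an integer
    # two buckets per feature; R's %% is non-negative for positive modulus
    j <- c(j, (h %% 10007) + 1, (h %% 10009) + 1)
    i <- c(i, r, r)
  }
}

# duplicate (i, j) pairs are summed, which is fine for a count-style kernel
X <- sparseMatrix(i = i, j = j, x = 1, dims = c(n_rows, 10009))
```

Since the matrix is stored sparsely, the 10009 columns cost nothing beyond the nonzero entries (two per feature per row).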
Thanks!
Jeff
See below for an example that recreates my basic attempts at using
sparse matrices.
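A minimal sketch of the kind of attempt I mean, on hypothetical toy data rather than the real KDDCUP features. Note that e1071's svm() documents acceptance of SparseM matrix.csr (and Matrix) objects, while LiblineaR (at least the version I have) appears to want a dense matrix:

```r
# Toy reproduction of the sparse-vs-dense training attempts (fake data).
library(Matrix)
library(SparseM)
library(e1071)
library(LiblineaR)

set.seed(1)
X_dense <- matrix(rbinom(100 * 50, 1, 0.05), nrow = 100)  # mostly zeros
y <- factor(rbinom(100, 1, 0.5))

X_dgc <- Matrix(X_dense, sparse = TRUE)   # Matrix-package sparse form
X_csr <- as.matrix.csr(X_dense)           # SparseM form

# e1071::svm accepts a matrix.csr per its docs (scale = FALSE avoids
# densifying the input during scaling)
m1 <- svm(X_csr, y, kernel = "linear", scale = FALSE)

# LiblineaR seems to require a plain dense matrix here
m2 <- LiblineaR(data = X_dense, target = y)
```

Passing X_dgc or X_csr to LiblineaR is where I get stuck; coercing with as.matrix() works but defeats the purpose at the real data's scale.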