Message-ID: <CAKH910-kcZcTdHzfNC8KRru6VWkz5do0AN4HN9HAiks-ck-=rg@mail.gmail.com>
Date: 2011-07-02T17:24:26Z
From: Håvard Wahl Kongsgård
Subject: Handling data with thousands of variables
In-Reply-To: <BANLkTinEV+PP+ZMdDi0_WgAaqUbUYMZkug@mail.gmail.com>

OK, but what about memory usage? For now I have implemented my
analysis in Python with NumPy arrays, with only 100 000 cases and
10 000 keywords, and the memory required for a dense array or matrix
of that size is already massive. In R, one possibility is the
bigmemory package, but it is slow, and if I remember correctly its
big.matrix objects are not supported by most other R packages.
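For a sense of scale, here is a minimal sketch (assuming SciPy is available; the ~0.5% keyword density is an illustrative guess, not a figure from this thread) of what a sparse representation saves over a dense NumPy array for 0/1 keyword data:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Small stand-in for the 100 000 x 10 000 keyword matrix:
# 1 where a keyword applies to a case, 0 otherwise, ~0.5% density.
dense = (rng.random((1_000, 1_000)) < 0.005).astype(np.int8)
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense_bytes, csr_bytes)  # sparse is far smaller at this density

# Summaries still work directly on the sparse form:
keywords_per_case = csr.sum(axis=1)  # per-record keyword counts
```

At the full 100 000 x 10 000 size the dense int8 array alone is ~1 GB (8 GB as float64), while the CSR form scales with the number of nonzeros rather than the number of cells.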

-Håvard


On Fri, Jul 1, 2011 at 10:02 AM, Han De Vries <handevries at gmail.com> wrote:
> Perhaps you want to store your data in a big 10 mln rows x 20000
> columns matrix, where each cell is 1 when the corresponding keyword
> applies to a record, zero otherwise. Because you will end up with very
> many zeroes, such a matrix can be stored as a sparse matrix in an
> efficient way (using the Matrix package). The Matrix package itself
> offers various analytical tools to quickly summarize by rows or
> columns, and many other types of estimations as long as they can be
> translated to matrix operations (like linear regression). Some other
> packages, such as glmnet, can read these matrices directly for more
> specific analyses. If you have sufficient memory (you want to keep the
> entire sparse matrix in memory), handling the data can be really fast.
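[The workflow Han describes for R's Matrix package could be sketched equivalently in Python with scipy.sparse; the record/keyword ids below are made-up illustrative data, not from the thread:]

```python
import numpy as np
from scipy import sparse

# Hypothetical (record, keyword) incidence pairs -- in practice these
# would come from the real data set.
record_ids = np.array([0, 0, 1, 2, 2, 2])
keyword_ids = np.array([3, 7, 7, 0, 3, 9])
ones = np.ones(len(record_ids), dtype=np.int8)

# Build in COO form (natural for triplet input), then convert to CSR
# for fast row slicing and matrix products.
m = sparse.coo_matrix((ones, (record_ids, keyword_ids)),
                      shape=(3, 10)).tocsr()

# Quick summaries by rows and columns, as with the Matrix package:
keywords_per_record = np.asarray(m.sum(axis=1)).ravel()
records_per_keyword = np.asarray(m.sum(axis=0)).ravel()

# Anything expressible as matrix products stays sparse, e.g. the
# normal-equations term X'X that appears in linear regression:
xtx = m.T @ m
```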
>
> Because you're asking about personal experiences: I have been using
> this approach with (sparse) matrices up to a few million rows
> (records) and 20K columns (variables).
>
> Kind regards,
> Han
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>